




Semi-structured Text Mining Methods
Yang Jian, Institute of Computer Science and Technology, Peking University

Text-centric XML ...
- Documents marked up as XML
  - E.g., assembly manuals, journal ...
- Queries are user information needs
  - E.g., give me the Section (element) of the document that tells me how to change a brake light
- Different from well-structured XML queries, where you tightly specify what you're looking for

Vector spaces and XML
- Vector spaces: tried + tested framework for keyword retrieval
  - Other "bag of words" applications in classification, clustering …
- For text-centric XML retrieval, can we make use of vector space ideas?
- Challenge: capture the structure of an XML document in the vector space.

Vector spaces and XML
- For instance, distinguish between the following two ...
[Figure: two document trees; recoverable labels: "The Pearly ...", "Microsoft", "Bill ...", "Bill ..."]

Content-rich XML: ...
[Figure: document tree; recoverable labels: "Microsoft", "The ...", "Lexicon ..."]

Encoding the Gates ...
- What are the axes of the vector space?
- In text retrieval, there would be a single axis for Gates
- Here we must separate out the two occurrences, under Author and Title
- Thus, axes must represent not only terms, but something about their position in an XML tree

Queries
- Before addressing this, let us ... the kinds of queries we want to ...
[Figure: query tree; recoverable label: "Microsoft"]

Query ...
- The preceding examples can be viewed as subtrees of the document
- But what ... (Gates somewhere underneath ...)
- This is harder and we will return to it

Subtrees and ...
- Consider all subtrees of the document that include at least one lexicon term
[Figure: enumerated subtrees; recoverable label: "Microsoft" (repeated)]

Structural terms
- Call each of the resulting (8+, in previous slide) subtrees a structural term
- Note that structural terms might occur multiple times in a document
- Create one axis in the vector space for each distinct structural term
- Weights based on frequencies for number of occurrences (just as we had tf)
- All the usual issues with terms (stemming? case folding?) remain

Example of tf ...
[Figure: document tree; recoverable text: "To be or ... to ..."]
- Exercise: How many axes are there in this ...
- Here the structural terms containing to or be would have more weight than those that ...

Down-weighting
- For the doc on the left: in a structural term rooted at the node Play, shouldn't Hamlet have a higher tf weight ...
- Idea: multiply tf contribution of a term to a node k levels up by g^k, for some g < 1
- Example with g = 0.8
[Figure: document tree; recoverable label: "Hamlet"]
- For the doc on the previous slide, the ... Hamlet is multiplied by ...
- Yorick is multiplied by ... in any structural term rooted at ...

The number of structural terms
- Can be ...
- Alright, how huge, ...
- Impractical to build a vector index with so many ...
- Will examine pragmatic solutions to this shortly; for now, continue to believe …

Structural terms: ...
- The notion of structural terms is independent of any schema/DTD for the XML documents
  - Well-suited to a heterogeneous collection of XML ...
- Each document becomes a vector in the space of structural terms
- A query tree can likewise be factored into structural terms
  - And represented as a ...
  - Allows weighting portions of the ...

Example …
[Figure not recoverable]

Weight ...
- The assignment of the weights 0.6 and 0.4 in the previous example to subtrees was ...
  - Can be more ...
  - Think of it as generated by an application, not necessarily an end-user
- Queries, documents become normalized ...
- Retrieval score computation "just" a matter of cosine similarity computation

Restrict structural terms
- Depending on the application, we may restrict the structural terms
- E.g., may never want to return a Title node, only Book or Play nodes
- So don't enumerate/index/retrieve/score structural terms rooted at some nodes

The catch ...
- This is all very promising, but ...
- How big is this vector ...
  - Can be exponentially large in the size of the ...
  - Cannot hope to build such an ...
- And in any case, still fails to answer queries ...

Two ...
- Query-time materialization of ...
- Restrict the kinds of subtrees to a manageable set

Query-time ...
- Instead of enumerating all structural terms of all docs (and the query), enumerate only for the query
  - The latter is hopefully a small ...
- Now, we're reduced to checking which structural term(s) from the query match a subtree of any ...
- This is tree pattern matching: given a text tree and a pattern tree, find matches
  - Except we have many text ...
  - Our trees are labeled and ...

Text ...
- Here we seek a doc with Hamlet in the ...
- On finding the match we compute the cosine similarity score
- After all matches are found, rank by sorting
[Figure: text tree and query tree; recoverable labels: "Hamlet", "Query", "Hamlet"]

(Still ...
- A doc with Yorick somewhere in ...
- Will get to it ...
[Figure: query tree; recoverable label: "Query"]

Restricting the ...
- Enumerating all structural terms (subtrees) is prohibitive, for indexing
  - Most subtrees may never be used in processing any query
- Can we get away with indexing a restricted class of subtrees?
  - Ideally, focus on subtrees likely to arise in ...

JuruXML (IBM ...
- Only paths including a lexicon term
- In this example there are only 14 (why?) such paths
- Thus we have 14 structural terms in the ...
[Figure: document tree; recoverable text: "Hamlet", "To be or ... to ..."]
- Why is this far more ...
- How big can the index be as a function of the ...

Variations
- Could have used other subtrees, e.g., all subtrees with two siblings under a node
- Which subtrees get used: depends on the likely queries in the ...
- Could be specified at index time: area with little research so far
[Figure: recoverable labels: "Microsoft", "2 ...", "Microsoft"]

Variations
- Why would this be any different from just ...
- Because we preserve more of the structure that a query may ...
[Figure: recoverable label: "Microsoft"]

Return to the descendant ...
- No known ...
- Query seeks Gates under ...

... in the vector ...
- Devise a match function that yields a score in [0,1] between structural terms
- E.g., when the structural terms are paths, measure overlap
- The greater the overlap, the higher the match ...
  - Can adjust match for where the overlap ...

How do we use this in ...
- First enumerate structural terms in the ...
- Measure each for match against the dictionary of structural terms
  - Just like a postings lookup, except not Boolean (does the term exist)
  - Instead, produce a score that says "80% close to this structural term", etc.
- Then, retrieve docs with that structural term, compute cosine similarities, etc.

Example of a retrieval ...
[Figure: index lookup; recoverable labels: "Match ...", "ST = Structural Term"]
- Now rank the Doc's by cosine similarity; e.g., Doc9 scores 0.578.

Closing ...
- But what exactly is a ...
- In a sense, an entire corpus can be viewed as an XML document

What are the Doc's in the ...
- Anything we are prepared to return as an ...
- Could be nodes, some of their children ...

What are queries we can't handle using vector spaces?
- Find figures that describe the Corba architecture and the paragraphs that refer to those figures
  - Requires JOIN between 2 ...
- Retrieve the titles of articles published in the Special Feature section of the journal IEEE Micro
  - Depends on order of sibling ...

Can we do ...
- Yes, but doesn't make sense to do it corpus-wide
- Can do it, for instance, within all text under a certain element name, say Chapter
- Yields a tf-idf weight for each lexicon term under an element
- Issues: how do we propagate contributions to higher level nodes?

Example
- Say Gates has high IDF under the Author ...
- How should it be tf-idf weighted for the Book ...
- Should we use the idf for Gates in Author or that in Book?

SQL for ...
- Usage ...
  - Human-readable ...
  - Data-oriented ...
  - Mixed documents (e.g., patient ...
- Relies ... XML Schema ...
- Turing ...
- XQuery is still a working ...

The principal forms of XQuery expressions
- path ...
- element ...
- FLWR ("flower") ...
- list ...
- datatype ...
- Evaluated with respect to a ...

FLWR

    FOR $p IN document("bib.xml")//publisher
    LET $b := document("bib.xml")//book[publisher = $p]
    WHERE count($b) > 100
    RETURN $p

- FOR generates an ordered list of bindings of publisher names to $p
- LET associates to each binding a further binding of the list of book elements with that publisher to $b
- At this stage, we have an ordered list of tuples of bindings: ...
- WHERE filters that list to retain only the desired ...
- RETURN constructs for each tuple a resulting ...

Queries Supported by ...
- Location/position ("chapter ...
- Simple ...
  - /play/title contains ...
- Path ...
  - title contains ...
  - /play//title contains ...
- Complex ...
  - Employees with two ...
- Subsumes: ...
- What about relevance ...

How XQuery makes ...
- All documents in set A must be ranked above all documents in set B.
- Fragments must be ordered in depth-first, left-to-right order.

XQuery: Order By ...

    for $d in ...
    let $e := document("emps.xml")//emp[deptno = $d]
    where count($e) >= 10
    order by avg($e/salary) descending
    return <big-dept> { $d, ...

XQuery: Order By ...
- Order by clause only allows ordering by ...
  - Say by an attribute ...
- Relevance ...
  - Is often ...
  - Can't be expressed easily as function of set to be ...
  - Is better abstracted out of query formulation (cf. ...

University of ...
- Goal: open source XML search ...
- "Returnable" fragments are ...
  - E.g., don't return a <bold>some text</bold> ...
- Structured Document Retrieval ...
- Empower users who don't know the ...
  - Enable search for any person no matter how schema encodes the data
  - Don't worry about ...

Atomic ...
- Specified in ...
- Only atomic units can be returned as result of search (unless unit specified)
- Tf.idf weighting is applied to atomic ...
- Probabilistic combination of "evidence" from atomic units

XIRQL ...
- A system should always retrieve the most specific part of a document answering a query.
- Example query: ...
- Document: <chapter> 0.3 ... <section> 0.8 XQL 0.7 syntax ...
- Return section, not chapter

Augmentation ...
- Ensure that Structured Document Retrieval Principle is respected.
- Assume different query conditions are disjoint events -> independence.
- Chapter weight for XQL = P(XQL|chapter) + P(section|chapter) * P(XQL|section) - P(XQL|chapter) * P(section|chapter) * P(XQL|section) = 0.3 + 0.6*0.8 - 0.3*0.6*0.8 = 0.636
- Section ranked ahead of chapter (0.8 > 0.636)

Datatypes
- Example: ...
- Assign all elements and attributes with person semantics to this datatype
- Allow user to search for ... without specifying ...

XIRQL: ...
- Relevance ...
- Fragment/context ...
- Datatypes ...
- Semantic ...

XML ...

Native XML ...
- Uses XML document as logical ...
- Should ...
  - PCDATA (parsed character data)
  - Document ...
- Contrast ...
  - DB modified for ...
  - Generic IR system modified for XML

... Indexing and ...
- Most native XML databases have taken a DB ...
  - No IR type relevance ...
- Only a few that focus on relevance ...

Data vs. Text-centric ...
- Data-centric XML: used for messaging between enterprise applications
  - Mainly a recasting of relational ...
- Content-centric XML: used for annotating ...
  - Rich in ...
  - Demands good integration of text retrieval ...
  - E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price

Data structures for XML ...
- A very basic ...

Data structures for XML ...
- What are the primitives we ...
- Inverted index: give me all elements matching text query Q
  - We know how to do this: treat each element as a document
- Give me all elements (immediately) below any instance of the Book element
- Combination of the ...

Parent/child ...
- Number each ...
- Maintain a list of parent-child ...
  - E.g., Chapter:21 ...
- Enables immediate ...
- But what about "the word Hamlet under Scene element under a Play ..."

General positional ...
- View the XML document as a text ...
- Build a positional index for each ...
  - Mark the beginning and end for each element, ...

Positional ...
- droppeth under Verse under Play.
[Figure not recoverable]

Summary of data ...
- Path containment etc. can essentially be solved by positional inverted indexes
- Retrieval consists of "merging" ...
- All the compression tricks etc. from 276A are still ...
- Complications arise from insertion/deletion of elements, text within elements
  - Beyond the scope of this ...

INEX: a benchmark for text-centric XML ...

INEX
- Benchmark for the evaluation of XML ...
- Analog of TREC (recall ...
- Consists ...
  - Set of XML ...
  - Collection of retrieval ...

INEX
- Each engine indexes ...
- Engine team converts retrieval tasks into ...
  - In XML query language understood by ...
- In response, the engine retrieves not docs, but elements within docs
  - Engine ranks retrieved ...

INEX ...
- For each query, each retrieved element is human-assessed on two measures:
  - Relevance: how relevant is the retrieved ...
  - Coverage: is the retrieved element too specific, too general, or just right
    - E.g., if the query seeks a definition of the Fast Fourier Transform, do I get the equation (too specific), the chapter containing the definition (too general) or the definition itself
- These assessments are turned into composite precision/recall measures

INEX ...
- 12,107 articles from IEEE Computer Society ...
- 494 ...
- Average article: 1,532 XML ...
- Average node depth = ...

INEX ...
- Each topic is an information need, one of two kinds:
  - Content Only (CO): free text ...
  - Content and Structure (CAS): structural constraints, e.g., containment ...

Sample INEX CO ...

    <Title> computational biology
    <Keywords> computational biology, bioinformatics, genome, genomics, proteomics, sequencing, protein folding
    <Description> Challenges that arise, and approaches being explored, in the interdisciplinary field of computational ...
    <Narrative> To be relevant, a document/component must either ...
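The structural-term idea from the vector-space slides, restricted to paths that end in a lexicon term (the restriction the JuruXML slide describes), can be sketched as follows. This is a minimal illustration, not the deck's or JuruXML's implementation: the two Book documents are hypothetical (the author name "Wulf" in particular is made up), weights are raw term frequencies with no idf or normalization tricks, and only the text directly inside an element is counted.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

def path_terms(xml_text):
    """Count structural terms of the form (root-to-element path, word)."""
    counts = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        # Each word occurrence under this element is one structural term.
        for word in (node.text or "").lower().split():
            counts[(path, word)] += 1
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse tf vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents: "Gates" appears under Author in one, under Title in the other.
doc1 = path_terms("<Book><Title>Microsoft</Title><Author>Bill Gates</Author></Book>")
doc2 = path_terms("<Book><Title>The Pearly Gates</Title><Author>Bill Wulf</Author></Book>")
query = path_terms("<Book><Author>Gates</Author></Book>")

score1 = cosine(query, doc1)
score2 = cosine(query, doc2)
```

Because the axis is the pair (path, word) rather than the word alone, the query for Gates as an author matches only the first document; a plain bag-of-words vector would match both.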
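The down-weighting idea (scale a term's tf contribution by g^k when the term sits k levels below the root of the structural term) can be illustrated with a short sketch. The Play tree below is hypothetical, since the deck's figure did not survive extraction, so the depths of Hamlet and Yorick here are assumptions rather than the slide's values.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def decayed_tf(root, g=0.8):
    """tf of each word below `root`, scaled by g**k for text k levels below root."""
    weights = Counter()

    def walk(node, k):
        for word in (node.text or "").lower().split():
            weights[word] += g ** k
        for child in node:
            walk(child, k + 1)

    walk(root, 0)
    return weights

# Hypothetical tree: Hamlet is one level below Play, Yorick two levels below.
play = ET.fromstring(
    "<Play><Title>Hamlet</Title><Act><Scene>Alas poor Yorick</Scene></Act></Play>"
)
weights = decayed_tf(play, g=0.8)
```

With this tree, the occurrence of Hamlet is scaled by 0.8 and that of Yorick by 0.8 squared, so the term nearer the Play node carries more weight.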
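The arithmetic on the augmentation slide (0.3 + 0.6*0.8 - 0.3*0.6*0.8 = 0.636) is a probabilistic OR of two independent pieces of evidence: the chapter's own weight for the term, and the section's weight scaled by an augmentation factor. A minimal sketch, with the function name `augment` and its parameter names chosen here for illustration:

```python
def augment(p_parent, p_child, aug_weight):
    """Combine a parent's own evidence with a child's down-weighted evidence.

    Treats the two as independent events and returns P(A or B).
    """
    propagated = aug_weight * p_child
    return p_parent + propagated - p_parent * propagated

# Values from the slide: chapter weight 0.3, section weight 0.8, augmentation 0.6.
chapter_score = augment(p_parent=0.3, p_child=0.8, aug_weight=0.6)
section_score = 0.8
```

Because the augmentation factor is below 1, the chapter's combined score stays under the section's own weight, which is what keeps the more specific element ranked first.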
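The positional approach from the data-structures slides (mark the beginning and end of each element, then answer containment queries by comparing positions) can be sketched as below. This is an illustration under assumptions: the numbering scheme, the function names, and the sample Play document are all hypothetical, and a real engine would merge sorted postings lists rather than scan Python lists.

```python
import xml.etree.ElementTree as ET

def build_positional_index(xml_text):
    """Return (word -> positions, tag -> [(begin, end)]) using one shared counter."""
    words, spans = {}, {}
    pos = 0

    def add_words(text):
        nonlocal pos
        for w in (text or "").lower().split():
            words.setdefault(w, []).append(pos)
            pos += 1

    def walk(node):
        nonlocal pos
        begin = pos          # position of the opening tag
        pos += 1
        add_words(node.text)
        for child in node:
            walk(child)
            add_words(child.tail)
        spans.setdefault(node.tag, []).append((begin, pos))
        pos += 1             # position of the closing tag

    walk(ET.fromstring(xml_text))
    return words, spans

def contained(word, path, words, spans):
    """True if `word` occurs under path[0] ... under path[-1] (descendant nesting)."""
    for p in words.get(word, []):
        outer = (-1, float("inf"))
        for tag in path:
            inside = [(b, e) for b, e in spans.get(tag, [])
                      if outer[0] < b and e < outer[1] and b < p < e]
            if not inside:
                break
            outer = min(inside)   # outermost candidate leaves the most room below
        else:
            return True
    return False

words, spans = build_positional_index(
    "<Play><Act><Scene><Verse>it droppeth as the gentle rain</Verse></Scene></Act>"
    "<Title>Merchant</Title></Play>"
)
```

A query such as "droppeth under Verse under Play" then reduces to interval checks: `contained("droppeth", ["Play", "Verse"], words, spans)` tests whether some position of the word falls inside a Verse span that itself lies inside a Play span.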