Mining Techniques Tutorial

Semi-structured Text Mining Methods
Yang Jian, Peking University, Institute of Computer Science and Technology

Text-centric XML
  • Documents marked up as XML
      • E.g., assembly manuals, journal …
  • Queries are user information needs
      • E.g., give me the Section (element) of the document that tells me how to change a brake light
  • Different from well-structured XML queries, where you tightly specify what you're looking for

Vector spaces and XML
  • Vector spaces: tried and tested framework for keyword retrieval
      • Other "bag of words" applications in classification, clustering …
  • For text-centric XML retrieval, can we make use of vector space ideas?
  • Challenge: capture the structure of an XML document in the vector space.

Vector spaces and XML
  • For instance, distinguish between the following two cases
  [Figure: two document trees; recovered labels: The Pearly …, Microsoft, Bill, Bill]

Content-rich XML: …
  [Figure: document tree; recovered labels: Microsoft, The …, Lexicon …]

Encoding the Gates …
  • What are the axes of the vector space?
  • In text retrieval, there would be a single axis for Gates
  • Here we must separate out the two occurrences, under Author and Title
  • Thus, axes must represent not only terms, but something about their position in an XML tree

Queries
  • Before addressing this, let us consider the kinds of queries we want to handle
  [Figure: query trees; recovered label: Microsoft]

Query …
  • The preceding examples can be viewed as subtrees of the document
  • But what about … (Gates somewhere underneath …)?
  • This is harder and we will return to it

Subtrees and …
  • Consider all subtrees of the document that include at least one lexicon term
  [Figure: enumerated subtrees; recovered label: Microsoft]

Structural terms
  • Call each of the resulting (8+, in the previous slide) subtrees a structural term
  • Note that structural terms might occur multiple times in a document
  • Create one axis in the vector space for each distinct structural term
  • Weights based on frequencies for number of occurrences (just as we had tf)
  • All the usual issues with terms (stemming? case folding?) remain

Example of tf …
  [Figure: document tree; recovered label: To be or … to …]
  • Exercise: How many axes are there in this example?
  • Here the structural terms containing to or be would have more weight than those that don't

Down-weighting
  • For the doc on the left: in a structural term rooted at the node Play, shouldn't Hamlet have a higher tf weight …?
  • Idea: multiply the tf contribution of a term to a node k levels up by γ^k, for some γ < 1
  • Example: γ = 0.8
  • For the doc on the previous slide, the tf of Hamlet is multiplied by …, and that of Yorick is multiplied by …, in any structural term rooted at …

The number of structural terms
  • Can be huge
  • Alright, how huge, …?
  • Impractical to build a vector index with so many …
  • Will examine pragmatic solutions to this shortly; for now, continue to believe …

Structural terms: …
  • The notion of structural terms is independent of any schema/DTD for the XML documents
  • Well-suited to a heterogeneous collection of XML documents
  • Each document becomes a vector in the space of structural terms
  • A query tree can likewise be factored into structural terms
      • And represented as a vector
      • Allows weighting portions of the query

Example …
  [Figure: weighted query tree; not recovered]

Weight …
  • The assignment of the weights 0.6 and 0.4 in the previous example to subtrees was …
  • Can be more …
  • Think of it as generated by an application, not necessarily an end-user
  • Queries, documents become normalized vectors
  • Retrieval score computation is "just" a matter of cosine similarity computation

Restrict structural terms
  • Depending on the application, we may restrict the structural terms
  • E.g., may never want to return a Title node, only Book or Play nodes
  • So don't enumerate/index/retrieve/score structural terms rooted at some nodes

The catch
  • This is all very promising, but …
  • How big is this vector space?
  • Can be exponentially large in the size of the document
  • Cannot hope to build such an index
  • And in any case, still fails to answer queries …

Two …
  • Query-time materialization of …
  • Restrict the kinds of subtrees to a manageable set

Query-time materialization
  • Instead of enumerating all structural terms of all docs (and the query), enumerate only for the query
  • The latter is hopefully a small set
  • Now, we're reduced to checking which structural term(s) from the query match a subtree of any document
  • This is tree pattern matching: given a text tree and a pattern tree, find matches
      • Except we have many text trees
      • Our trees are labeled and …

Example
  • Here we seek a doc with Hamlet in the …
  • On finding the match we compute the cosine similarity score
  • After all matches are found, rank by sorting
  [Figure: text tree and query tree; recovered labels: Text, Hamlet, Query, Hamlet]

(Still …)
  • A doc with Yorick somewhere in it
  • Query: …
  • Will get to it

Restricting the subtrees
  • Enumerating all structural terms (subtrees) is prohibitive, for indexing
  • Most subtrees may never be used in processing any query
  • Can we get away with indexing a restricted class of subtrees?
  • Ideally: focus on subtrees likely to arise in queries

JuruXML (IBM …)
  • Only paths including a lexicon term
  • In this example there are only 14 (why?) such paths
  • Thus we have 14 structural terms in the index
  [Figure: document tree; recovered labels: Hamlet, To be or … to …]
  • Why is this far more …?
  • How big can the index be as a function of the …?

  • Could have used other subtrees, e.g., all subtrees with two siblings under a node
  • Which subtrees get used: depends on the likely queries in the …
  • Could be specified at index time: an area with little research so far
  [Figure: recovered labels: Microsoft, Microsoft]

  • Why would this be any different from just …?
  • Because we preserve more of the structure that a query may …
  [Figure: recovered label: Microsoft]

Return to the descendant …
  • No known …
  • Query seeks Gates under …

… in the vector space
  • Devise a match function that yields a score in [0,1] between structural terms
  • E.g., when the structural terms are paths, measure overlap
  • The greater the overlap, the higher the match score
  • Can adjust match for where the overlap occurs

How do we use this in retrieval?
  • First enumerate structural terms in the query
  • Measure each for match against the dictionary of structural terms
      • Just like a postings lookup, except not Boolean (does the term exist)
      • Instead, produce a score that says "80% close to this structural term", etc.
  • Then, retrieve docs with that structural term, compute cosine similarities, etc.

Example of a retrieval …
  [Figure: match scores between query and index structural terms; ST = Structural Term]
  • Now rank the Docs by cosine similarity; e.g., Doc9 scores 0.578.

Closing …
  • But what exactly is a Doc?
  • In a sense, an entire corpus can be viewed as an XML document

What are the Docs in the index?
  • Anything we are prepared to return as an answer
  • Could be nodes, some of their children …

What are queries we can't handle using vector spaces?
  • Find figures that describe the Corba architecture and the paragraphs that refer to those figures
      • Requires JOIN between 2 …
  • Retrieve the titles of articles published in the Special Feature section of the journal IEEE Micro
      • Depends on order of sibling …

Can we do IDF?
  • Yes, but it doesn't make sense to do it corpus-wide
  • Can do it, for instance, within all text under a certain element name, say Chapter
  • Yields a tf-idf weight for each lexicon term under an element
  • Issues: how do we propagate contributions to higher level nodes?

Example
  • Say Gates has high IDF under the Author element
  • How should it be tf-idf weighted for the Book element?
  • Should we use the idf for Gates in Author or that in Book?

XQuery: SQL for XML
  • Usage …
      • Human-readable documents
      • Data-oriented documents
      • Mixed documents (e.g., patient records)
  • Relies on … XML Schema …
  • Turing complete
  • XQuery is still a working draft

XQuery
  • The principal forms of XQuery expressions are:
      • path expressions
      • element constructors
      • FLWR ("flower") expressions
      • list expressions
      • datatype expressions
  • Evaluated with respect to a context

FLWR
    FOR $p IN document("bib.xml")//publisher
    LET $b := document("bib.xml")//book[publisher = $p]
    WHERE count($b) > 100
    RETURN $p
  • FOR generates an ordered list of bindings of publisher names to $p
  • LET associates to each binding a further binding of the list of book elements with that publisher to $b
  • At this stage, we have an ordered list of tuples of bindings: ($p, $b)
  • WHERE filters that list to retain only the desired tuples
  • RETURN constructs for each tuple a resulting value

Queries supported by XQuery
  • Location/position ("chapter …")
  • Simple …
      • /play/title contains …
  • Path …
      • title contains …
      • /play//title contains …
  • Complex …
      • Employees with two …
  • Subsumes: …
  • What about relevance ranking?

How XQuery makes ranking difficult
  • All documents in set A must be ranked above all documents in set B.
  • Fragments must be ordered in depth-first, left-to-right order.

XQuery: Order By clause
    for $d in …
    let $e := document("emps.xml")//emp[deptno = $d]
    where count($e) >= 10
    order by avg($e/salary) descending
    return <big-dept> { $d, …

XQuery: Order By clause
  • Order by clause only allows ordering by …
      • Say by an attribute value
  • Relevance ranking
      • Is often …
      • Can't be expressed easily as a function of the set to be ranked
      • Is better abstracted out of query formulation (cf. …)

XIRQL
  • University of …
  • Goal: open source XML search engine
  • "Returnable" fragments are …
      • E.g., don't return a <bold>some text</bold> fragment
  • Structured Document Retrieval Principle
  • Empower users who don't know the schema
      • Enable search for any person no matter how the schema encodes the data
      • Don't worry about …

Atomic units
  • Specified in …
  • Only atomic units can be returned as result of search (unless unit specified)
  • Tf.idf weighting is applied to atomic units
  • Probabilistic combination of "evidence" from atomic units

XIRQL: Structured Document Retrieval Principle
  • A system should always retrieve the most specific part of a document answering a query.
  • Example query: …
  • Document: <chapter> 0.3 … <section> 0.8 XQL 0.7 syntax …
  • Return section, not chapter

Augmentation weights
  • Ensure that the Structured Document Retrieval Principle is respected.
  • Assume different query conditions are disjoint events -> independence.
  • P(chapter, XQL) = P(XQL|chapter) + P(section|chapter) * P(XQL|section) - P(XQL|chapter) * P(section|chapter) * P(XQL|section) = 0.3 + 0.6*0.8 - 0.3*0.6*0.8 = 0.636
  • Section (0.8) ranked ahead of chapter (0.636)

Datatypes
  • Example: …
  • Assign all elements and attributes with person semantics to this datatype
  • Allow user to search for … without specifying …

XIRQL: …
  • Relevance ranking
  • Fragment/context …
  • Datatypes
  • Semantic …

XML indexing and search

Native XML database
  • Uses XML document as logical unit
  • Should support …
      • PCDATA (parsed character data)
      • Document order
  • Contrast with
      • DB modified for XML
      • Generic IR system modified for XML

XML indexing and search
  • Most native XML databases have taken a DB approach
  • No IR-type relevance ranking
  • Only a few that focus on relevance ranking

Data vs. text-centric XML
  • Data-centric XML: used for messaging between enterprise applications
      • Mainly a recasting of relational data
  • Content-centric XML: used for annotating content
      • Rich in text
      • Demands good integration of text retrieval functionality
      • E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price

Data structures for XML retrieval
  • A very basic …

Data structures for XML retrieval
  • What are the primitives we need?
  • Inverted index: give me all elements matching text query Q
      • We know how to do this: treat each element as a document
  • Give me all elements (immediately) below any instance of the Book element
  • Combination of the above

Parent/child links
  • Number each element
  • Maintain a list of parent-child relationships
      • E.g., Chapter:21 …
  • Enables immediate parent lookup
  • But what about "the word Hamlet under a Scene element under a Play element"?

General positional indexes
  • View the XML document as a text document
  • Build a positional index for each element
  • Mark the beginning and end for each element, …

Positional containment
  [Figure: positional postings for elements and words]
  • Example: droppeth under Verse under Play.

Summary of data structures
  • Path containment etc. can essentially be solved by positional inverted indexes
  • Retrieval consists of "merging" postings
  • All the compression tricks etc. from 276A are still applicable
  • Complications arise from insertion/deletion of elements, text within elements
      • Beyond the scope of this course

INEX: a benchmark for text-centric XML retrieval

INEX
  • Benchmark for the evaluation of XML retrieval
      • Analog of TREC (recall …)
  • Consists of:
      • Set of XML documents
      • Collection of retrieval tasks

INEX
  • Each engine indexes docs
  • Engine team converts retrieval tasks into queries
      • In the XML query language understood by the engine
  • In response, the engine retrieves not docs, but elements within docs
      • Engine ranks retrieved elements

INEX assessment
  • For each query, each retrieved element is human-assessed on two measures:
      • Relevance: how relevant is the retrieved element
      • Coverage: is the retrieved element too specific, too general, or just right
          • E.g., if the query seeks a definition of the Fast Fourier Transform, do I get the equation (too specific), the chapter containing the definition (too general) or the definition itself
  • These assessments are turned into composite precision/recall measures

INEX corpus
  • 12,107 articles from IEEE … Society …
  • 494 …
  • Average article: 1,532 XML nodes
  • Average node depth = …

INEX topics
  • Each topic is an information need, one of two kinds:
      • Content Only (CO): free text queries
      • Content and Structure (CAS): structural constraints, e.g., containment conditions

Sample INEX CO topic
  <Title> computational biology
  <Keywords> computational biology, bioinformatics, genome, genomics, proteomics, sequencing, protein folding
  <Description> Challenges that arise, and approaches being explored, in the interdisciplinary field of computational …
  <Narrative> To be relevant, a document/component must eithe…
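The path-restricted structural terms from the JuruXML slide, together with the cosine scoring step, can be sketched as follows. This is a minimal illustration, not the JuruXML implementation. The function names `path_terms` and `cosine` are hypothetical, and so is the example document, which only guesses at the shape of the slide's figure (a Play with a Title and one Act/Scene). A structural term is taken here to be a downward element path followed by one lexicon term, with raw counts as tf weights. Under that reading the assumed document yields 14 distinct terms.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def path_terms(xml_text):
    """Count (element path, word) structural terms of an XML document.

    Assumption: a structural term is any non-empty suffix of the
    root-to-element path, paired with one word of that element's text.
    """
    counts = Counter()

    def walk(node, ancestors):
        path = ancestors + [node.tag]
        for word in (node.text or "").lower().split():
            # One term per non-empty suffix of the element path.
            for start in range(len(path)):
                counts[("/".join(path[start:]), word)] += 1
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), [])
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical document and query, for illustration only.
doc = path_terms(
    "<play><title>Hamlet</title>"
    "<act><scene>To be or not to be</scene></act></play>"
)
query = path_terms("<title>Hamlet</title>")

print(len(doc))            # 14 distinct structural terms
print(cosine(query, doc))  # similarity of the query tree to the document
```

Because the query is factored with the same function as the document, both live in the same space of structural terms, which is what lets retrieval reduce to a cosine computation.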
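The slides ask for a match function that scores two path-shaped structural terms in [0,1] by their overlap, but they do not fix a formula. One possible choice, purely as an assumption, is the length of the longest common subsequence of the two element paths divided by the length of the longer path, and zero when the lexicon terms differ. The names `lcs_length` and `match` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def match(query_term, doc_term):
    """Overlap score in [0, 1] between two (path, word) structural terms.

    Assumption: the score is LCS(path_q, path_d) / max(len), and terms
    with different words never match.
    """
    (q_path, q_word), (d_path, d_word) = query_term, doc_term
    if q_word != d_word:
        return 0.0
    longest = max(len(q_path), len(d_path))
    return lcs_length(q_path, d_path) / longest if longest else 1.0


# A descendant-style query "Gates somewhere under Book" against two indexed paths.
query = (("book",), "gates")
print(match(query, (("book", "author"), "gates")))  # 0.5
print(match(query, (("book",), "gates")))           # 1.0
print(match(query, (("book", "author"), "jobs")))   # 0.0
```

The resulting score plays the role of the "80% close to this structural term" figure: it would scale the contribution of each partially matching dictionary term before the cosine computation.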
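The augmentation arithmetic from the XIRQL slides (0.3 + 0.6*0.8 - 0.3*0.6*0.8 = 0.636) is a probabilistic OR of the chapter's own evidence and the section's evidence scaled by an augmentation weight. A small sketch, with the hypothetical function name `augmented_score`, reproduces the number:

```python
def augmented_score(p_parent, p_child, augmentation):
    """Combine a parent's own evidence with evidence propagated from a child.

    The child's probability is down-weighted by the augmentation factor and
    then OR-ed with the parent's probability, treating the two as independent.
    """
    propagated = augmentation * p_child
    return p_parent + propagated - p_parent * propagated


chapter = augmented_score(p_parent=0.3, p_child=0.8, augmentation=0.6)
print(round(chapter, 3))  # 0.636
print(0.8 > chapter)      # True: the section still outranks the chapter
```

With an augmentation weight below 1, the enclosing chapter cannot overtake the more specific section on the section's own evidence, which is how the Structured Document Retrieval Principle is preserved in this example.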
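The positional containment idea (a word under Verse under Play) can be sketched by numbering every tag and word in document order, recording a (start, end) span per element, and checking interval nesting. This is a minimal sketch under stated assumptions: the tokenizer handles only simple tags without attributes, the input is assumed well-formed, and the names `build_index` and `contained_positions` are hypothetical. The nested loops stand in for the postings merge a real index would use.

```python
import re
from collections import defaultdict

# Matches an opening tag, a closing tag, or a plain word.
TOKEN = re.compile(r"<(/?)(\w+)>|(\w+)")


def build_index(xml_text):
    """Return (word -> positions, element name -> list of (start, end) spans)."""
    word_positions = defaultdict(list)
    element_spans = defaultdict(list)
    open_tags = []
    for position, token in enumerate(TOKEN.finditer(xml_text)):
        closing, tag, word = token.groups()
        if word:
            word_positions[word.lower()].append(position)
        elif closing:
            name, start = open_tags.pop()  # assumes well-formed nesting
            element_spans[name].append((start, position))
        else:
            open_tags.append((tag, position))
    return word_positions, element_spans


def contained_positions(word, path, word_positions, element_spans):
    """Positions of `word` nested inside the elements in `path`, outermost first."""
    spans = element_spans.get(path[0], [])
    for name in path[1:]:
        # Keep only spans of `name` that lie strictly inside an outer span.
        spans = [(s, e) for (s, e) in element_spans.get(name, [])
                 if any(ps < s and e < pe for ps, pe in spans)]
    return [p for p in word_positions.get(word.lower(), [])
            if any(s < p < e for s, e in spans)]


words, elements = build_index(
    "<play><act><verse>It droppeth as the gentle rain</verse></act>"
    "<title>droppeth</title></play>"
)
print(contained_positions("droppeth", ["play", "verse"], words, elements))  # [4]
print(contained_positions("droppeth", ["play", "title"], words, elements))  # [12]
```

Because containment is decided from positions alone, the same postings answer both immediate parent/child and arbitrary-depth descendant conditions.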
