




Lecture 2: The term vocabulary and postings lists

Recap of the previous lecture (Ch. 1)
- Basic inverted indexes: structure (dictionary and postings)
- Key step in construction: sorting
- Boolean query processing: intersection by linear-time "merging"; simple optimizations
- Overview of course topics

Plan for this lecture
- Elaborate basic indexing
- Preprocessing to form the term vocabulary: documents, tokenization, what terms do we put in the index?
- Postings: faster merges (skip lists), positional postings and phrase queries

Recall the basic indexing pipeline
- Documents to be indexed: "Friends, Romans, countrymen."
- Tokenizer → token stream: Friends, Romans, Countrymen
- Linguistic modules → modified tokens: friend, roman, countryman
- Indexer → inverted index: each term (friend, roman, countryman) with its postings list of docIDs
- (A minimal code sketch of this pipeline follows.)
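As a concrete reference point, here is a minimal sketch of this pipeline (an illustration, not the course's reference implementation): tokenize stands in for the tokenizer, normalize for the linguistic modules (only case folding here), and build_index for the indexer.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Tokenizer: emit maximal alphanumeric character sequences as tokens.
    return re.findall(r"\w+", text)

def normalize(token):
    # Linguistic modules: only case folding here; stemming, accent removal,
    # etc. are discussed later in this lecture.
    return token.lower()

def build_index(docs):
    # Indexer: map each normalized term to a sorted postings list of docIDs.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: "Friends, Romans, countrymen.", 2: "So let it be with Caesar."}
print(build_index(docs))
# {'friends': [1], 'romans': [1], 'countrymen': [1], 'so': [2], 'let': [2], ...}
```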
Parsing a document (Sec. 2.1)
- What format is it in? pdf/word/excel/html?
- What language is it in? What character set is in use?
- Each of these is a classification problem, which we will study later in the course.
- But these tasks are often done heuristically.

Complications: format/language (Sec. 2.1)
- Documents being indexed can include docs from many different languages; a single index may have to contain terms of several languages.
- Sometimes a document or its components can contain multiple languages/formats: a French email with a German pdf attachment.
- What is a unit document? A file? An email (perhaps one of many in an mbox)? An email with 5 attachments? A group of files (PPT or LaTeX as HTML pages)?

Tokens and Terms

Tokenization (Sec. 2.2.1)
- Input: "Friends, Romans, Countrymen"
- Output: tokens Friends, Romans, Countrymen
- A token is a sequence of characters in a document.
- Each such token is now a candidate for an index entry, after further processing (described below).
- But what are valid tokens to emit?

Tokenization (Sec. 2.2.1)
- Issues in tokenization:
- Finland's capital → Finland? Finlands? Finland's?
- Hewlett-Packard → Hewlett and Packard as two tokens?
- state-of-the-art: break up hyphenated sequences? co-education; lowercase, lower-case, lower case?
- It can be effective to get the user to put in possible hyphens.
- San Francisco: one token or two? How do you decide it is one token? (A toy tokenizer exposing these choices is sketched below.)
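To make these policy questions concrete, the toy tokenizer below exposes two of the choices as flags (the flag names are invented for this sketch): whether to split hyphenated sequences and whether to keep internal apostrophes.

```python
import re

def tokenize(text, split_hyphens=True, keep_apostrophes=False):
    # Toy tokenizer: the two flags encode the policy questions above.
    word = r"\w+(?:'\w+)*" if keep_apostrophes else r"\w+"
    pattern = word if split_hyphens else rf"{word}(?:-{word})*"
    return re.findall(pattern, text)

text = "Finland's capital and Hewlett-Packard's state-of-the-art lab"
print(tokenize(text))
# ['Finland', 's', 'capital', 'and', 'Hewlett', 'Packard', 's', 'state', ...]
print(tokenize(text, split_hyphens=False, keep_apostrophes=True))
# ["Finland's", 'capital', 'and', "Hewlett-Packard's", 'state-of-the-art', 'lab']
```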
Numbers (Sec. 2.2.1)
- 3/12/91, Mar. 12, 1991, 12/3/91
- 55 B.C.
- B-52
- My PGP key is 324a3df234cb23e
- (800) 234-2333
- Numbers often have embedded spaces.
- Older IR systems may not index numbers, but numbers are often very useful: think about looking up error codes/stacktraces on the web.
- Will often index "meta-data" separately: creation date, format, etc.

Tokenization: language issues (Sec. 2.2.1)
- French: is L'ensemble one token or two? L? L'? Le? We want l'ensemble to match with un ensemble. Until at least 2003, it didn't on Google. Internationalization!
- German noun compounds are not segmented: Lebensversicherungsgesellschaftsangestellter ("life insurance company employee"). German retrieval systems benefit greatly from a compound-splitter module; it can give a 15% performance boost for German.

Tokenization: language issues (Sec. 2.2.1)
- Chinese and Japanese have no spaces between words: 莎拉波娃現(xiàn)在居住在美國東南部的佛羅里達。 ("Sharapova now lives in southeastern Florida in the US.")
- Not always guaranteed a unique tokenization.
- Further complicated in Japanese, with multiple alphabets intermingled: dates/amounts in multiple formats, e.g. a sentence mixing Katakana, Hiragana, Kanji, and Romaji with amounts like $500K (約6,000萬円).
- End-users can express a query entirely in hiragana!

Tokenization: language issues (Sec. 2.2.1)
- Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right.
- Words are separated, but letter forms within a word form complex ligatures.
- Example: an Arabic sentence meaning "Algeria achieved its independence in 1962 after 132 years of French occupation", read starting from the right.

Stop words (Sec. 2.2.2)
- With a stop list, you exclude from the dictionary entirely the commonest words. Intuition: they have little semantic content (the, a, and, to, be), and there are a lot of them (about 30% of postings for the top 30 words).
- But the trend is away from doing this:
- Good compression techniques (lecture 5) mean the space for including stop words in a system is very small.
- Good query optimization techniques (lecture 7) mean you pay little at query time for including stop words.
- You need them for phrase queries ("King of Denmark"), various song titles, etc. ("Let it be", "To be or not to be"), and "relational" queries ("flights to London"). (The sketch below shows what a stop list does to such queries.)
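A tiny sketch of stop-word removal, with a made-up stop list, showing what it does to exactly the query types listed above:

```python
# Tiny illustrative stop list; real lists have a few dozen to a few hundred words.
STOP_WORDS = {"the", "a", "and", "to", "be", "or", "not", "of", "it"}

def remove_stop_words(tokens):
    # Drop the commonest words before indexing (or before query processing).
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("to be or not to be".split()))   # [] -- the phrase query vanishes
print(remove_stop_words("king of denmark".split()))      # ['king', 'denmark'] -- "of" is lost
print(remove_stop_words("flights to london".split()))    # ['flights', 'london']
```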
Normalization to terms (Sec. 2.2.3)
- We need to "normalize" words in indexed text as well as query words into the same form: we want to match U.S.A. and USA.
- The result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary.
- We most commonly implicitly define equivalence classes of terms by, e.g.:
- deleting periods to form a term: U.S.A., USA → USA
- deleting hyphens to form a term: anti-discriminatory, antidiscriminatory → antidiscriminatory

Normalization: other languages (Sec. 2.2.3)
- Accents: e.g., French résumé vs. resume.
- Umlauts: e.g., German Tuebingen vs. Tübingen. These should be equivalent.
- Most important criterion: how are your users likely to write their queries for these words?
- Even in languages that standardly have accents, users often may not type them. It is often best to normalize to a de-accented term: Tuebingen, Tübingen, Tubingen → Tubingen

Normalization: other languages (Sec. 2.2.3)
- Normalization of things like date forms: 7月30日 vs. 7/30.
- Japanese use of kana vs. Chinese characters.
- Tokenization and normalization may depend on the language and so are intertwined with language detection.
- Crucial: we need to "normalize" indexed text as well as query terms into the same form.
- "Morgen will ich in MIT ..." (is this German "mit"?)

Case folding (Sec. 2.2.3)
- Reduce all letters to lower case.
- Exception: upper case in mid-sentence? E.g., General Motors; Fed vs. fed; SAIL vs. sail.
- It is often best to lowercase everything, since users will use lowercase regardless of "correct" capitalization. (A combined sketch of these normalization steps follows.)
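The equivalence-classing steps above (deleting periods and hyphens, case folding, removing accents) can be composed into one normalization function. A minimal sketch, using Python's standard unicodedata module for the de-accenting step; the exact policy is a design choice, not something the slides prescribe:

```python
import unicodedata

def normalize(token):
    # Map a token to its equivalence-class representative term.
    token = token.lower()                            # case folding
    token = token.replace(".", "").replace("-", "")  # U.S.A. -> usa, anti-discriminatory -> antidiscriminatory
    # Decompose accented characters into base character + combining mark,
    # then drop the combining marks: Tübingen -> tubingen.
    token = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in token if not unicodedata.combining(ch))

for t in ["U.S.A.", "USA", "anti-discriminatory", "Tübingen", "Tuebingen"]:
    print(t, "->", normalize(t))
# U.S.A. -> usa, USA -> usa, anti-discriminatory -> antidiscriminatory,
# Tübingen -> tubingen, Tuebingen -> tuebingen  (note: "ue" still differs from "u")
```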
Normalization to terms (Sec. 2.2.3)
- An alternative to equivalence classing is to do asymmetric expansion.
- An example of where this may be useful:
- Enter: window. Search: window, windows.
- Enter: windows. Search: Windows, windows, window.
- Enter: Windows. Search: Windows.
- Potentially more powerful, but less efficient. (A query-time sketch follows.)
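Asymmetric expansion can live entirely at query time as a lookup table. A sketch with a hand-built (hypothetical) table mirroring the window/Windows example:

```python
# Query-time expansion table: what the user types -> which index terms to search.
# The mapping is asymmetric: "window" expands, "Windows" does not.
EXPANSIONS = {
    "window":  ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}

def expand_query_term(term):
    # Fall back to the term itself when no expansion is listed.
    return EXPANSIONS.get(term, [term])

print(expand_query_term("window"))    # ['window', 'windows']
print(expand_query_term("Windows"))   # ['Windows']
```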
Thesauri and soundex
- Do we handle synonyms and homonyms? E.g., by hand-constructed equivalence classes: car = automobile, color = colour.
- We can rewrite to form equivalence-class terms: when the document contains automobile, index it under car-automobile (and vice versa).
- Or we can expand a query: when the query contains automobile, look under car as well.
- What about spelling mistakes? One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics.
- More in lectures 3 and 9.

Lemmatization (Sec. 2.2.4)
- Reduce inflectional/variant forms to base form. E.g.:
- am, are, is → be
- car, cars, car's, cars' → car
- the boy's cars are different colors → the boy car be different color
- Lemmatization implies doing "proper" reduction to dictionary headword form. (A toy lookup version is sketched below.)
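A real lemmatizer needs a morphological dictionary and part-of-speech information; the toy lookup below covers only the slide's own examples and is purely illustrative.

```python
# Toy lemma table for the examples above; a real system would use a full
# morphological dictionary (e.g. WordNet) plus part-of-speech tagging.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "boy's": "boy", "colors": "color",
}

def lemmatize(token):
    # Unknown tokens are returned unchanged.
    return LEMMAS.get(token, token)

sentence = "the boy's cars are different colors".split()
print(" ".join(lemmatize(t) for t in sentence))
# the boy car be different color
```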
Stemming (Sec. 2.2.4)
- Reduce terms to their "roots" before indexing.
- "Stemming" suggests crude affix chopping; it is language dependent.
- E.g., automate(s), automatic, automation all reduce to automat.
- The sentence "for example compressed and compression are both accepted as equivalent to compress" stems to "for exampl compress and compress ar both accept as equival to compress".

Porter's algorithm (Sec. 2.2.4)
- Commonest algorithm for stemming English; results suggest it's at least as good as other stemming options.
- Conventions + 5 phases of reductions: phases are applied sequentially, and each phase consists of a set of commands.
- Sample convention: of the rules in a compound command, select the one that applies to the longest suffix.

Typical rules in Porter (Sec. 2.2.4)
- sses → ss
- ies → i
- ational → ate
- tional → tion
- Rules sensitive to the measure of words (m > 1): EMENT → "", so replacement → replac, but cement → cement. (A toy version of these rules is sketched below.)
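The flavor of these rules fits in a few lines. The sketch below applies only the handful of rules listed above, with a crude stand-in for Porter's measure m; it is nowhere near the real five-phase algorithm, which you would take from an existing implementation.

```python
def crude_measure(stem):
    # Rough stand-in for Porter's measure m: count vowel-to-consonant transitions.
    vowels = "aeiou"
    m, prev_was_vowel = 0, False
    for ch in stem:
        is_vowel = ch in vowels
        if prev_was_vowel and not is_vowel:
            m += 1
        prev_was_vowel = is_vowel
    return m

SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def toy_porter_step(word):
    # Porter's convention: the rule matching the longest suffix wins.
    for suffix, repl in sorted(SUFFIX_RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            return word[: -len(suffix)] + repl
    # (m > 1) EMENT -> "" : replacement -> replac, but cement stays cement.
    if word.endswith("ement") and crude_measure(word[:-5]) > 1:
        return word[:-5]
    return word

for w in ["caresses", "ponies", "relational", "conditional", "replacement", "cement"]:
    print(w, "->", toy_porter_step(w))
# caresses -> caress, ponies -> poni, relational -> relate,
# conditional -> condition, replacement -> replac, cement -> cement
```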
Other stemmers (Sec. 2.2.4)
- Other stemmers exist, e.g., the Lovins stemmer: single-pass, longest-suffix removal (about 250 rules).
- Full morphological analysis: at most modest benefits for retrieval.
- Do stemming and other normalizations help? For English, very mixed results: it helps recall but harms precision (operative (dentistry) → oper, operational (research) → oper, operating (systems) → oper).
- Definitely useful for Spanish, German, Finnish, ...: 30% performance gains for Finnish!

Language-specificity (Sec. 2.2)
- Many of the above features embody transformations that are language-specific and often application-specific.
- These are "plug-in" addenda to the indexing process.
- Both open-source and commercial plug-ins are available for handling these.

Dictionary entries (Sec. 2.2)
- ensemble.french
- 時間.japanese
- MIT.english
- mit.german
- guaranteed.english
- entries.english
- sometimes.english
- tokenization.english
- These may be grouped by language (or not). More on this in ranking/query processing.

Faster postings merges: skip pointers / skip lists

Recall basic merge (Sec. 2.3)
- Walk through the two postings lists simultaneously, in time linear in the total number of postings entries:
- Brutus: 2 → 4 → 8 → 41 → 48 → 64 → 128
- Caesar: 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31
- Intersection so far: 2, 8
- If the list lengths are m and n, the merge takes O(m+n) operations. Can we do better? Yes (if the index isn't changing too fast). (The basic merge is sketched below for reference.)
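For reference, the linear-time merge as a short sketch (the skip-pointer variant further below builds on it):

```python
def intersect(p1, p2):
    # Standard merge of two sorted postings lists: O(m + n) comparisons.
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect(brutus, caesar))   # [2, 8]
```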
Augment postings with skip pointers (at indexing time) (Sec. 2.3)
- Why? To skip postings that will not figure in the search results.
- How? And where do we place skip pointers?
- Example (same lists as above): 2 → 4 → 8 → 41 → 48 → 64 → 128 with skip pointers 2→41 and 41→128, and 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skip pointers 1→11 and 11→31.

Query processing with skip pointers (Sec. 2.3)
- Using the lists above, suppose we've stepped through the lists until we process 8 on each list. We match it and advance.
- We then have 41 on the upper list and 11 on the lower; 11 is smaller.
- But the skip successor of 11 on the lower list is 31, so we can skip ahead past the intervening postings.

Where do we place skips? (Sec. 2.3)
- Tradeoff: more skips → shorter skip spans → more likely to skip, but lots of comparisons to skip pointers.
- Fewer skips → fewer pointer comparisons, but then long skip spans → few successful skips.

Placing skips (Sec. 2.3)
- Simple heuristic: for postings of length L, use √L evenly spaced skip pointers.
- This ignores the distribution of query terms.
- Easy if the index is relatively static; harder if L keeps changing because of updates.
- This definitely used to help; with modern hardware it may not, unless you're memory-based: the I/O cost of loading a bigger postings list can outweigh the gains from quicker in-memory merging! (A sketch of skip-pointer intersection follows.)
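A sketch of intersection with skip pointers. The data layout is an assumption made for this illustration: each postings list gets a parallel array in which roughly every √L-th entry stores the index it skips to.

```python
import math

def add_skips(postings):
    # Parallel skip array: skips[i] = index this entry skips to, or None.
    # Roughly sqrt(L) evenly spaced skip pointers (the heuristic above).
    L = len(postings)
    step = int(math.sqrt(L)) or 1
    return [i + step if i % step == 0 and i + step < L else None
            for i in range(L)]

def intersect_with_skips(p1, p2):
    s1, s2 = add_skips(p1), add_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            # Follow skip pointers on p1 as long as they don't overshoot p2[j];
            # otherwise fall back to a single step, as in the basic merge.
            if s1[i] is not None and p1[s1[i]] <= p2[j]:
                while s1[i] is not None and p1[s1[i]] <= p2[j]:
                    i = s1[i]
            else:
                i += 1
        else:
            if s2[j] is not None and p2[s2[j]] <= p1[i]:
                while s2[j] is not None and p2[s2[j]] <= p1[i]:
                    j = s2[j]
            else:
                j += 1
    return answer

brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(brutus, caesar))   # [2, 8]
```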
Phrase queries and positional indexes

Phrase queries (Sec. 2.4)
- We want to be able to answer queries such as "stanford university", as a phrase.
- Thus the sentence "I went to university at Stanford" is not a match.
- The concept of phrase queries has proven easily understood by users; it is one of the few "advanced search" ideas that works.
- Many more queries are implicit phrase queries.
- For this, it no longer suffices to store only <term : docs> entries.

A first attempt: biword indexes (Sec. 2.4.1)
- Index every consecutive pair of terms in the text as a phrase.
- For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
- Each of these biwords is now a dictionary term.
- Two-word phrase query processing is now immediate.

Longer phrase queries (Sec. 2.4.1)
- Longer phrases are processed as we did with wild-cards: "stanford university palo alto" can be broken into the Boolean query on biwords: "stanford university" AND "university palo" AND "palo alto".
- Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase. Can have false positives! (See the sketch below.)
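A sketch of a biword index and of answering a longer phrase as an AND of biwords. The second document is constructed so that it contains all the query's biwords without containing the phrase, illustrating the false-positive caveat above.

```python
from collections import defaultdict

def biwords(tokens):
    # Every consecutive pair of terms becomes one dictionary term.
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def build_biword_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text.lower().split()):
            index[bw].add(doc_id)
    return index

def phrase_query(index, phrase):
    # Longer phrases: AND together the component biwords (may over-generate).
    required = biwords(phrase.lower().split())
    postings = [index.get(bw, set()) for bw in required]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "stanford university palo alto",
    2: "the university palo alto shuttle stops at stanford university",
}
idx = build_biword_index(docs)
print(phrase_query(idx, "stanford university palo alto"))
# {1, 2} -- doc 2 is a false positive: it has all three biwords but not the phrase
```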
Extended biwords (Sec. 2.4.1)
- Parse the indexed text and perform part-of-speech tagging (POST).
- Bucket the terms into (say) nouns (N) and articles/prepositions (X).
- Call any string of terms of the form NX*N an extended biword. Each such extended biword is now made a term in the dictionary.
- Example: "catcher in the rye" is N X X N.
- Query processing: parse the query into N's and X's, segment it into enhanced biwords, and look up in the index: "catcher rye".

Issues for biword indexes (Sec. 2.4.1)
- False positives, as noted before.
- Index blowup due to a bigger dictionary: infeasible for more than biwords, and big even for them.
- Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy.

Solution 2: positional indexes (Sec. 2.4.2)
- In the postings, store for each term the position(s) in which tokens of it appear.

Positional index example (Sec. 2.4.2)
- For phrase queries, we use a merge algorithm recursively at the document level.
- But we now need to deal with more than just equality.
- Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Processing a phrase query (Sec. 2.4.2)
- Extract inverted index entries for each distinct term: to, be, or, not.
- Merge their doc:position lists to enumerate all positions with "to be or not to be":
- to: 2: 1,17,74,222,551; 4: 8,16,190,429,433; 7: 13,23,191; ...
- be: 1: 17,19; 4: 17,191,291,430,434; 5: 14,19,101; ...
- Same general method for proximity searches. (A sketch of the positional merge follows.)
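A sketch of the document-level merge plus position checking, assuming a positional index laid out as term -> {docID: sorted positions} (that layout is an assumption of this sketch): document d matches when some position p has term i of the phrase at p + i for every i.

```python
def phrase_docs(pos_index, phrase):
    # pos_index: term -> {docID: sorted list of positions}  (assumed layout)
    terms = phrase.lower().split()
    # Document-level merge: candidate docs must contain every term.
    candidates = set.intersection(*(set(pos_index.get(t, {})) for t in terms))
    matches = set()
    for doc in candidates:
        # Positions where the whole phrase could start: p is a start if
        # term i occurs at position p + i for every i.
        starts = set(pos_index[terms[0]][doc])
        for i, term in enumerate(terms[1:], start=1):
            starts &= {p - i for p in pos_index[term][doc]}
        if starts:
            matches.add(doc)
    return matches

pos_index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(phrase_docs(pos_index, "to be"))
# {4}: "to" at 16, 190, 429, 433 is immediately followed by "be" at 17, 191, 430, 434
```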
Proximity queries (Sec. 2.4.2)
- LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
- Again, here /k means "within k words of".
- Clearly, positional indexes can be used for such queries; biword indexes cannot.
- Exercise: adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k? This is a little tricky to do correctly. (One possible sketch follows.)
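One possible answer to the exercise (a sketch under the same assumed index layout, not necessarily the intended solution): keep the document-level merge, and relax the position check from "exactly adjacent" to "within k", walking the two sorted position lists linearly rather than comparing all pairs.

```python
def within_k_docs(pos_index, term1, term2, k):
    # Docs where some occurrence of term1 is within k positions of some
    # occurrence of term2 (on either side), e.g. LIMIT /3 STATUTE with k = 3.
    docs1, docs2 = pos_index.get(term1, {}), pos_index.get(term2, {})
    result = set()
    for doc in set(docs1) & set(docs2):
        pp1, pp2 = docs1[doc], docs2[doc]
        i = j = 0
        # Linear walk over the two sorted position lists: if the current pair
        # is more than k apart, only advancing the smaller position can help.
        while i < len(pp1) and j < len(pp2):
            if abs(pp1[i] - pp2[j]) <= k:
                result.add(doc)
                break
            if pp1[i] < pp2[j]:
                i += 1
            else:
                j += 1
    return result

pos_index = {
    "limit":   {3: [5, 60]},
    "statute": {3: [7, 120], 9: [4]},
}
print(within_k_docs(pos_index, "limit", "statute", 3))   # {3}: positions 5 and 7 are within 3
```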