




Search Engine
朱廷劭 (Zhu, Tingshao), Ph.D.

Examples of search engines
Conventional (library catalog): search by keyword, title, author, etc.
Text-based (Lexis-Nexis, Google, Yahoo!): search by keywords; limited search using queries in natural language.
Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, ...).
Question answering systems (Ask, NSIR, Answerbus): search in (restricted) natural language.
Clustering systems (Vivisimo, Clusty).
Research systems (Lemur, Nutch).

What does it take to build a search engine?
Decide what to index.
Collect it.
Index it (efficiently).
Keep the index up to date.
Provide user-friendly query facilities.

What else?
Understand the structure of the web for efficient crawling.
Understand user information needs.
Preprocess text and other unstructured data.
Cluster data.
Classify data.
Evaluate performance.
How Search Engines Work
Three main parts:
Gather the contents of all web pages (using a program called a crawler or spider).
Organize the contents of the pages in a way that allows efficient retrieval (indexing).
Take in a query, determine which pages match, and show the results (ranking and display of results).

Standard Web Search Engine Architecture
(Diagram.) Crawler machines crawl the web; documents are checked for duplicates and stored; an inverted index mapping words to DocIds is built; the search engine servers take the user query, look it up in the inverted index, and show results to the user.

Spiders (crawlers)
How to find web pages to visit and copy?
Can start with a list of domain names, and visit the home pages there.
Look at the hyperlinks on the home page, and follow those links to more pages.
Use HTTP commands to GET the pages.
Keep a list of URLs visited, and those still to be visited.
Each time the program loads in a new HTML page, add the links in that page to the list to be crawled.
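A minimal sketch of this crawling loop in Python, using only the standard library; the seed list, the page limit, and the crude regex-based link extraction are illustrative assumptions, not part of the original slides. A real crawler would also respect robots.txt and rate-limit itself, as the next slide requires.

import re
import urllib.request
from collections import deque

def crawl(seed_urls, max_pages=50):
    """Breadth-first crawl: GET each page, remember visited URLs,
    and add newly discovered links to the to-be-visited list."""
    frontier = deque(seed_urls)   # URLs still to be visited
    visited = set()               # URLs already requested (avoids revisiting)
    pages = {}                    # url -> raw HTML of fetched pages

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except Exception:
            continue              # servers are often down or slow
        pages[url] = html
        # Very rough link extraction; a real crawler would use an HTML parser.
        for link in re.findall(r'href="(http[^"]+)"', html):
            if link not in visited:
                frontier.append(link)
    return pages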
Four Laws of Crawling
A crawler must show identification.
A crawler must obey the robots exclusion standard (/wc/norobots.html).
A crawler must not hog resources.
A crawler must report errors.

Example robots.txt file
/robots.txt (just the first few lines).
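A small sketch of obeying the robots exclusion standard with Python's standard urllib.robotparser; the host and the user-agent string are placeholders, not values from the slides.

import urllib.robotparser

# Fetch and parse the site's robots.txt, then ask whether our identified
# crawler is allowed to fetch a given URL before requesting it.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # placeholder host
rp.read()

if rp.can_fetch("ExampleCrawler/1.0", "https://example.com/some/page.html"):
    print("allowed by robots.txt")
else:
    print("disallowed by robots.txt")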
Lots of tricky aspects
Servers are often down or slow.
Hyperlinks can get the crawler into cycles.
Some websites have junk in the web pages.
Now many pages have dynamic content.
The "hidden" web: e.g., you don't see the course schedules until you run a query.
The web is HUGE.
"Freshness"
Need to keep checking pages.
Pages change (25%; 7% large changes), at different frequencies. Who is the fastest changing?
Pages are removed.
Many search engines cache the pages (store a copy on their own servers).

What really gets crawled?
A small fraction of the Web that search engines know about; no search engine is exhaustive.
Not the "live" Web, but the search engine's index.
Not the "Deep Web".
Mostly HTML pages, but other file types too: PDF, Word, PPT, etc.

Index (the database)
Record information about each page:
List of words: in the title? How far down in the page? Was the word in boldface?
URLs of pages pointing to this one.
Anchor text on pages pointing to this one.
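A rough sketch of what such a per-page record could look like in Python; the field names are assumptions for illustration, not a schema given in the slides.

from dataclasses import dataclass, field

@dataclass
class WordOccurrence:
    position: int        # how far down in the page the word appears
    in_title: bool       # was the word in the title?
    in_bold: bool        # was the word in boldface?

@dataclass
class PageRecord:
    url: str
    words: dict = field(default_factory=dict)         # word -> list of WordOccurrence
    inlink_urls: list = field(default_factory=list)   # URLs of pages pointing to this one
    anchor_texts: list = field(default_factory=list)  # anchor text on those pointing pages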
Inverted Index
How to store the words for fast lookup. Basic steps:
Make a "dictionary" of all the words in all of the web pages.
For each word, list all the documents it occurs in.
Often omit very common words ("stop words").
Sometimes stem the words (also called morphological analysis): cats -> cat, running -> run.

Inverted Index Example
Image from /documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/chapter_2_section_2.html

Inverted Index
In reality, this index is HUGE.
Need to store the contents across many machines.
Need to do optimization tricks to make lookup fast.
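A toy Python sketch of these basic steps; the stop-word list and the suffix-stripping "stemmer" are deliberately crude stand-ins for real components.

import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}   # tiny illustrative list

def stem(word):
    """Crude stand-in for morphological analysis (cats -> cat, running -> run)."""
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """docs: doc_id -> text.  Returns the inverted index: word -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                index[stem(word)].add(doc_id)
    return index

docs = {1: "The cat is running", 2: "Cats and dogs", 3: "Search engines index the web"}
index = build_index(docs)
print(index["cat"])   # {1, 2}
print(index["run"])   # {1}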
Results ranking
Search engine receives a query, then:
Looks up the words in the index and retrieves many documents, then
Rank orders the pages and extracts "snippets" or summaries containing query words.
Most web search engines assume the user wants all of the words (Boolean AND, not OR).
These are complex and highly guarded algorithms, unique to each search engine.

Some ranking criteria
For a given candidate result page, use:
Number of matching query words in the page.
Proximity of matching words to one another.
Location of terms within the page.
Location of terms within tags, e.g. <title>, <h1>, link text, body text.
Anchor text on pages pointing to this one.
Frequency of terms on the page and in general.
Link analysis of which pages point to this one.
(Sometimes) click-through analysis: how often the page is clicked on.
How "fresh" is the page.
Complex formulae combine these together.
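A minimal Python sketch of that query path: Boolean-AND lookup in an inverted index, then a toy score built from two of the criteria above (term frequency and location in the page). The weighting is illustrative only, not any real engine's formula.

def search(query, index, docs):
    """Boolean-AND retrieval over an inverted index (word -> set of doc ids),
    then rank by term frequency plus a small bonus for early occurrence."""
    terms = query.lower().split()
    candidates = set.intersection(*(index.get(t, set()) for t in terms))
    ranked = []
    for doc_id in candidates:
        text = docs[doc_id].lower()
        tf = sum(text.count(t) for t in terms)        # frequency of terms on the page
        first_pos = min(text.find(t) for t in terms)  # location of terms within the page
        ranked.append((tf + 1.0 / (1 + first_pos), doc_id))
    return sorted(ranked, reverse=True)

docs = {1: "search engines rank web pages", 2: "web pages link to other web pages"}
index = {"search": {1}, "engines": {1}, "rank": {1}, "web": {1, 2},
         "pages": {1, 2}, "link": {2}, "to": {2}, "other": {2}}
print(search("web pages", index, docs))   # doc 2 scores higher: more matches, earlier in the page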
Machine Learned Ranking
Goal: automatically construct a ranking function.
Input: a large number of training examples, features that predict relevance, and relevance metrics.
Output: a ranking function.
Enables a rapid experimental cycle: scientific investigation of modifications to existing features and of new features.

Ranking Features (from Jan Pedersen's lecture)
A0 - A4: anchor text score per term
W0 - W4: term weights
L0 - L4: first occurrence location (encodes hostname and title match)
SP: spam index (logistic regression of 85 spam filter variables against relevance scores)
F0 - F4: term occurrence frequency within document
DCLN: document length (tokens)
ER: Eigenrank
HB: extra-host unique inlink count
ERHB: ER*HB
A0W0 etc.: A0*W0
QA: site factor (logistic regression of 5 site link and URL count ratios)
SPN: proximity
FF: family friendly rating
UD: URL depth

Ranking Decision Tree (from Jan Pedersen's lecture)
(Figure: a learned decision tree whose internal nodes test features such as A0W0, L0, L1, W0, F0, and F2 against thresholds, and whose leaves give relevance scores R.)
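A pointwise sketch of this idea in Python, assuming scikit-learn is available: fit a small decision tree that maps feature vectors (named after a few of the features above) to relevance labels, then score candidate documents. The feature values and labels are made-up toy numbers, only meant to show the training and scoring steps.

from sklearn.tree import DecisionTreeRegressor

# One row per (query, document) training example:
# [A0 anchor score, W0 term weight, L0 first-occurrence location, F0 term frequency, UD url depth]
X_train = [
    [0.9, 0.8, 1, 12, 1],    # looks relevant
    [0.1, 0.3, 40, 2, 4],    # looks less relevant
    [0.7, 0.6, 5, 8, 2],
    [0.0, 0.2, 60, 1, 5],
]
y_train = [1.0, 0.1, 0.8, 0.0]   # toy relevance judgments

model = DecisionTreeRegressor(max_depth=3)
model.fit(X_train, y_train)

# Score new candidate documents for a query and sort best-first.
candidates = [[0.8, 0.7, 3, 10, 1], [0.2, 0.2, 50, 1, 6]]
ranking = sorted(zip(model.predict(candidates), candidates), reverse=True)
print(ranking)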
The importance of anchor text
Example: a link such as <a href=".../i141">A terrific course on search engines</a>.
The anchor text summarizes what the website is about.

Measuring Importance of Linking
PageRank Algorithm
Idea: important pages are pointed to by other important pages.
Method: each link from one page to another is counted as a "vote" for the destination page.
But the importance of the starting page also influences the importance of the destination page.
And those pages' scores, in turn, depend on those linking to them.
(Image and explanation from ...)

Measuring Importance of Linking
Example: each page starts with 100 points. Each page's score is recalculated by adding up the score from each incoming link. This is the score of the linking page divided by the number of outgoing links it has. E.g., the page in green has 2 outgoing links, and so its "points" are shared evenly by the 2 pages it links to. Keep repeating the score updates until no more changes.
(Image and explanation from ...)
What d…
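A minimal Python sketch of the iterative scheme described on this slide; the three-page link graph is a made-up example, and a fixed iteration count stands in for "until no more changes". (Real PageRank also adds a damping factor, which this simplified description omits.)

def link_scores(links, start=100.0, iterations=20):
    """Every page starts with the same score; on each round a page's score is
    split evenly among its outgoing links and added to the destination pages."""
    pages = set(links) | {dest for dests in links.values() for dest in dests}
    scores = {p: start for p in pages}
    for _ in range(iterations):
        new_scores = {p: 0.0 for p in pages}
        for page, dests in links.items():
            if not dests:
                continue
            share = scores[page] / len(dests)   # score divided by number of outgoing links
            for dest in dests:
                new_scores[dest] += share       # each link is a "vote" for the destination
        scores = new_scores
    return scores

links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}   # toy web graph
print(link_scores(links))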