




English original

An Approach to Reduce Web Crawler Traffic Using Asp.Net

Nowadays search engines transfer web data from one place to another. They work on a client-server architecture in which a central server manages all the information. A web crawler is a program that extracts information from the web and sends it to the search engine for further processing. It is found that the largest share of traffic (approximately 40.1%) is due to web crawlers. The proposed scheme shows how a web crawler can reduce this traffic using a dynamic web page and HTTP GET.

I. INTRODUCTION

All search engines have powerful crawlers that visit the Internet from time to time to extract useful information. The retrieved pages are indexed and stored in a database, as shown in Figure 1. In effect, the Internet is a directed graph with web pages as nodes and hyperlinks as edges, so the search operation can be abstracted as the traversal of a directed graph. By following the link structure of the web, we can reach a number of new pages starting from a set of starting pages. Web crawlers are designed to retrieve web pages and add them, or their representations, to a local repository or database. Crawlers update their information once a week, and sometimes only monthly or quarterly, so they cannot provide an up-to-date version of frequently updated pages. To keep up with frequent updates without placing a large burden on the content provider, we believe that retrieving and processing data near the data source is inevitable.

More than one search engine is currently available on the market. The resulting increase in the complexity of web traffic requires that we base our model on the notion of web requests rather than web pages. Web crawlers are software systems that use the text and links on web pages to create search indexes of those pages, using HTML links to follow, or crawl, the connections between pages.

Figure 1, Architecture of a web search engine.

The WWW is a hyperlinked repository of trillions of hypertext documents [9] lying on different websites. World Wide Web (Web) traffic continues to increase and is now estimated to be more than 70 percent of the total traffic on the Internet.

A. Basic Crawling Terminology

We need to know some basic web crawler terminology, which plays an important role in the implementation of a crawler.

Seed page: Crawling means traversing the web recursively, starting by picking a URL from a given set of URLs. The starting URL is the entry point from which a crawler begins its search. This set of URLs is known as the seed page.

Frontier: The crawling procedure starts with a given URL, extracts the links from it, and adds them to a list of unvisited URLs. This unvisited list is known as the frontier. The frontier is implemented as a queue.

Parser: Parsing may mean simple hyperlink/URL extraction, or it may involve the more complex process of tidying up the HTML content in order to analyze the HTML tag tree. The job of any parser is to parse the fetched pages, extract the list of new URLs from them, and return the new unvisited URLs to the frontier.

The basic algorithm of a web crawler is given below:

Start
  Read the URL from the seed URL.
  Check whether the document has already been downloaded.
  If the document has already been downloaded, break.
  Else add it to the frontier.
  Now pick the URL from the frontier and extract the new links from it.
  Add all the newly found URLs to the frontier.
  Continue.
End

The main function of a crawler is to add new links to the frontier and to select a new URL from it.

II. RELATED WORK

To reduce web crawler traffic, many researchers have worked in the following areas:

- One author used dynamic web pages with an HTTP GET request carrying a last-visit parameter.
- One approach uses an active network to reduce unnecessary crawler traffic.
- One author proposed using a bandwidth control system to reduce web crawler traffic over the Internet.
- One approach places a mobile crawler at the web server. The crawler checks for updates on the website and sends them to the search engine for indexing.
- Another work designs a new web crawler using VB.NET technology.

III. PERFORMANCE METRICS

In implementing the web crawler we made some assumptions, purely to simplify the algorithm, the implementation, and the results. The crawler performs the following steps:

- Remove a URL from the URL list.
- Determine the protocol of the underlying host, such as http or ftp.
- Download the corresponding document.
- Extract any links contained in it.
- Add these links back to the URL list.

IV. SIMULATOR

The simulator was designed to study the behavior of different crawling algorithms on the same set of URLs. We designed the crawler as a VB.NET and ASP.NET Windows application project. Our crawler can work both globally and locally, meaning that it can produce results on an intranet as well as on the Internet. It takes a URL and a location or name under which the crawling results are saved in an MS Access database.

Figure 2, Snapshot of Web Crawler.

Figure 2 is a snapshot of the user interface of the web crawler, which runs on either an intranet or the Internet. To obtain crawler results we use a website. At each simulation step, the scheduler chooses the topmost website from the queue of websites and sends that site's information to a module that simulates downloading pages from the website. For this simulator we use crawling policies and save the collected or downloaded data in an MS Access database table with several data fields.

Crawling result: The crawling result is presented as a table of rows and columns. The output of the crawler is shown as a snapshot.

Figure 3, Snapshot of the Crawled Result Database.

In this work I observed that when we crawled the website, the crawler downloaded all of its pages. When I crawled the same site a second time, the crawler again crawled all the pages, even though the site had updated only its dynamic pages and rarely its static pages. To reduce crawler traffic we propose using a dynamic web page to inform the web crawler about new pages and updates on the website.

In the experiment we use a website of 7 web pages. The website is deployed on ASP.NET using the C# language. The dynamic web page is coded in C#. The web crawler is coded in VB.NET. The LAST_VISIT parameter passed is the system time in milliseconds, as returned by C#; the millisecond times of page updates are maintained in the “Update” data structure.

First we crawl the website using the old approach, then we crawl it using the proposed approach. The results obtained by crawling the website with the old approach are shown in Table 1. To test the proposed approach we direct the web crawler to the dynamic web page dynamic.aspx, set the last visit time in the URL, and perform crawling.

Test 1: Update the time and URL of the pages index, branch and person in the “Update” data structure. At the web crawler, set the LAST_VISIT time before the time of the pages in Update. After crawling, the results obtained are shown in Table 2.

Test 2: Update the time and URL of the page about in the “Update” data structure. At the web crawler, set the LAST_VISIT time before the time of the pages in Update. After crawling, the results obtained are shown in Table 3.

Test 3: Update the time and URL of the pages service and query in the “Update” data structure. At the web crawler, set the LAST_VISIT time before the time of the pages in Update. After crawling, the results obtained are shown in Table 4.

Normal crawling is a time-consuming process because the crawler visits every web page to learn about all updated information on the website. In normal crawling it visits a total of 7 pages and takes 1385 milliseconds to visit the complete site. In the proposed approach the crawler visits only the dynamic update page and the updated web pages. The crawler takes about 500 milliseconds when there are three updates and about 450 milliseconds when there are two updates. With three updates on the experimental website, the proposed scheme is reported to be 4.83 times faster than the old approach. With two updates, the proposed scheme is reported to be 7.03 times faster than the old scheme. Graph 1 shows the time taken by the web crawler to download updates.

In normal crawling the crawler visits 7 pages to find updates, but the number of page visits is much smaller in the proposed approach. When there is one update the crawler visits only 2 pages, and when there are 2 updates it visits only 3 pages. If there are 3 updates on the website, the crawler visits 4 pages.

V. CONCLUSION

With this approach the crawler finds new updates on the web server using a dynamic web page. Using this crawler you can send queries with the requested URLs and can reduce crawler traffic over the Internet. It is found that approximately 40.1% of traffic is due to web crawlers, so using this method you can reduce web crawler traffic by 50% (that is, half of the web crawler traffic, or about 20% of the traffic over the Internet). As future work, crawler traffic could be reduced further using the page rank method and parameters such as a last-modified parameter. This parameter gives the date and time at which the fetched page was modified, and the crawler can use it to fetch fresh pages from websites.

In high-level terms, the MVC pattern means that an MVC application will be split into at least three pieces:

- Models, which contain or represent the data that users work with. These can be simple view models, which just represent data being transferred between views and controllers; or they can be domain models, which contain the data in a business domain as well as the operations, transformations, and rules for manipulating that data.
- Views, which are used to render some part of the model as a UI.
- Controllers, which process incoming requests, perform operations on the model, and select views to render to the user.

Models are the definition of the universe your application works in. In a banking application, for example, the model represents everything in the bank that the application supports, such as accounts, the general ledger, and credit limits for customers, as well as the operations that can be used to manipulate the data in the model, such as depositing funds and making withdrawals from the accounts. The model is also responsible for preserving the overall state and consistency of the data; for example, making sure that all transactions are added to the ledger, and that a client doesn't withdraw more money than he is entitl
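
The banking example above can be made concrete with a minimal domain-model sketch. It is an illustration only, not code from the source: the names `Account`, `Bank`, `InsufficientFunds`, the use of integer amounts, and the rule that a withdrawal may not exceed the balance plus the credit limit are all assumptions chosen to show a model that owns both the data and the rules that keep it consistent.

```python
from dataclasses import dataclass, field


class InsufficientFunds(Exception):
    """Raised when a withdrawal exceeds balance plus credit limit (assumed rule)."""


@dataclass
class Account:
    owner: str
    balance: int = 0        # hypothetical unit: whole currency units
    credit_limit: int = 0   # how far below zero the balance may go


@dataclass
class Bank:
    """Domain model: holds the data and the operations that manipulate it."""
    accounts: dict = field(default_factory=dict)
    ledger: list = field(default_factory=list)

    def open_account(self, owner, credit_limit=0):
        account = Account(owner, credit_limit=credit_limit)
        self.accounts[owner] = account
        return account

    def deposit(self, owner, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.accounts[owner].balance += amount
        # Every successful transaction is recorded in the ledger.
        self.ledger.append(("deposit", owner, amount))

    def withdraw(self, owner, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        account = self.accounts[owner]
        # The consistency rule lives in the model, not in a view or controller.
        if amount > account.balance + account.credit_limit:
            raise InsufficientFunds(owner)
        account.balance -= amount
        self.ledger.append(("withdraw", owner, amount))
```

In this sketch a controller would call `deposit` or `withdraw` and pick a view to render; it would never adjust `balance` directly, so the ledger and the withdrawal rule cannot be bypassed.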
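
Looking back at the crawler experiment in Section IV, the page-visit counts it reports can be reproduced with a small simulation. This is a sketch under stated assumptions, not the authors' VB.NET/C# implementation: the seventh page name `contact`, the timestamp values, and the function names are hypothetical, and no HTTP requests are made. The `update` dictionary stands in for the “Update” data structure, and `updated_since` stands in for what dynamic.aspx would return for a given LAST_VISIT value.

```python
# Six page names come from the tests in Section IV; "contact" is a hypothetical
# seventh page added only to reach the 7 pages of the experimental site.
SITE_PAGES = ["index", "branch", "person", "about", "service", "query", "contact"]


def updated_since(update, last_visit_ms):
    """Return the pages whose recorded update time is later than last_visit_ms."""
    return sorted(page for page, t in update.items() if t > last_visit_ms)


def full_crawl(pages):
    """Old approach: every page is visited to discover updates."""
    return list(pages)


def incremental_crawl(update, last_visit_ms, dynamic_page="dynamic.aspx"):
    """Proposed approach: visit the dynamic page, then only the updated pages."""
    visited = [dynamic_page]
    visited.extend(updated_since(update, last_visit_ms))
    return visited


# Hypothetical millisecond timestamps; LAST_VISIT is set before the update times.
last_visit = 1_000
test1 = {"index": 2_000, "branch": 2_000, "person": 2_000}  # three updates
test2 = {"about": 2_000}                                    # one update
test3 = {"service": 2_000, "query": 2_000}                  # two updates

for name, update in [("Test 1", test1), ("Test 2", test2), ("Test 3", test3)]:
    pages = incremental_crawl(update, last_visit)
    print(name, len(pages), "pages visited instead of", len(full_crawl(SITE_PAGES)))
```

The counts depend only on the number of updated pages plus one visit to the dynamic page, which is why the saving shrinks as more of the site changes between visits.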