Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Text Classification 1
Chris Manning, Pandu Nayak and Prabhakar Raghavan

Prep work (Ch. 13)
- This lecture presumes that you’ve seen the 124 Coursera lecture on Naïve Bayes, or equivalent
- Will refer to NB without describing it

Standing queries (Ch. 13)
- The path from IR to text classification:
  - You have an information need to monitor, say: unrest in the Niger delta region
  - You want to rerun an appropriate query periodically to find new news items on this topic
  - You will be sent new documents that are found
  - I.e., it’s not ranking but classification (relevant vs. not relevant)
- Such queries are called standing queries
  - Long used by “information professionals”
  - A modern mass instantiation is Google Alerts
- Standing queries are (hand-written) text classifiers

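To make the last point concrete, here is a minimal sketch of a standing query treated as a hand-written binary classifier. The rule, the term list `UNREST_TERMS`, and the sample headlines are all illustrative assumptions, not part of the lecture.

```python
# Hypothetical hand-written rule for the "unrest in the Niger delta" need.
UNREST_TERMS = {"unrest", "protest", "violence", "attack"}  # assumed term list


def standing_query(doc: str) -> bool:
    """Return True (relevant) if the text mentions Niger, delta and an unrest term."""
    words = set(doc.lower().split())
    return {"niger", "delta"} <= words and bool(words & UNREST_TERMS)


def filter_new_documents(docs):
    """Rerun the standing query over newly arrived documents; no ranking involved."""
    return [d for d in docs if standing_query(d)]


new_items = [
    "Unrest spreads in the Niger Delta region",
    "Niger Delta oil output rises",
    "Stock markets rally in Asia",
]
alerts = filter_new_documents(new_items)
```

Each incoming item is simply accepted or rejected, which is what makes this classification rather than ranking.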
Spam filtering: Another text classification task (Ch. 13)

From: "" <takworlld@>
Subject: real estate is the only way... gemoalvgkay

Anyone can buy real estate with no money down
Stop paying rent TODAY!
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.
Change your life NOW!
=================================================
Click Below to order:
/sales/nmd.htm
=================================================

Categorization/Classification (Sec. 13.1)
- Given:
  - A representation of a document d
    - Issue: how to represent text documents.
    - Usually some type of high-dimensional space – bag of words
  - A fixed set of classes: C = {c1, c2, …, cJ}
- Determine:
  - The category of d: γ(d) ∈ C, where γ(d) is a classification function
- We want to build classification functions (“classifiers”).

Document Classification (Sec. 13.1)
[Figure: document classification example]
- Classes: (AI) ML, Planning; (Programming) Semantics, Garb. Coll.; (HCI) Multimedia, GUI; ...
- Training data:
  - ML: learning intelligence algorithm reinforcement network ...
  - Planning: planning temporal reasoning plan language ...
  - Semantics: programming semantics language proof ...
  - Garb. Coll.: garbage collection memory optimization region ...
- Test data: “planning language proof intelligence”

Classification Methods (1) (Ch. 13)
- Manual classification
  - Used by the original Yahoo! Directory
  - Looksmart, ODP, PubMed
  - Accurate when job is done by experts
  - Consistent when the problem size and team are small
  - Difficult and expensive to scale
    - Means we need automatic classification methods for big problems

Classification Methods (2) (Ch. 13)
- Hand-coded rule-based classifiers
  - One technique used by news agencies, intelligence agencies, etc.
  - Widely deployed in government and enterprise
  - Vendors provide “IDE” for writing such rules

Classification Methods (2) (Ch. 13)
- Hand-coded rule-based classifiers
  - Commercial systems have complex query languages
  - Accuracy can be high if a rule has been carefully refined over time by a subject expert
  - Building and maintaining these rules is expensive

A Verity topic: A complex classification rule (Ch. 13)
- Note:
  - maintenance issues (author, etc.)
  - Hand-weighting of terms
- [Verity was bought by Autonomy, which was bought by HP ...]

Classification Methods (3): Supervised learning (Sec. 13.1)
- Given:
  - A document d
  - A fixed set of classes: C = {c1, c2, …, cJ}
  - A training set D of documents each with a label in C
- Determine:
  - A learning method or algorithm which will enable us to learn a classifier γ
  - For a test document d, we assign it the class γ(d) ∈ C

Classification Methods (3) (Ch. 13)
- Supervised learning
  - Naive Bayes (simple, common) – see video
  - k-Nearest Neighbors (simple, powerful)
  - Support-vector machines (new, generally more powerful)
  - … plus many other methods
  - No free lunch: requires hand-classified training data
  - But data can be built up (and refined) by amateurs
- Many commercial systems use a mixture of methods

The bag of words representation

γ( "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet." ) = c

The bag of words representation

γ( bag of word counts ) = c

  great      2
  love       2
  recommend  1
  laugh      1
  happy      1
  ...        ...

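A minimal sketch of how a bag of words can be built from raw text follows. The tokenizer (lowercasing plus a simple regular expression) is an assumption made for illustration; the counts on the slide come from the full review and some normalization that this sketch does not attempt to reproduce.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Map a document to term counts, discarding word order and position."""
    # Assumed tokenizer: lowercase, keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


snippet = ("I love this movie! It's sweet, but with satirical humor. "
           "The dialogue is great and the adventure scenes are fun.")
bag = bag_of_words(snippet)

# Word order is lost: two texts with the same words give the same bag.
same = bag_of_words("great movie") == bag_of_words("movie great")
```

A classifier γ then sees only `bag`, not the original sentence structure.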
Features
- Supervised learning classifiers can use any sort of feature
  - URL, email address, punctuation, capitalization, dictionaries, network features
- In the bag of words view of documents
  - We use only word features
  - We use all of the words in the text (not a subset)

Feature Selection: Why? (Sec. 13.5)
- Text collections have a large number of features
  - 10,000 – 1,000,000 unique words … and more
- Selection may make a particular classifier feasible
  - Some classifiers can’t deal with 1,000,000 features
- Reduces training time
  - Training time for some methods is quadratic or worse in the number of features
- Makes runtime models smaller and faster
- Can improve generalization (performance)
  - Eliminates noise features
  - Avoids overfitting

Feature Selection: Frequency
- The simplest feature selection method:
  - Just use the commonest terms
  - No particular foundation
  - But it makes sense why this works
    - They’re the words that can be well-estimated and are most often available as evidence
- In practice, this is often 90% as good as better methods
- Smarter feature selection – future lecture

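Frequency-based selection can be sketched in a few lines. The version below counts total occurrences across the collection (collection frequency); counting the number of documents containing each term would be an equally plausible reading of "commonest". The function names and toy documents are illustrative assumptions.

```python
from collections import Counter


def select_frequent_terms(tokenized_docs, k):
    """Return the k terms with the highest collection frequency."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(tokens)
    return {term for term, _ in counts.most_common(k)}


def project(tokens, vocabulary):
    """Drop every token that is not a selected feature."""
    return [t for t in tokens if t in vocabulary]


docs = [
    ["buy", "cheap", "pills", "now"],
    ["cheap", "pills", "cheap"],
    ["meeting", "now", "cheap"],
]
vocabulary = select_frequent_terms(docs, k=3)
reduced = project(["buy", "cheap", "now"], vocabulary)
```

Any classifier trained afterwards sees only the retained terms, which shrinks the model and the training cost.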
Evaluating Categorization (Sec. 13.6)
- Evaluation must be done on test data that are independent of the training data
  - Sometimes use cross-validation (averaging results over multiple training and test splits of the overall data)
- Easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set)

Evaluating Categorization (Sec. 13.6)
- Measures: precision, recall, F1, classification accuracy
- Classification accuracy: r/n, where n is the total number of test docs and r is the number of test docs correctly classified

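The measures named above can be computed directly from gold and predicted labels. This is a small sketch with made-up labels; precision, recall and F1 are computed for a single target class, which is one common convention and an assumption here.

```python
def accuracy(gold, predicted):
    """Classification accuracy r/n: correctly classified docs over all test docs."""
    r = sum(g == p for g, p in zip(gold, predicted))
    return r / len(gold)


def precision_recall_f1(gold, predicted, target):
    """Precision, recall and F1 for one target class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == target and p == target for g, p in pairs)
    fp = sum(g != target and p == target for g, p in pairs)
    fn = sum(g == target and p != target for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Illustrative labels only; not results from any experiment in the lecture.
gold = ["spam", "spam", "ham", "ham", "spam", "ham"]
pred = ["spam", "ham", "ham", "spam", "spam", "ham"]
acc = accuracy(gold, pred)
p, r, f1 = precision_recall_f1(gold, pred, target="spam")
```

The `gold` labels must come from test documents that the learner never saw during training, otherwise these numbers are optimistic.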
WebKB Experiment (1998) (Sec. 13.6)
- Classify webpages from CS departments into:
  - student, faculty, course, project
- Train on ~5,000 hand-labeled web pages
  - Cornell, Washington, U. Texas, Wisconsin
- Crawl and classify a new site (CMU) using Naïve Bayes
- Results: [figure not reproduced]

SpamAssassin
- Naïve Bayes has found a home in spam filtering
  - Paul Graham’s A Plan for Spam
  - Widely used in spam filters
- But many features beyond words:
  - black hole lists, etc.
  - particular hand-crafted text patterns

SpamAssassin Features:
- Basic (Naïve) Bayes spam probability
- Mentions: Generic Viagra
- Regex: millions of (dollar) ((dollar) NN,NNN,NNN.NN)
- Phrase: impress ... girl
- Phrase: ‘Prestigious Non-Accredited Universities’
- From: starts with many numbers
- Subject is all capitals
- HTML has a low ratio of text to image area
- Relay in RBL, /enduserinfo_rbl.html
- RCVD line looks faked
- /tests_3_3_x.html

Naive Bayes is Not So Naive
- Very fast learning and testing (basically just count words)
- Low storage requirements
- Very good in domains with many equally important features
- More robust to irrelevant features than many learning methods
  - Irrelevant features cancel each other without affecting results

Naive Bayes is Not So Naive
- More robust to concept drift (changing class definition over time)
- Naive Bayes won 1st and 2nd place in KDD-CUP 97 competition out of 16 systems
  - Goal: Financial services industry direct mail response prediction: Predict if the recipient of mail will actually respond to the advertisement – 750,000 records.
- A good dependable baseline for text classification (but not the best)!

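The lecture refers to Naive Bayes without describing it. As a reminder of why learning is "basically just count words", here is a minimal sketch assuming the multinomial model with add-one smoothing and log-space scoring. The function names and the four toy training documents are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Training is counting: documents per class and term occurrences per class."""
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, term_counts, vocabulary


def classify_nb(tokens, model):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_doc_counts, term_counts, vocabulary = model
    n_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_c in class_doc_counts.items():
        score = math.log(n_c / n_docs)
        total = sum(term_counts[label].values())
        for t in tokens:
            if t in vocabulary:  # terms unseen in training are skipped
                # Add-one smoothing (an assumption of this sketch).
                score += math.log((term_counts[label][t] + 1)
                                  / (total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("chinese beijing chinese".split(), "china"),
    ("chinese chinese shanghai".split(), "china"),
    ("chinese macao".split(), "china"),
    ("tokyo japan chinese".split(), "japan"),
]
model = train_nb(training)
label = classify_nb("chinese chinese chinese tokyo japan".split(), model)
```

The stored model is just a few count tables, which is also why the storage requirements are low.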
Classification Using Vector Spaces
- In vector space classification, training set corresponds to a labeled set of points (equivalently, vectors)
- Premise 1: Documents in the same class form a contiguous region of space
- Premise 2: Documents from different classes don’t overlap (much)
- Learning a classifier: build surfaces to delineate classes in the space

Documents in a Vector Space (Sec. 14.1)
[Figure: documents as points in three labeled regions: Government, Science, Arts]

Test Document of what class? (Sec. 14.1)
[Figure: an unlabeled test document among the Government, Science and Arts regions]

Test Document = Government (Sec. 14.1)
[Figure: the test document falls in the Government region]
- Is this similarity hypothesis true in general?
- Our focus: how to find good separators

Definition of centroid (Sec. 14.2)

  \vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)

- Where D_c is the set of all documents that belong to class c and v(d) is the vector space representation of d.
- Note that centroid will in general not be a unit vector even when the inputs are unit vectors.

Rocchio classification (Sec. 14.2)
- Rocchio forms a simple representative for each class: the centroid/prototype
- Classification: nearest prototype/centroid
- It does not guarantee that classifications are consistent with the given training data

Rocchio classification (Sec. 14.2)
- Little used outside text classification
  - It has been used quite effectively for text classification
  - But in general worse than Naïve Bayes
- Again, cheap to train and test documents

k Nearest Neighbor Classification (Sec. 14.3)
- kNN = k Nearest Neighbor
- To classify a document d:
  - Define k-neighborhood as the k nearest neighbors of d
  - Pick the majority class label in the k-neighborhood

Example: k=6 (6NN) (Sec. 14.3)
[Figure: a test document and its 6 nearest neighbors among the Government, Science and Arts regions]
- P(science | test document)?

Nearest-Neighbor Learning (Sec. 14.3)
- Learning: just store the labeled training examples D
- Testing instance x (under 1NN):
  - Compute similarity between x and all examples in D.
  - Assign x the category of the most similar example in D.
- Does not compute anything beyond storing the examples
- Also called:
  - Case-based learning
  - Memory-based learning
  - Lazy learning
- Rationale of kNN: contiguity hypothesis

k Nearest Neighbor (Sec. 14.3)
- Using only the closest example (1NN) is subject to errors due to:
  - A single atypical example.
  - Noise (i.e., an error) in the category label of a single training example.
- More robust: find the k examples and return the majority category of these k
- k is typically odd to avoid ties; 3 and 5 are most common

kNN decision boundaries (Sec. 14.3)
[Figure: decision boundaries between the Government, Science and Arts regions]
- Boundaries are in principle arbitrary surfaces – but usually polyhedra
- kNN gives locally defined decision boundaries between classes – far away points do not influence each classification decision (unlike in Naïve Bayes, Rocchio, etc.)

Illustration of 3 Nearest Neighbor for Text Vector Space (Sec. 14.3)
[Figure]

3 Nearest Neighbor vs. Rocchio
- Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB.

kNN: Discussion (Sec. 14.3)
- No feature selection necessary
- No training necessary
- Scales well with large number of classes
  - Don’t need to train n classifiers for n classes
- Classes can influence each other
  - Small changes to one class can have ripple effect
- May be expensive at test time
- In most cases it’s more accurate than NB or Rocchio

Let’s test our intuition
- Can a bag of words always be viewed as a vector space?
- What about a bag of features?
- Can we always view a standing query as a region in a vector space?
- What about Boolean queries on terms?
- What do “rectangles” equate to?

Bias vs. capacity – notions and terminology
- Consider asking a botanist: Is an object a tree?

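Looking back at Sec. 14.2 and 14.3, the Rocchio (nearest centroid) and kNN classifiers can be sketched side by side. This is an illustration under stated assumptions: Euclidean distance is used for both, and the 2-D points stand in for document vectors. Class "A" is deliberately polymorphic (two separate clusters), so that its centroid lands far from either cluster.

```python
import math
from collections import Counter, defaultdict


def centroids(training):
    """Rocchio training: average the vectors of each class."""
    groups = defaultdict(list)
    for vec, label in training:
        groups[label].append(vec)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in groups.items()}


def rocchio_classify(x, cents):
    """Assign x to the class whose centroid is nearest."""
    return min(cents, key=lambda label: math.dist(x, cents[label]))


def knn_classify(x, training, k=3):
    """Majority label among the k training examples nearest to x."""
    neighbors = sorted(training, key=lambda ex: math.dist(x, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy data (assumed): class A has two clusters, class B sits between them.
training = [
    ((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
    ((10, 10), "A"), ((10, 11), "A"), ((11, 10), "A"),
    ((5, 4), "B"), ((6, 4), "B"), ((5, 3), "B"),
]
cents = centroids(training)
x = (0.5, 0.5)  # lies inside the first A cluster

rocchio_label = rocchio_classify(x, cents)   # centroid of A is far from x
knn_label = knn_classify(x, training, k=3)   # the 3 nearest points are all A
```

On this data Rocchio assigns `x` to "B" while 3NN assigns it to "A", which is the polymorphic-category behaviour noted on the "3 Nearest Neighbor vs. Rocchio" slide. Note that kNN does no work at training time but scans all stored examples at test time.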