
Introduction to Information Retrieval

CS276: Information Retrieval and Web Search
Text Classification 1
Chris Manning, Pandu Nayak and Prabhakar Raghavan

Prep work
This lecture presumes that you've seen the 124 Coursera lecture on Naïve Bayes, or equivalent
Will refer to NB without describing it
Ch. 13

Standing queries
The path from IR to text classification:
  You have an information need to monitor, say: Unrest in the Niger delta region
  You want to rerun an appropriate query periodically to find new news items on this topic
  You will be sent new documents that are found
  I.e., it's not ranking but classification (relevant vs. not relevant)
Such queries are called standing queries
  Long used by "information professionals"
  A modern mass instantiation is Google Alerts
Standing queries are (hand-written) text classifiers
Ch. 13


Spam filtering
Another text classification task

From: "" <takworlld@>
Subject: real estate is the only way... gem oalvgkay

Anyone can buy real estate with no money down
Stop paying rent TODAY!
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.
Change your life NOW!
=================================================
Click Below to order:
/sales/nmd.htm
=================================================
Ch. 13

Categorization/Classification
Given:
  A representation of a document d
    Issue: how to represent text documents.
    Usually some type of high-dimensional space – bag of words
  A fixed set of classes: C = {c1, c2, …, cJ}
Determine:
  The category of d: γ(d) ∈ C, where γ(d) is a classification function
We want to build classification functions ("classifiers").
Sec. 13.1

Document Classification (figure)
Classes: (AI) ML, Planning; (Programming) Semantics, Garb. Coll.; (HCI) Multimedia, GUI; ...
Training data (sample terms per class):
  ML: learning intelligence algorithm reinforcement network ...
  Planning: planning temporal reasoning plan language ...
  Semantics: programming semantics language proof ...
  Garb. Coll.: garbage collection memory optimization region ...
Test data: "planning language proof intelligence"
Sec. 13.1

Classification Methods (1)
Manual classification
  Used by the original Yahoo! Directory
  Looksmart, ODP, PubMed
  Accurate when job is done by experts
  Consistent when the problem size and team is small
  Difficult and expensive to scale
    Means we need automatic classification methods for big problems
Ch. 13

Classification Methods (2)
Hand-coded rule-based classifiers
  One technique used by news agencies, intelligence agencies, etc.
  Widely deployed in government and enterprise
  Vendors provide "IDE" for writing such rules
Ch. 13

Classification Methods (2)
Hand-coded rule-based classifiers
  Commercial systems have complex query languages
  Accuracy can be high if a rule has been carefully refined over time by a subject expert
  Building and maintaining these rules is expensive
Ch. 13

A Verity topic
A complex classification rule
Note:
  maintenance issues (author, etc.)
  Hand-weighting of terms
[Verity was bought by Autonomy, which was bought by HP ...]
Ch. 13

Classification Methods (3): Supervised learning
Given:
  A document d
  A fixed set of classes: C = {c1, c2, …, cJ}
  A training set D of documents each with a label in C
Determine:
  A learning method or algorithm which will enable us to learn a classifier γ
  For a test document d, we assign it the class γ(d) ∈ C
Sec. 13.1

Classification Methods (3)
Supervised learning
  Naive Bayes (simple, common) – see video
  k-Nearest Neighbors (simple, powerful)
  Support-vector machines (new, generally more powerful)
  … plus many other methods
No free lunch: requires hand-classified training data
But data can be built up (and refined) by amateurs
Many commercial systems use a mixture of methods
Ch. 13

The bag of words representation
γ(
  I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.
) = c

The bag of words representation
γ(
  great      2
  love       2
  recommend  1
  laugh      1
  happy      1
  ...        ...
) = c
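The reduction from raw text to term counts shown above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the function name `bag_of_words`, the regex tokenizer, and the shortened sample review are all assumptions made for the example.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token.

    The tokenizer (runs of letters and apostrophes) is an assumption;
    a real system would choose its own tokenization and normalization.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Shortened, hypothetical review used only for illustration.
review = "I love this movie! It's sweet. I would recommend it, and I love the dialogue."
bow = bag_of_words(review)

# Word order is discarded; only the counts survive.
print(bow.most_common(3))
```

Any classifier γ built on this representation sees only the counts, so two documents with the same words in a different order are indistinguishable to it.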

Features
Supervised learning classifiers can use any sort of feature
  URL, email address, punctuation, capitalization, dictionaries, network features
In the bag of words view of documents
  We use only word features
  we use all of the words in the text (not a subset)

Feature Selection: Why?
Text collections have a large number of features
  10,000 – 1,000,000 unique words … and more
Selection may make a particular classifier feasible
  Some classifiers can't deal with 1,000,000 features
Reduces training time
  Training time for some methods is quadratic or worse in the number of features
Makes runtime models smaller and faster
Can improve generalization (performance)
  Eliminates noise features
  Avoids overfitting
Sec. 13.5

Feature Selection: Frequency
The simplest feature selection method:
  Just use the commonest terms
  No particular foundation
  But it makes sense why this works
    They're the words that can be well-estimated and are most often available as evidence
  In practice, this is often 90% as good as better methods
Smarter feature selection – future lecture
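Frequency-based selection as described above might look like the following sketch. It assumes "commonest" means highest collection frequency (total occurrences across all documents); document frequency would be an equally plausible reading. The names `select_frequent_terms` and `project` and the toy documents are hypothetical.

```python
from collections import Counter

def select_frequent_terms(docs, k):
    """Return the k terms with the highest collection frequency.

    docs is a list of token lists. Counting total occurrences is an
    assumption; counting documents containing the term also fits the slide.
    """
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return [term for term, _ in counts.most_common(k)]

def project(tokens, vocabulary):
    """Drop every token that is not in the selected vocabulary."""
    keep = set(vocabulary)
    return [t for t in tokens if t in keep]

# Toy collection, for illustration only.
docs = [
    ["cheap", "viagra", "cheap"],
    ["meeting", "agenda", "cheap"],
    ["viagra", "offer"],
]
vocabulary = select_frequent_terms(docs, 2)
reduced = project(["cheap", "offer", "viagra"], vocabulary)
```

The classifier is then trained and applied on the projected documents only, which is what shrinks the model and the training time.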

Evaluating Categorization
Evaluation must be done on test data that are independent of the training data
  Sometimes use cross-validation (averaging results over multiple training and test splits of the overall data)
Easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set)
Sec. 13.6
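One way to realize the cross-validation idea above is k-fold splitting, sketched below. The slide does not prescribe a splitting scheme, so the interleaved folds, the function names, and the `train_and_score` callback are assumptions for illustration.

```python
def k_fold_splits(n_docs, k):
    """Yield (train_indices, test_indices) pairs for k disjoint test folds.

    Fold i takes every k-th document starting at offset i, so each
    document is used for testing exactly once.
    """
    indices = list(range(n_docs))
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

def cross_validate(n_docs, k, train_and_score):
    """Average a score over all folds.

    train_and_score(train, test) is a hypothetical callback that trains on
    the train indices and returns a score measured on the test indices.
    """
    scores = [train_and_score(train, test) for train, test in k_fold_splits(n_docs, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_splits(10, 5))
```

Because the test indices of a fold never appear in its training indices, the learner cannot score well simply by memorizing the documents it is later tested on.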

Evaluating Categorization
Measures: precision, recall, F1, classification accuracy
Classification accuracy: r/n where n is the total number of test docs and r is the number of test docs correctly classified
Sec. 13.6
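The measures named above can be computed directly from gold and predicted labels. The sketch below uses the standard definitions of precision, recall and F1 for a single positive class; the function names and the five-document example are made up for illustration.

```python
def accuracy(gold, predicted):
    """Classification accuracy r/n: correctly classified docs over all test docs."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical test set of five documents.
gold = ["spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham", "ham", "spam", "spam"]

acc = accuracy(gold, predicted)                      # r = 3, n = 5
p, r, f1 = precision_recall_f1(gold, predicted, "spam")
```

Here three of five documents are classified correctly, while for the spam class there are two true positives, one false positive and one false negative.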

WebKB Experiment (1998)
Classify webpages from CS departments into: student, faculty, course, project
Train on ~5,000 hand-labeled web pages
  Cornell, Washington, U. Texas, Wisconsin
Crawl and classify a new site (CMU) using Naïve Bayes
Results (figure)
Sec. 13.6


SpamAssassin
Naïve Bayes has found a home in spam filtering
  Paul Graham's A Plan for Spam
  Widely used in spam filters
But many features beyond words:
  black hole lists, etc.
  particular hand-crafted text patterns

SpamAssassin Features:
Basic (Naïve) Bayes spam probability
Mentions: Generic Viagra
Regex: millions of (dollar) ((dollar) NN,NNN,NNN.NN)
Phrase: impress ... girl
Phrase: 'Prestigious Non-Accredited Universities'
From: starts with many numbers
Subject is all capitals
HTML has a low ratio of text to image area
Relay in RBL, /enduserinfo_rbl.html
RCVD line looks faked
/tests_3_3_x.html

Naive Bayes is Not So Naive
Very fast learning and testing (basically just count words)
Low storage requirements
Very good in domains with many equally important features
More robust to irrelevant features than many learning methods
  Irrelevant features cancel each other without affecting results

Naive Bayes is Not So Naive
More robust to concept drift (changing class definition over time)
Naive Bayes won 1st and 2nd place in KDD-CUP 97 competition out of 16 systems
  Goal: Financial services industry direct mail response prediction: Predict if the recipient of mail will actually respond to the advertisement – 750,000 records.
A good dependable baseline for text classification (but not the best)!
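The lecture presumes Naive Bayes rather than describing it. As a reminder of why learning is "basically just count words", here is a minimal multinomial sketch. Add-one smoothing, the function names, and the toy training set are assumptions chosen for the example, not details taken from these slides.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate log priors and add-one-smoothed log term likelihoods by counting."""
    class_sizes = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_sizes.items()}
    log_likelihood = {}
    for c in class_sizes:
        total = sum(term_counts[c].values()) + len(vocab)  # add-one denominator
        log_likelihood[c] = {t: math.log((term_counts[c][t] + 1) / total) for t in vocab}
    return log_prior, log_likelihood

def classify_nb(model, tokens):
    """Pick the class with the highest log posterior; unseen terms are ignored."""
    log_prior, log_likelihood = model

    def score(c):
        return log_prior[c] + sum(
            log_likelihood[c][t] for t in tokens if t in log_likelihood[c]
        )

    return max(log_prior, key=score)

# Toy training set, for illustration only.
docs = [
    ["chinese", "beijing", "chinese"],
    ["chinese", "chinese", "shanghai"],
    ["chinese", "macao"],
    ["tokyo", "japan", "chinese"],
]
labels = ["china", "china", "china", "japan"]
model = train_nb(docs, labels)
label = classify_nb(model, ["chinese", "chinese", "chinese", "tokyo", "japan"])
```

Training touches each token once and stores only per-class counts, which is where the speed and low storage requirements come from.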

Classification Using Vector Spaces
In vector space classification, training set corresponds to a labeled set of points (equivalently, vectors)
Premise 1: Documents in the same class form a contiguous region of space
Premise 2: Documents from different classes don't overlap (much)
Learning a classifier: build surfaces to delineate classes in the space

Documents in a Vector Space
[Figure: documents as points in three class regions: Government, Science, Arts]
Sec. 14.1

Test Document of what class?
[Figure: a test document among the Government, Science and Arts regions]
Sec. 14.1

Test Document = Government
[Figure: the test document falls in the Government region]
Is this similarity hypothesis true in general?
Our focus: how to find good separators
Sec. 14.1

Definition of centroid
$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$
Where D_c is the set of all documents that belong to class c and v(d) is the vector space representation of d.
Note that centroid will in general not be a unit vector even when the inputs are unit vectors.
Sec. 14.2

Rocchio classification
Rocchio forms a simple representative for each class: the centroid/prototype
Classification: nearest prototype/centroid
It does not guarantee that classifications are consistent with the given training data
Sec. 14.2

Rocchio classification
Little used outside text classification
  It has been used quite effectively for text classification
  But in general worse than Naïve Bayes
Again, cheap to train and test documents
Sec. 14.2

k Nearest Neighbor Classification
kNN = k Nearest Neighbor
To classify a document d:
  Define k-neighborhood as the k nearest neighbors of d
  Pick the majority class label in the k-neighborhood
Sec. 14.3

Example: k=6 (6NN)
[Figure: test document among Government, Science and Arts points]
P(science | test document)?
Sec. 14.3

Nearest-Neighbor Learning
Learning: just store the labeled training examples D
Testing instance x (under 1NN):
  Compute similarity between x and all examples in D.
  Assign x the category of the most similar example in D.
Does not compute anything beyond storing the examples
Also called:
  Case-based learning
  Memory-based learning
  Lazy learning
Rationale of kNN: contiguity hypothesis
Sec. 14.3

k Nearest Neighbor
Using only the closest example (1NN) is subject to errors due to:
  A single atypical example.
  Noise (i.e., an error) in the category label of a single training example.
More robust: find the k examples and return the majority category of these k
k is typically odd to avoid ties; 3 and 5 are most common
Sec. 14.3

kNN decision boundaries
[Figure: Government, Science and Arts regions with their boundaries]
Boundaries are in principle arbitrary surfaces – but usually polyhedra
kNN gives locally defined decision boundaries between classes – far away points do not influence each classification decision (unlike in Naïve Bayes, Rocchio, etc.)
Sec. 14.3

Illustration of 3 Nearest Neighbor for Text Vector Space
Sec. 14.3

3 Nearest Neighbor vs. Rocchio
Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB.

kNN: Discussion
No feature selection necessary
No training necessary
Scales well with large number of classes
  Don't need to train n classifiers for n classes
Classes can influence each other
  Small changes to one class can have ripple effect
May be expensive at test time
In most cases it's more accurate than NB or Rocchio
Sec. 14.3

Let's test our intuition
Can a bag of words always be viewed as a vector space?
What about a bag of features?
Can we always view a standing query as a region in a vector space?
What about Boolean queries on terms?
What do "rectangles" equate to?

Bias vs. capacity – notions and terminology
Consider asking a botanist: Is an object a tree?
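Returning to the two vector space classifiers of Sec. 14.2 and 14.3, the sketch below puts Rocchio (nearest centroid) and kNN (majority vote among the k most similar training documents) side by side. It assumes sparse vectors stored as dicts and cosine similarity as the closeness measure; the slides do not fix either choice, and the function names and toy vectors are hypothetical.

```python
import math
from collections import Counter, defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def class_centroids(vectors, labels):
    """Average the vectors of each class: mu(c) = (1/|D_c|) * sum of v(d)."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter(labels)
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            sums[label][term] += weight
    return {c: {t: w / sizes[c] for t, w in terms.items()} for c, terms in sums.items()}

def rocchio_classify(centroids, x):
    """Assign x to the class whose centroid is most similar to it."""
    return max(centroids, key=lambda c: cosine(centroids[c], x))

def knn_classify(vectors, labels, x, k=3):
    """Majority class among the k training vectors most similar to x."""
    ranked = sorted(zip(vectors, labels), key=lambda pair: cosine(pair[0], x), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set, for illustration only.
train = [
    {"election": 1.0, "senate": 1.0},
    {"election": 1.0, "vote": 1.0},
    {"genome": 1.0, "physics": 1.0},
    {"physics": 1.0, "quantum": 1.0},
    {"painting": 1.0, "gallery": 1.0},
]
labels = ["government", "government", "science", "science", "arts"]
x = {"physics": 1.0, "genome": 1.0, "vote": 1.0}

centroids = class_centroids(train, labels)
rocchio_label = rocchio_classify(centroids, x)
knn_label = knn_classify(train, labels, x, k=3)
```

Rocchio does all its work at training time and compares the test document against one vector per class, whereas kNN stores the training set untouched and compares against every example at test time, which is the cost trade-off the discussion slide points out.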
