Lecture 8: Classification and Prediction
School of Software, Xiamen University

Outline
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Rule-based classification (self-study)
- Classification by backpropagation
- Support Vector Machines (SVM) (self-study)
- Associative classification (self-study)
- Lazy learners (or learning from your neighbors) (self-study)
- Other classification methods (self-study)
- Prediction
- Accuracy and error measures (self-study)
- Ensemble methods (self-study)
- Model selection (self-study)
- Summary

Classification vs. Prediction

Classification:
- Predicts categorical class labels (discrete or nominal values).
- Builds a model from the training data and the values of the class-label attribute, then uses the model to classify new data.

Prediction:
- Models continuous-valued functions, e.g., predicts unknown or missing values.

Typical applications: credit approval, target marketing, medical diagnosis.

Classification: A Two-Step Process

1. Model construction: describing a predetermined set of data classes or concepts.
- Each tuple/sample is assumed to belong to a predefined class, as determined by an attribute called the class label attribute.
- The set of tuples used for model construction is the training set.
- The model is represented as classification rules, a decision tree, or mathematical formulae.

2. Model usage: classifying future or unknown objects.
- Estimate the accuracy of the model: the known label of each test sample is compared with the classification result produced by the model; accuracy is the percentage of test samples correctly classified by the model.
- The test set must be independent of the training set; otherwise the learned model tends to over-fit the data.
- If the accuracy is acceptable, use the model to classify data whose class labels are unknown.

Classification Process (1): Model Construction
A classification algorithm is run over the training data to produce a classifier (model), e.g.:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction
The classifier is evaluated on testing data and then applied to unseen data, e.g., (Jeff, Professor, 4) → Tenured?

Decision Tree Induction: Basic Algorithm
A greedy algorithm: the tree is constructed in a top-down, recursive, divide-and-conquer manner.
- The tree starts as a single node representing all the training samples.
- If the samples are all of the same class, the node becomes a leaf labeled with that class.
- Otherwise, select the attribute that best separates the samples into classes (the test attribute, which must be discrete-valued).
- Create a branch for each known value of the test attribute, and partition the samples accordingly.
- Recurse to grow a decision tree over the samples in each partition.

Recursive Partitioning: Stopping Conditions
Partitioning stops as soon as one of the following holds:
- All samples at a given node belong to the same class.
- No attributes remain for further partitioning (majority voting is employed for labeling the leaf).
- No samples remain.

Example tree: "age?" branches to "<=30" (then "student?": no → no, yes → yes), "31..40" (→ yes), and ">40" (then "credit rating?": excellent → yes, fair → no).
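To make the loop above concrete, here is a minimal Python sketch of top-down induction. It is only a sketch: the tuple encoding (dicts with a "class" key) is an assumption for the example, and choose_attribute is a placeholder for the selection measures introduced next.

```python
from collections import Counter

def choose_attribute(rows, attributes):
    """Placeholder selection measure; ID3/C4.5 would pick the attribute
    with the highest information gain (see the next section)."""
    return attributes[0]

def induce_tree(rows, attributes):
    """Top-down, recursive, divide-and-conquer decision tree induction."""
    classes = [r["class"] for r in rows]
    # Stop: all samples at this node belong to the same class -> leaf.
    if len(set(classes)) == 1:
        return classes[0]
    # Stop: no attributes left -> leaf labeled by majority vote.
    if not attributes:
        return Counter(classes).most_common(1)[0][0]
    best = choose_attribute(rows, attributes)
    remaining = [a for a in attributes if a != best]
    node = {"test": best, "branches": {}}
    # One branch per known value of the test attribute; recurse on each partition.
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        node["branches"][value] = induce_tree(subset, remaining)
    return node
```

The "no samples remain" stop is implicit here, since branches are only created for attribute values that actually occur in the partition.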
Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain.
- Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|.
- Expected information (entropy) needed to classify a tuple in D:
  Info(D) = -Σi pi log2(pi)
- Information needed (after using A to split D into v partitions) to classify D:
  InfoA(D) = Σj (|Dj|/|D|) × Info(Dj)
- Information gained by branching on attribute A:
  Gain(A) = Info(D) - InfoA(D)

Attribute Selection: Information Gain
- Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples).
  Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
- "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence
  Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
  Gain(age) = Info(D) - Infoage(D) = 0.246
- Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048, so age is selected as the splitting attribute.
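A small sketch, assuming the per-partition class counts from the example above, that reproduces Info(D) = 0.940, Infoage(D) = 0.694, and Gain(age) = 0.246:

```python
from math import log2

def info(counts):
    """Entropy I(c1, ..., cm) of a node with the given class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Class distribution of D: 9 "yes", 5 "no".
info_d = info([9, 5])                                      # 0.940

# Partitions induced by age: <=30 -> (2 yes, 3 no), 31..40 -> (4, 0), >40 -> (3, 2).
partitions = [(2, 3), (4, 0), (3, 2)]
n = sum(sum(p) for p in partitions)                        # 14
info_age = sum(sum(p) / n * info(p) for p in partitions)   # 0.694

gain_age = info_d - info_age                               # 0.246
print(round(info_d, 3), round(info_age, 3), round(gain_age, 3))
```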
Computing Information Gain for Continuous-Valued Attributes
- Let attribute A be a continuous-valued attribute.
- Must determine the best split point for A:
  - Sort the values of A in increasing order.
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (ai + ai+1)/2 is the midpoint between the values of ai and ai+1.
  - The point with the minimum expected information requirement for A is selected as the split point for A.
- Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point.

Gain Ratio for Attribute Selection (C4.5)
- The information gain measure is biased toward tests with many outcomes.
- C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
  SplitInfoA(D) = -Σj (|Dj|/|D|) log2(|Dj|/|D|)
  GainRatio(A) = Gain(A) / SplitInfoA(D)
- Ex.: gain_ratio(income) = 0.029 / 0.926 = 0.031
- The attribute with the maximum gain ratio is selected as the splitting attribute.

Gini Index (CART, IBM IntelligentMiner)
- If a data set D contains examples from n classes, the gini index gini(D) is defined as
  gini(D) = 1 - Σj pj^2
  where pj is the relative frequency of class j in D.
- If a data set D is split on A into two subsets D1 and D2, the gini index of the split is defined as
  giniA(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
- Reduction in impurity: Δgini(A) = gini(D) - giniA(D)
- The attribute that provides the smallest ginisplit(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).

Gini Index: Example
- D has 9 tuples in buys_computer = "yes" and 5 in "no":
  gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
- Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}; the gini index of this split is computed as above, but gini{medium,high} is 0.30 and is thus the best split on income since it is the lowest.
- All attributes are assumed continuous-valued here; other tools, e.g., clustering, may be needed to get the possible split values. The method can be modified for categorical attributes.

Comparing Attribute Selection Measures
The three measures, in general, return good results, but:
- Information gain: biased towards multivalued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
- Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions with purity in both partitions.

Overfitting and Tree Pruning
- Overfitting: an induced tree may overfit the training data — too many branches, some reflecting anomalies due to noise or outliers, giving poor accuracy on unseen samples.
- Two approaches to avoid overfitting:
  - Prepruning: halt tree construction early; do not split a node if the goodness of the split falls below a predefined threshold. It is difficult to choose an appropriate threshold.
  - Postpruning: remove subtrees from a "fully grown" tree (replace a subtree with a leaf). Use a set of data different from the training data to decide which is the "best pruned tree".

Enhancements to the Basic Decision Tree Algorithm
- Allow continuous-valued attributes: dynamically define new discrete-valued attributes by partitioning the continuous attribute values into a discrete set of intervals.
- Handle missing attribute values: assign the most common value of the attribute, or assign the most probable value.
- Attribute construction: create new attributes based on existing ones, changing the original representation of the given attributes — a form of data transformation. This helps address the fragmentation, repetition, and replication problems that decision tree induction may face.

Classification in Large Databases
- Classification is an extension of a classical problem extensively studied by statisticians and machine learning researchers.
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed.
- Why decision tree induction in data mining?
  - Relatively fast learning speed (compared with other classification methods).
  - Simple and easy-to-understand classification rules.
  - SQL queries can be used for accessing the database.
  - Comparable classification accuracy with other methods.

Scalable Decision Tree Induction Methods
- SLIQ (EDBT'96 — Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory.
- SPRINT (VLDB'96 — J. Shafer et al.): constructs an attribute-list data structure.
- PUBLIC (VLDB'98 — Rastogi & Shim): integrates tree splitting and tree pruning: stops growing the tree earlier.
- RainForest (VLDB'98 — Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label).

Bayesian Classification
- Bayesian classifiers are statistical classifiers, based on Bayes' theorem.
- The na?ve Bayesian classifier assumes that the effect of an attribute value on a given class is independent of the values of the other attributes (class conditional independence).
- Bayesian belief networks can represent dependencies among subsets of attributes, and can also be used for classification.

Bayes' Theorem: Basics
- X is a data sample (data tuple) whose class label is unknown, e.g., a 35-year-old customer with an income of $4000.
- H is some hypothesis that the data sample X belongs to a specified class C, e.g., the hypothesis that the customer will buy a computer.
- P(H|X): the posterior probability of H conditioned on X, e.g., the probability that customer X will buy a computer given the customer's age and income.
- P(H): the prior probability of H, i.e., the initial probability before we observe any sample; it reflects background knowledge, e.g., the probability that any given customer will buy a computer.
- P(X): the probability that the sample data is observed, e.g., the probability that a customer is 35 years old with an income of $4000.
- P(X|H): the posterior probability of X conditioned on H, e.g., the probability that a customer is 35 years old with an income of $4000, given that the customer buys a computer.
- Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)

Towards the Na?ve Bayesian Classifier
- Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn).
- Suppose there are m classes C1, C2, ..., Cm.
- Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X).
- This can be derived from Bayes' theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X).
- Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.

Derivation of the Na?ve Bayes Classifier
- A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
  P(X|Ci) = Πk P(xk|Ci)
- This greatly reduces the computation cost: only the class distribution needs to be counted.
- If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D).
- If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:
  g(x, μ, σ) = (1 / (√(2π) σ)) e^(-(x-μ)^2 / (2σ^2)),  and P(xk|Ci) = g(xk, μCi, σCi).

Na?ve Bayesian Classifier: Training Dataset
- Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'.
- Data sample to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair).
Na?ve Bayesian Classifier: An Example
- P(Ci): P(buys_computer = "yes") = 9/14 = 0.643; P(buys_computer = "no") = 5/14 = 0.357
- Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
- For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
  P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
- P(X|Ci) × P(Ci):
  P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
- Therefore, X belongs to class buys_computer = "yes".
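The same computation as a small sketch; the priors and conditional probabilities are hard-coded from the example above (a real implementation would count them from the training set):

```python
# Class priors and per-attribute conditional probabilities from the example.
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5,
            "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

def classify(features):
    """Pick the class maximizing P(X|Ci) * P(Ci) under the
    class-conditional independence assumption."""
    scores = {}
    for c, prior in priors.items():
        p = prior
        for f in features:
            p *= cond[c][f]
        scores[c] = p
    return max(scores, key=scores.get), scores

label, scores = classify(["age<=30", "income=medium", "student=yes", "credit=fair"])
print(label, scores)   # 'yes', scores ~ {'yes': 0.028, 'no': 0.007}
```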
Avoiding the 0-Probability Problem
- Na?ve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability will be zero.
- Ex.: suppose a data set with 1000 tuples: income = low (0), income = medium (990), income = high (10).
- Use the Laplacian correction (Laplacian estimator): add 1 to each case:
  Prob(income = low) = 1/1003
  Prob(income = medium) = 991/1003
  Prob(income = high) = 11/1003
- The "corrected" probability estimates are close to their "uncorrected" counterparts.
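A one-function sketch of the Laplacian correction, reproducing the 1/1003, 991/1003, and 11/1003 estimates above:

```python
def laplace_probs(counts, k=1):
    """Laplacian-corrected estimates: add k to each count and renormalize,
    so no conditional probability is exactly zero."""
    total = sum(counts.values()) + k * len(counts)
    return {v: (c + k) / total for v, c in counts.items()}

income_counts = {"low": 0, "medium": 990, "high": 10}
print(laplace_probs(income_counts))
# {'low': ~0.001, 'medium': ~0.988, 'high': ~0.011}
```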
Na?ve Bayesian Classifier: Comments
- Advantages: easy to implement; good results obtained in most of the cases.
- Disadvantages: the class conditional independence assumption causes a loss of accuracy, because in practice dependencies exist among variables.
- How to deal with these dependencies? Bayesian belief networks.

Bayesian Belief Networks
- A Bayesian belief network allows class conditional independencies to be defined between subsets of variables.
- It provides a graphical model of causal relationships, represents dependencies among variables, and specifies a joint conditional probability distribution.
- Nodes: random variables (discrete or continuous). Arcs: probabilistic dependence. The network is a directed acyclic graph.
- Example: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P.

Bayesian Belief Networks: An Example
- Variables: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea.
- For each variable Y, the belief network has a conditional probability table (CPT) giving P(Y | Parents(Y)); e.g., the CPT for LungCancer lists P(LC) and P(~LC) for each parent combination (FH, S), (FH, ~S), (~FH, S), (~FH, ~S).

Training Bayesian Belief Networks
Several scenarios are possible:
- Network structure known and all variables observable: compute only the CPT entries.
- Network structure known but some variables hidden: train the belief network by gradient descent, analogous to neural network learning.
- Network structure unknown and all variables observable: search the model space to reconstruct the graph topology.
- Network structure unknown and all variables hidden: no good algorithms are known yet.

Using IF-THEN Rules for Classification
- Represent knowledge in the form of IF-THEN rules, e.g.:
  R: IF age = youth AND student = yes THEN buys_computer = yes
  (rule antecedent/precondition vs. rule consequent)
- Assessment of a rule: coverage and accuracy (see the sketch below).
  ncovers = number of tuples covered by R (the rule antecedent is satisfied)
  ncorrect = number of tuples correctly classified by R
  coverage(R) = ncovers / |D|    /* D: training data set */
  accuracy(R) = ncorrect / ncovers
- If more than one rule is triggered by a tuple, conflict resolution is needed:
  - Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests).
  - Class-based ordering: classes are sorted in decreasing order of importance.
  - Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts.
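A minimal sketch of the coverage and accuracy measures; the rule encoding (a predicate plus a predicted label) and the three toy tuples are assumptions made for illustration:

```python
def rule_stats(rule, data):
    """coverage(R) = ncovers/|D|, accuracy(R) = ncorrect/ncovers for an
    IF-THEN rule given as (antecedent predicate, predicted label)."""
    antecedent, label = rule
    covered = [t for t in data if antecedent(t)]
    ncovers = len(covered)
    ncorrect = sum(1 for t in covered if t["class"] == label)
    return ncovers / len(data), (ncorrect / ncovers if ncovers else 0.0)

# R: IF age = youth AND student = yes THEN buys_computer = yes
r = (lambda t: t["age"] == "youth" and t["student"] == "yes", "yes")
D = [{"age": "youth", "student": "yes", "class": "yes"},
     {"age": "youth", "student": "yes", "class": "no"},
     {"age": "senior", "student": "no", "class": "no"}]
print(rule_stats(r, D))   # coverage 2/3, accuracy 1/2
```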
Rule Extraction from a Decision Tree
- Knowledge is represented in the form of IF-THEN rules.
- One rule is created for each path from the root to a leaf: each attribute-value pair along a given path forms a conjunct of the rule antecedent (the "IF" part); the leaf node holds the class prediction, forming the rule consequent (the "THEN" part).
- IF-THEN rules are easy to understand.
- Example (for the age/student/credit_rating tree above):
  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
  IF age = "31..40" THEN buys_computer = "yes"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
  IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"

Rule Extraction from the Training Data
- Sequential covering algorithms extract IF-THEN rules directly from the training data, without having to generate a decision tree first.
- Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER.
- Rules are learned sequentially; ideally, when learning a rule for class C, the rule should cover all (or many) of the training tuples of class C and none (or few) of the tuples of other classes.
- Steps:
  - Learn one rule at a time.
  - Each time a rule is learned, remove the tuples it covers.
  - Repeat the process on the remaining tuples until a termination condition is met (no training samples remain, or the quality of a rule falls below a set threshold).

How to Learn One Rule?
- Start with the most general rule possible: condition = empty.
- Add new attribute tests by adopting a greedy depth-first strategy: pick the one that most improves the rule quality.
- Rule-quality measures consider both coverage and accuracy. FOIL-gain (in FOIL and RIPPER) assesses the information gained by extending the condition:
  FOIL_Gain = pos' × (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg)))
  It favors rules that have high accuracy and cover many positive tuples.
- Rule pruning is based on an independent set of test tuples:
  FOIL_Prune(R) = (pos - neg) / (pos + neg)
  where pos/neg are the numbers of positive/negative tuples covered by R. If FOIL_Prune is higher for the pruned version of R, prune R.
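The two quality measures as a short sketch; the coverage counts in the example call are hypothetical:

```python
from math import log2

def foil_gain(pos, neg, pos2, neg2):
    """FOIL_Gain for extending a rule: (pos, neg) are the cover counts
    before adding the new attribute test, (pos2, neg2) after."""
    return pos2 * (log2(pos2 / (pos2 + neg2)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg)/(pos + neg); prune R if the value is
    higher for the pruned version."""
    return (pos - neg) / (pos + neg)

# Hypothetical counts: an extension narrows coverage from (20 pos, 20 neg)
# to (15 pos, 3 neg) -- higher accuracy with many positives is rewarded.
print(foil_gain(20, 20, 15, 3))   # ~11.05
print(foil_prune(15, 3))          # ~0.667
```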
Classification: A Mathematical Mapping
- Classification predicts categorical class labels.
- Typical applications: {credit history, salary} → credit approval (yes/no); {temperature, humidity} → rain (yes/no).
- Mathematically, a classifier is a function that maps an attribute vector x to a class label y.

Linear Classification
- A binary classification problem: the data above the separating (red) line belong to class 'x'; the data below it belong to class 'o'.
- Examples: SVM, perceptron, probabilistic classifiers.

Discriminative Classifiers
- Advantages:
  - Prediction accuracy is generally higher (compared with Bayesian methods).
  - Robust: works when training examples contain errors.
  - Fast evaluation of the learned target function (Bayesian networks are normally slow).
- Limitations:
  - Long training time.
  - Many parameters to set during network construction and training, which are hard to determine.
  - Poor interpretability.

Classification by Backpropagation
- Backpropagation: a neural network learning algorithm.
- Started by psychologists and neurobiologists seeking to develop and test computational analogues of neurons.
- A neural network: a set of connected input/output units, where each connection has a weight associated with it.
- During the learning phase, the network adjusts the weights so as to predict the correct class label of the input tuples.
- Also referred to as "connectionist learning".

Neural Network as a Classifier
- Weakness:
  - Long training time.
  - Requires a number of parameters typically best determined empirically, e.g., the network topology or "structure".
  - Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network.
- Strength:
  - High tolerance to noisy data; ability to classify untrained patterns.
  - Well-suited for continuous-valued inputs and outputs.
  - Successful on a wide array of real-world data.
  - Algorithms are inherently parallel.
  - Techniques have recently been developed for extracting rules from trained neural networks.

A Neuron (Perceptron)
- The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping: the weighted sum of the inputs x0, ..., xn under the weight vector w = (w0, ..., wn), offset by a bias μk, is passed through an activation function f to give the output y = f(Σi wi xi - μk).

A Multi-Layer Feed-Forward Neural Network
- The network maps an input vector X to an output vector through an input layer, one or more hidden layers, and an output layer, with weights wij on the connections between units.

How a Multi-Layer Neural Network Works
- The inputs to the network correspond to the attributes measured for each training tuple.
- The inputs are fed simultaneously into the units making up the input layer; they are then weighted and fed simultaneously to a hidden layer.
- The number of hidden layers is arbitrary, although in practice usually only one is used.
- The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction for the given tuple.
- Feed-forward network: none of the weights cycles back to an input unit or to an output unit of a previous layer.
- From a statistical point of view, a neural network performs nonlinear regression: given enough hidden units and enough training samples, a multi-layer feed-forward network can closely approximate any function.

Defining a Network Topology
- First decide the network topology: the number of units in the input layer, the number of hidden layers (if > 1), the number of units in each hidden layer, and the number of units in the output layer.
- Normalize the input values of each attribute measured in the training tuples to [0.0, 1.0].
- Use one input unit per domain value, each initialized to 0.
- Output: for classification with more than two classes, one output unit per class is used.
- Once a network has been trained, if its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights.

Backpropagation
- Iteratively process the set of training tuples, comparing the network's prediction for each tuple with the actual known target value.
- For each training sample, modify the weights to minimize the mean squared error between the network's prediction and the actual target value.
- Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer.
- Steps (see the sketch below):
  - Initialize the weights (to small random numbers) and biases in the network.
  - Propagate the inputs forward (by applying the activation function).
  - Backpropagate the error (by updating the weights and biases).
  - Check the terminating condition (e.g., when the error is very small).
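A minimal sketch of these four steps for a single hidden layer, using the logistic activation; the layer sizes, learning rate, and toy XOR-style data are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # toy inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # toy targets

# Step 1: initialize weights and biases to small random numbers.
W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(5000):
    # Step 2: propagate the inputs forward through the activation function.
    h = sigmoid(X @ W1 + b1)          # hidden-layer outputs
    out = sigmoid(h @ W2 + b2)        # network predictions

    # Step 3: backpropagate the error (gradient of the squared error).
    err_out = (out - y) * out * (1 - out)
    err_hid = (err_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ err_out;  b2 -= lr * err_out.sum(axis=0)
    W1 -= lr * X.T @ err_hid;  b1 -= lr * err_hid.sum(axis=0)

    # Step 4: terminating condition -- stop when the error is very small.
    if np.mean((out - y) ** 2) < 1e-3:
        break

print(np.round(out.ravel(), 2))   # predictions approach the targets
```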
SVM — Support Vector Machines
- A relatively new classification method for both linear and nonlinear data.
- It uses a nonlinear mapping to transform the original training data into a higher dimension.
- Within the new dimension, it searches for the linear optimal separating hyperplane (the decision boundary).
- With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane.
- SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors).

SVM — History and Applications
- Vapnik and colleagues (1992) — groundwork from Vapnik & Chervonenkis' statistical learning theory of the 1960s.
- Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization).
- Used both for classification and prediction.
- Applications: handwritten digit recognition, object recognition, speaker identification, benchmark time-series prediction tests.

SVM — General Philosophy; Margins and Support Vectors
- (Figures contrasting a small-margin with a large-margin separating hyperplane: the support vectors are the tuples lying on the hyperplanes that bound the margin.)

SVM — When Data Is Linearly Separable
- Let the data D be (X1, y1), ..., (X|D|, y|D|), where Xi is a training tuple and yi its associated class label.
- There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data).
- SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).

SVM — Linearly Separable
- A separating hyperplane can be written as W ● X + b = 0, where W = {w1, w2, ..., wn} is a weight vector and b a scalar (bias).
- For 2-D it can be written as w0 + w1 x1 + w2 x2 = 0.
- The hyperplanes defining the sides of the margin:
  H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
  H2: w0 + w1 x1 + w2 x2 ≤ -1 for yi = -1
- Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors.

SVM — Linearly Inseparable
- Transform the original input data into a higher-dimensional space.
- Search for a linear separating hyperplane in the new space.

SVM — Kernel Functions
- Instead of computing the dot product on the transformed data tuples, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) ● Φ(Xj).
- Typical kernel functions:
  Polynomial kernel of degree h: K(Xi, Xj) = (Xi ● Xj + 1)^h
  Gaussian radial basis function kernel: K(Xi, Xj) = e^(-||Xi - Xj||^2 / (2σ^2))
  Sigmoid kernel: K(Xi, Xj) = tanh(κ Xi ● Xj - δ)
- SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters).
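A small numpy sketch checking the kernel identity K(Xi, Xj) = Φ(Xi) ● Φ(Xj) for the degree-2 polynomial kernel; the explicit feature map Φ is written out only for this 2-D illustration:

```python
import numpy as np

def poly_kernel(x, z, h=2):
    """Polynomial kernel of degree h: K(x, z) = (x . z + 1)^h."""
    return (x @ z + 1.0) ** h

def phi(x):
    """Explicit degree-2 feature map for 2-D input:
    phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z))   # 4.0, computed directly on the original data
print(phi(x) @ phi(z))     # 4.0, the same dot product in the mapped space
```

The point of the kernel trick is that the left computation never materializes the higher-dimensional vectors that the right computation builds explicitly.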
Associative Classification
- Association rules are generated and analyzed for use in classification: search for strong associations between frequent patterns and class labels.
- Classification: based on evaluating a set of rules of the form
  p1 ∧ p2 ∧ ... ∧ pl → "Aclass = C" (conf, sup)
- Why effective?
  - Association rules explore high-confidence associations among multiple attributes, which may overcome the limitation of decision tree induction of considering only one attribute at a time.
  - Associative classification has been found to be more accurate than traditional classification methods.

Typical Associative Classification Methods
- CBA (Classification By Association: Liu, Hsu & Ma, KDD'98): mines possible association rules of the form cond-set (a set of attribute-value pairs) → class label; builds a classifier by organizing rules according to decreasing precedence based on confidence and then support.
- CMAR (Classification based on Multiple Association Rules: Li, Han & Pei, ICDM'01): classification by statistical analysis on multiple rules.
- CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM'03): generation of predictive rules (FOIL-like analysis); high efficiency, accuracy similar to CMAR.
- RCBT (Mining top-k covering rule groups for gene expression data: Cong et al., SIGMOD'05): explores high-dimensional classification using top-k rule groups; achieves high classification accuracy and high run-time efficiency.

Lazy vs. Eager Learning
- Eager learning (the methods discussed above): given a training set, constructs a classification model before receiving new (e.g., test) data to classify.
- Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple.
- Lazy learners spend less time in training but more time in predicting.

Lazy Learner: Instance-Based Methods
- Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified.
- Typical approaches:
  - k-nearest neighbor approach: instances are represented as points in a Euclidean space.
  - Case-based reasoning: uses symbolic representations and knowledge-based inference.

The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-dimensional space.
- The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2).
- The target function may be discrete- or real-valued.
- For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to the query point xq.
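A minimal k-NN sketch; the five 2-D training points are assumed for illustration:

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_classify(xq, training, k=3):
    """Return the most common class among the k training examples
    nearest to the query point xq."""
    neighbors = sorted(training, key=lambda ex: dist(ex[0], xq))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy 2-D training set: (point, class label) pairs.
train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((0.9, 1.1), "+"),
         ((4.0, 4.0), "-"), ((4.2, 3.9), "-")]
print(knn_classify((1.1, 1.0), train, k=3))   # '+'
```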
Case-Based Reasoning (CBR)
- CBR uses a database of problem solutions to solve new problems.
- It stores symbolic descriptions (tuples or cases) — not points in a Euclidean space.
- Applications: customer service (product-related diagnosis), legal ruling.
- Methodology:
  - Instances are represented by rich symbolic descriptions (e.g., function graphs).
  - Search for similar cases; multiple retrieved cases may be combined.
  - Tight coupling between case retrieval, knowledge-based reasoning, and problem solving.
- Challenges:
  - Finding a good similarity metric.
  - Indexing based on a syntactic similarity measure and, on failure, backtracking and adapting to additional cases.

Genetic Algorithms (GA)
- Genetic algorithms are based on an analogy to biological evolution.
- An initial population is created, consisting of randomly generated rules; each rule is represented by a string of bits. E.g., "IF A1 AND NOT A2 THEN C2" can be encoded as 100; if an attribute has k > 2 values, k bits can be used.
- Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring; the fitness of a rule is represented by its classification accuracy on a set of training examples.
- Offspring are generated by crossover and mutation.
- The process continues until a population P evolves in which each rule satisfies a prespecified fitness threshold.
- Slow, but easily parallelizable.

Rough Set Approach
- Rough sets are used to approximately or "roughly" define equivalence classes.
- A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C).
- Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix (which stores the differences between attribute values for each pair of data tuples) can be used to reduce the computational intensity.

Fuzzy Set Approaches
- Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as in a fuzzy membership graph).
- Attribute values are converted to fuzzy values; e.g., income is mapped into the discrete categories {low, medium, high}, with fuzzy values calculated for each.
- For a given new sample, more than one fuzzy value may apply.
- Each applicable rule contributes a vote for membership in the categories; typically, the truth values for each predicted category are summed, and these sums are combined.

What Is Prediction?
- (Numerical) prediction is similar to classification: construct a model, then use the model to predict a continuous or ordered value for a given input.
- Prediction is different from classification: classification refers to predicting categorical class labels, while prediction models continuous-valued functions.
- The major method for prediction is regression: regression analysis can be used to model the relationship between one or more independent (predictor) variables and a (continuous-valued) dependent (response) variable.
- Regression analysis: linear and multiple regression; non-linear regression; other regression methods such as generalized linear models, Poisson regression, log-linear models, and regression trees.
Linear Regression
- Linear regression involves a response variable y and a single predictor variable x, and models y as a linear function of x: y = w0 + w1 x, where the regression coefficients w0 and w1 are typically estimated by the method of least squares.
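A least-squares sketch for this single-predictor case; the (x, y) pairs are toy data assumed for illustration:

```python
def fit_line(xs, ys):
    """Least-squares estimates for y = w0 + w1*x:
    w1 = sum((x-mx)(y-my)) / sum((x-mx)^2), w0 = my - w1*mx."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    w0 = my - w1 * mx
    return w0, w1

# Toy data: years of experience vs. salary (in $1000s).
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
print(fit_line(xs, ys))   # ~ (23.2, 3.54), i.e., y ≈ 23.2 + 3.54x
```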