Introduction to Data Mining: Iris KDD Analysis Course Material


Title: KDD Experiment on the Iris Data Set
College: College of Information Science and Technology
Major: Computer Science and Technology
Student name: 何東升
Student ID: 201413030119
Advisor:
Internship location: Chengdu University of Technology
Date: September 2016

Chapter 1. Purpose and Content of the Experiment

1.1 Purpose

Knowledge discovery (KDD: Knowledge Discovery in Databases) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data sets. Knowledge discovery turns information into knowledge: it finds the nuggets of knowledge buried in the data mine and so contributes to knowledge innovation and to the development of the knowledge economy. The term appeared in 1989, and the definition given above is Fayyad's.

The purpose of KDD is to use the discovered patterns to solve practical problems. Patterns that people can understand help them grasp the information those patterns contain, and therefore evaluate and use it better.

1.2 Core idea of the algorithm

A KDD project usually consists of a series of complex mining steps. In their 1996 paper "From Data Mining to Knowledge Discovery in Databases", Fayyad, Piatetsky-Shapiro and Smyth summarized the five basic steps of KDD (see figure):

1. selection: first determine what kind of data can be used in the KDD project.
2. pre-processing: once the data has been collected, preprocess it to remove errors and missing information as far as possible.
3. transformation: convert the data into the format required by the data mining tool. This step can improve the results.
4. data mining: apply the data mining tool.
5. interpretation/evaluation: understand and evaluate the data mining results.

1.3 Experimental software

Software: Weka 3-9.
Data set source: http://archive.ics.uci.edu/ml/datasets/Iris

Chapter 2. Experimental Procedure

2.1 Data preparation

1. Download the Iris data source from the official UCI data set website.
2. Extract, clean, and transform the data.
3. Examine the Iris data set, whose UCI summary page is shown in the figure.
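Steps 1 and 2 of the data preparation can be sketched in Python. This is only an illustration under stated assumptions, not the procedure used in the experiment: the helper name csv_to_arff is hypothetical, the three sample rows stand in for the downloaded file (they follow the comma-separated layout of the UCI iris.data file), and the attribute names are the ones that appear in the Weka output of this report. The target format is ARFF, which Weka reads.

```python
# Minimal sketch: clean iris.data-style CSV rows and write them as ARFF text.
# csv_to_arff is a hypothetical helper, not part of Weka.

ATTRIBUTES = ["sepal length", "sepal width", "petal length", "petal width"]


def csv_to_arff(csv_lines, relation="iris"):
    """Return ARFF text for rows shaped like 'f1,f2,f3,f4,class'."""
    rows = []
    for line in csv_lines:
        line = line.strip()
        if not line:
            continue  # cleaning: skip blank lines
        fields = line.split(",")
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
        for value in fields[:4]:
            float(value)  # cleaning: raises ValueError on a non-numeric value
        rows.append(fields)

    classes = sorted({row[4] for row in rows})
    out = [f"@relation {relation}", ""]
    # Attribute names containing spaces must be quoted in ARFF.
    out += [f"@attribute '{name}' numeric" for name in ATTRIBUTES]
    out.append("@attribute class {" + ",".join(classes) + "}")
    out += ["", "@data"]
    out += [",".join(row) for row in rows]
    return "\n".join(out) + "\n"


# Illustrative input: one row per species plus a blank line to be dropped.
sample = [
    "5.1,3.5,1.4,0.2,Iris-setosa",
    "7.0,3.2,4.7,1.4,Iris-versicolor",
    "",
    "6.3,3.3,6.0,2.5,Iris-virginica",
]
arff = csv_to_arff(sample)
print(arff)
```

Writing the returned text to a file with the .arff extension gives a file that can be opened in the Weka Explorer. Weka can also load CSV files directly, so this conversion is optional.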

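Iris Data Set
Abstract: Famous database; from Fisher, 1936

Data Set Characteristics: Multivariate
Attribute Characteristics: Real
Associated Tasks: Classification
Number of Instances: 150
Number of Attributes: 4
Missing Values? No
Area: Life
Date Donated: 1988-07-01
Number of Web Hits: 1089640

Iris, also known as the iris flower data set, is a data set for multivariate analysis. Four attributes (sepal length, sepal width, petal length, and petal width) are used to predict which of three species (Setosa, Versicolour, Virginica) an iris flower belongs to.

2.2 Experimental procedure

2.2.1 Modeling

(1) C4.5 data mining algorithm

1. Use Weka for supervised learning. Select the C4.5 data mining algorithm, which is named J48 in Weka, set Test options to Percentage split with the default percentage of 66%, and select class as the output attribute, as shown in the figure:

[Figure: Weka Classify tab. Classifier: J48 -C 0.25 -M 2. Test options: Percentage split, 66% (the other options are Use training set, Supplied test set, and Cross-validation with 10 folds). Class attribute: (Nom) class.]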
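C4.5 ranks candidate splits by gain ratio, that is, information gain divided by the split information. The sketch below works this out for a single split and also computes the sizes implied by a 66% percentage split of 150 instances. It rests on two assumptions that are not stated in the report: that the split sends all 50 Iris-setosa instances to one branch and the remaining 100 instances to the other (which is what a threshold of petal width <= 0.6 does on this data set), and that the training part is 66% of the data rounded to the nearest integer. The function names are illustrative and are not Weka APIs.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def gain_ratio(parent, branches):
    """Return (information gain, split information, gain ratio)."""
    total = sum(parent)
    # Weighted entropy of the branches after the split.
    remainder = sum(sum(b) / total * entropy(b) for b in branches)
    gain = entropy(parent) - remainder
    # Split information: entropy of the branch sizes themselves.
    split_info = entropy([sum(b) for b in branches])
    return gain, split_info, gain / split_info


# Class counts in the order setosa, versicolor, virginica.
parent = [50, 50, 50]
# Assumed split: 50 setosa on one side, the other 100 instances on the other.
branches = [[50, 0, 0], [0, 50, 50]]

gain, split_info, ratio = gain_ratio(parent, branches)
print(f"gain={gain:.4f} split_info={split_info:.4f} gain_ratio={ratio:.4f}")

# Percentage split: 66% of the 150 instances for training, the rest for testing.
train_size = round(150 * 66 / 100)
test_size = 150 - train_size
print(train_size, test_size)
```

For this split the information gain and the split information are equal, so the gain ratio is 1, the largest value it can take for a two-way split.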

2. When the settings are complete, click Start to run the classifier.

(2) Simple KMeans algorithm

1. Load the data into Weka, switch to the Cluster tab, and select the SimpleKMeans algorithm.
2. Set the algorithm parameters: display standard deviations, set the number of iterations to 5000, and leave everything else at its default. Set the number of clusters to 3, because there are three species of flower, as shown in the figure below.
3. In the Cluster mode panel, select Use training set as the evaluation data, then click Ignore attributes and ignore the class attribute.
4. Click the Start button to run the program.

Chapter 3. Experimental Results and Analysis

3.1 C4.5 result analysis

1. Run results

=== Run information ===
Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     iris
Instances:    150
Attributes:   5
              sepal length
              sepal width
              petal length
              petal width
              class
Test mode:    split 66.0% train, remainder test

=== Classifier model (full training set) ===
J48 pruned tree

petal width <= 0.6: ...
petal width > 0.6
|   petal width <= 1.7
|   |   ...
|   |   |   petal width > 1.5: Iris-versicolor (3.0/1.0)
|   petal width > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves: 5
Size of the tree: 9
Time taken to build model: 0.01 seconds

=== Evaluation on test split ===
Time taken to test model on training split: 0 seconds

=== Summary ===
Correctly Classified Instances      49    96.0784 %
Incorrectly Classified Instances     2     3.9216 %
Kappa statistic                      0.9408
Mean absolute error                  0.0396
Root mean squared error              0.1579
Relative absolute error              8.8979 %
Root relative squared error         33.4091 %
Total Number of Instances           51

=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               1.000    0.000    1.000      1.000   1.000      1.000  1.000     1.000     Iris-setosa
               1.000    0.063    0.905      1.000   0.950      0.921  0.969     0.905     Iris-versicolor
               0.882    0.000    1.000      0.882   0.938      0.913  0.967     0.938     Iris-virginica
Weighted Avg.  0.961    0.023    0.965      0.961   0.961      0.942  0.977     0.944

=== Confusion Matrix ===
(the matrix values are not legible in the source)
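The per-class rates and the kappa statistic in the evaluation output follow directly from the confusion matrix. The matrix is not legible in the output, so the one used below is a reconstruction and should be read as an assumption: it is the matrix implied by the reported figures (51 test instances, 49 of them correct, recall 0.882 for Iris-virginica, precision 0.905 for Iris-versicolor). The function names are illustrative.

```python
LABELS = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows are actual classes, columns are predicted classes.
# Assumption: reconstructed from the reported rates, not copied from Weka.
MATRIX = [
    [15, 0, 0],
    [0, 19, 0],
    [0, 2, 15],
]


def per_class_metrics(matrix):
    """Return one dict of rates per class, computed one-vs-rest."""
    n = sum(sum(row) for row in matrix)
    results = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        actual = sum(matrix[i])                     # instances of class i
        predicted = sum(row[i] for row in matrix)   # instances labelled i
        precision = tp / predicted
        recall = tp / actual                        # same as the TP rate
        results.append({
            "recall": recall,
            "fp_rate": (predicted - tp) / (n - actual),
            "precision": precision,
            "f_measure": 2 * precision * recall / (precision + recall),
        })
    return results


def accuracy(matrix):
    n = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / n


def kappa(matrix):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = sum(sum(row) for row in matrix)
    observed = accuracy(matrix)
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix)
        for i in range(len(matrix))
    ) / n ** 2
    return (observed - expected) / (1 - expected)


metrics = per_class_metrics(MATRIX)
for label, m in zip(LABELS, metrics):
    print(label, {name: round(value, 4) for name, value in m.items()})
print("accuracy", round(accuracy(MATRIX), 4), "kappa", round(kappa(MATRIX), 4))
```

With this matrix both errors are Iris-virginica instances predicted as Iris-versicolor, which is why only those two classes have rates below 1.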

[Figure: plots over sepal length, sepal width, petal length, and petal width]

The results show that the data was divided into three clusters, but the cluster proportions are not very close to the 1:1:1 class ratio of the original data. Judging from the C4.5 results, petal width and petal length fit the classes better, so the attributes were reselected. Clustering with only petal width and petal length gives the following results:

=== Run information ===
Scheme:       weka.clusterers.SimpleKMeans -init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -V -N 3 -A "weka.core.EuclideanDistance -R first-last" -I 5000 -num-slots 1 -S 10
Relation:     iris
Instances:    150
Attributes:   5
              petal length
              petal width
Ignored:
              sepal length
              sepal width
              class
Test mode:    evaluate on training data

=== Clustering model (full training set) ===
kMeans

Number of iterations: 6
Within cluster sum of squared errors: 1.7050986081225123

Initial starting points (random):
Cluster 0: 4.7,1.4
Cluster 1: 4.3,1.3
Cluster 2: 5.1,2.3

Missing values globally replaced with mean/mode

Final cluster centroids:
                            Cluster#
Attribute      Full Data          0          1          2
                 (150.0)     (52.0)     (50.0)     (48.0)
petal length      3.7587     4.2962      1.464     5.5667
               +/-1.7644  +/-0.5053  +/-0.1735   +/-0.549
petal width       1.1987      1.325      0.244     2.0562
               +/-0.7632  +/-0.1856  +/-0.1072  +/-0.2422

Time taken to build model (full training data): 0.02 seconds

=== Model and evaluation on training set ===
Clustered Instances
0      52 ( 35%)
1      50 ( 33%)
2      48 ( 32%)
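The assignment step of k-means, and the comparison of the cluster sizes with the 1:1:1 class ratio, can be sketched from the reported centroids. This is a simplified illustration, not a reimplementation of SimpleKMeans: it uses plain Euclidean distance on the raw values, whereas Weka's EuclideanDistance normalizes attributes by default, so points near a cluster boundary could be assigned differently. The centroid and size values are read from the output above (the petal width of cluster 0 and the size of cluster 2 are partly illegible there and are taken as 1.325 and 48, which is what the full-data means and the total of 150 imply), and the three query points are made up for illustration.

```python
from math import dist

# Final centroids as (petal length, petal width), read from the Weka output.
# Assumption: 1.325 for cluster 0 is inferred, the printed value is illegible.
CENTROIDS = {
    0: (4.2962, 1.325),
    1: (1.464, 0.244),
    2: (5.5667, 2.0562),
}

# Cluster sizes; 48 for cluster 2 is inferred from the total of 150.
SIZES = {0: 52, 1: 50, 2: 48}


def assign(point):
    """Return the id of the centroid closest to point (raw Euclidean distance)."""
    return min(CENTROIDS, key=lambda c: dist(point, CENTROIDS[c]))


# Illustrative query points (petal length, petal width), not taken from the report.
for point in [(1.4, 0.2), (4.5, 1.5), (6.0, 2.5)]:
    print(point, "-> cluster", assign(point))

# Share of each cluster compared with the ideal 1/3 per species.
total = sum(SIZES.values())
shares = {c: round(n / total, 2) for c, n in SIZES.items()}
print(shares)
```

With only the two petal attributes the cluster sizes of 52, 50, and 48 are within two instances of the 50 per species in the data set.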
