
Clustering (Part 2)
- BIRCH
- Density-based clustering: DBSCAN and DENCLUE
- Grid-based approaches: STING and CLIQUE
- SOM
- Outlier detection
- Summary
Remark: only DENCLUE and, briefly, grid-based clustering will be covered in 2007.

BIRCH (1996)
- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, and Livny (SIGMOD'96)
- Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
  - Phase 1: scan the database to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
  - Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
- Weakness: handles only numeric data, and is sensitive to the order of the data records

Clustering Feature Vector
- Clustering Feature: CF = (N, LS, SS)
  - N: number of data points
  - LS: the linear sum, LS = Σ_{i=1..N} X_i
  - SS: the square sum, SS = Σ_{i=1..N} X_i²
- Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)), with the sums taken per dimension

CF Tree (figure)
- A height-balanced tree with branching factor B = 7 and leaf capacity L = 6: the root and non-leaf nodes hold CF entries with child pointers (CF1/child1, CF2/child2, ...); leaf nodes hold CF entries and are chained by prev/next pointers

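To make the CF arithmetic concrete, here is a minimal Python sketch (NumPy assumed available) of the Clustering Feature from the slide above; the CF class, its merge method, and the per-dimension SS convention are illustrative choices matching the slide's example, not BIRCH's actual implementation.

```python
import numpy as np

class CF:
    """Clustering Feature CF = (N, LS, SS); SS is kept per dimension to match the slide's example."""
    def __init__(self, n, ls, ss):
        self.n = n
        self.ls = np.asarray(ls, dtype=float)   # linear sum, one entry per dimension
        self.ss = np.asarray(ss, dtype=float)   # square sum, one entry per dimension

    @classmethod
    def from_points(cls, points):
        pts = np.asarray(points, dtype=float)
        return cls(len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0))

    def merge(self, other):
        # CF additivity: the CF of the union of two disjoint sub-clusters
        # is the component-wise sum of their CFs.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

cf = CF.from_points([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(cf.n, cf.ls, cf.ss)    # 5 [16. 30.] [ 54. 190.]
print(cf.centroid())         # [3.2 6. ]
```

Merging two CFs by component-wise addition is what lets a CF tree absorb new points and combine sub-clusters without revisiting the raw data.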
Chapter 8. Cluster Analysis
- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Clustering Methods
- Outlier Analysis
- Summary

Density-Based Clustering Methods
- Clustering based on density (a local cluster criterion), such as density-connected points, or based on an explicitly constructed density function
- Major features:
  - Discover clusters of arbitrary shape
  - Handle noise
  - One scan
  - Need density parameters
- Several interesting studies:
  - DBSCAN: Ester, et al. (KDD'96)
  - OPTICS: Ankerst, et al. (SIGMOD'99)
  - DENCLUE: Hinneburg & Keim (KDD'98)
  - CLIQUE: Agrawal, et al. (SIGMOD'98)

Density-Based Clustering: Background
- Two parameters:
  - Eps: maximum radius of the neighbourhood
  - MinPts: minimum number of points in an Eps-neighbourhood of that point
- N_Eps(p) = {q in D | dist(p, q) <= Eps}
- Directly density-reachable: a point p is directly density-reachable from a point q wrt. Eps, MinPts if p belongs to N_Eps(q) and q satisfies the core point condition |N_Eps(q)| >= MinPts
- (figure: points p and q with MinPts = 5, Eps = 1 cm)

Density-Based Clustering: Background (II)
- Density-reachable: a point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, ..., pn with p1 = q and pn = p such that p_{i+1} is directly density-reachable from p_i
- Density-connected: a point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts
- (figures: a chain p1, ... linking q to p; p and q both density-reachable from o)

DBSCAN: Density-Based Spatial Clustering of Applications with Noise
- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise
- (figure: core, border, and outlier points with Eps = 1 cm and MinPts = 5; border points are density-reachable from a core point, outliers are not)

DBSCAN: The Algorithm
- Arbitrarily select a point p
- Retrieve all points density-reachable from p wrt. Eps and MinPts
- If p is a core point, a cluster is formed
- If p is not a core point, no points are density-reachable from p and DBSCAN visits the next point of the database
- Continue the process until all of the points have been processed

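The procedure above can be sketched as a brute-force Python function; the point set and the Eps/MinPts values below are invented for illustration, and a real implementation would answer the neighbourhood queries with a spatial index rather than scanning all points.

```python
from math import dist  # Python 3.8+

def region_query(points, i, eps):
    """Indices of all points within Eps of points[i] (its Eps-neighbourhood)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:            # p is not a core point
            labels[i] = NOISE
            continue
        cluster_id += 1                          # core point: a cluster is formed
        labels[i] = cluster_id
        seeds = list(neighbours)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:               # border point: reachable but not core
                labels[j] = cluster_id
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:     # expand the cluster only through core points
                seeds.extend(j_neighbours)
    return labels

pts = [(1, 1), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8),
       (5, 5), (5.1, 5.2), (4.9, 5.1), (5.2, 4.8), (9, 1)]
print(dbscan(pts, eps=0.6, min_pts=3))   # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```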
DENCLUE: Using Density Functions
- DENsity-based CLUstEring, by Hinneburg & Keim (KDD'98)
- Major features:
  - Solid mathematical foundation
  - Good for data sets with large amounts of noise
  - Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
  - Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
  - But needs a large number of parameters

DENCLUE: Technical Essence
- Uses grid cells, but only keeps information about grid cells that actually contain data points, and manages these cells in a tree-based access structure
- Influence function: describes the impact of a data point within its neighborhood
- The overall density of the data space can be calculated as the sum of the influence functions of all data points
- Clusters can be determined mathematically by identifying density attractors
- Density attractors are local maxima of the overall density function

Gradient: the steepness of a slope (example figure)

Example: Density Computation
- D = {x1, x2, x3, x4}
- f_D^Gaussian(x) = influence(x1) + influence(x2) + influence(x3) + influence(x4) = 0.04 + 0.06 + 0.08 + 0.6 = 0.78
- (figure: the four data points and two query points x and y on a line, with the individual influence values 0.6, 0.08, 0.06, 0.04 at x)
- Remark: the density value of y would be larger than the one for x

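A small Python sketch of a Gaussian influence function and the summed overall density; the σ parameter and the sample coordinates are assumptions made up for this example, so the numbers do not reproduce the 0.78 on the slide.

```python
import math

def gaussian_influence(x, y, sigma=1.0):
    """Influence of data point y at location x: exp(-d(x, y)^2 / (2 * sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))

def density(x, data, sigma=1.0):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

# Hypothetical 1-D data set D = {x1, x2, x3, x4}; the coordinates are invented.
D = [(1.0,), (1.5,), (2.0,), (5.0,)]
print(density((5.2,), D))   # a point next to x4 gets most of its density from x4
print(density((3.5,), D))   # a point between the two groups has lower density
```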
Density Attractor (figure)

Examples of DENCLUE Clusters (figure)

Basic Steps of the DENCLUE Algorithm
1. Determine density attractors
2. Associate data objects with density attractors (initial clustering)
3. Merge the initial clusters further, relying on a hierarchical clustering approach (optional)

Chapter 8. Cluster Analysis (outline)

Steps of Grid-based Clustering Algorithms
Basic grid-based algorithm:
1. Define a set of grid cells
2. Assign objects to the appropriate grid cell and compute the density of each cell
3. Eliminate cells whose density is below a certain threshold t
4. Form clusters from contiguous (adjacent) groups of dense cells (usually minimizing a given objective function)

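The four steps above can be sketched directly in Python; the cell size and density threshold below are invented, and clusters are formed by a flood fill over edge-adjacent dense cells.

```python
from collections import defaultdict

def grid_clusters(points, cell_size, density_threshold):
    """Basic grid-based clustering: bin points into cells, drop sparse cells,
    then form clusters from contiguous (edge-adjacent) dense cells."""
    # 1.-2. Define grid cells, assign objects to them, and compute each cell's density.
    cells = defaultdict(int)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))] += 1
    # 3. Eliminate cells whose density is below the threshold t.
    dense = {c for c, n in cells.items() if n >= density_threshold}
    # 4. Form clusters from adjacent groups of dense cells (flood fill over the grid).
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            cx, cy = stack.pop()
            component.append((cx, cy))
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(component)
    return clusters

pts = [(0.1, 0.2), (0.4, 0.3), (1.2, 0.4), (1.5, 0.1),
       (1.1, 1.0), (1.3, 1.2), (5.0, 5.1), (5.2, 5.3)]
print(grid_clusters(pts, cell_size=1.0, density_threshold=2))   # two clusters of dense cells
```

Note that, as the next slide stresses, a full implementation works on the cell summaries rather than on the raw points once the cells have been filled.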
Advantages of Grid-based Clustering Algorithms
- Fast:
  - No distance computations
  - Clustering is performed on summaries and not on individual objects; complexity is usually O(#populated grid cells) and not O(#objects)
- Easy to determine which clusters are neighboring
- Shapes are limited to unions of grid cells

Grid-Based Clustering Methods
- Use a multi-resolution grid data structure
- Clustering complexity depends on the number of populated grid cells and not on the number of objects in the dataset
- Several interesting methods (in addition to the basic grid-based algorithm):
  - STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
  - CLIQUE: Agrawal, et al. (SIGMOD'98)

STING: A Statistical Information Grid Approach
- Wang, Yang and Muntz (VLDB'97)
- The spatial area is divided into rectangular cells
- There are several levels of cells corresponding to different levels of resolution

STING: A Statistical Information Grid Approach (2)
- Each cell at a high level is partitioned into a number of smaller cells at the next lower level
- Statistical information for each cell is calculated and stored beforehand and is used to answer queries
- Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells:
  - count, mean, s, min, max
  - type of distribution: normal, uniform, etc.
- Use a top-down approach to answer spatial data queries

STING: Query Processing (3)
- Uses a top-down approach to answer spatial data queries
- Start from a pre-selected layer, typically with a small number of cells
- From the pre-selected layer until the bottom layer is reached, do the following:
  - For each cell in the current level, compute the confidence interval indicating the cell's relevance to the given query
  - If it is relevant, include the cell in a cluster
  - If it is irrelevant, remove the cell from further consideration
  - Otherwise, look for relevant cells at the next lower layer
- Combine relevant cells into relevant regions (based on grid-neighborhood) and return the clusters so obtained as your answers

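A toy Python illustration of that top-down flow; the Cell fields, the two-level hierarchy, and the simple mean-based relevance test are stand-ins for STING's stored statistics and confidence-interval computation, and the final combine-into-regions step is omitted.

```python
class Cell:
    """One grid cell: pre-computed statistics plus children at the next lower level.
    The fields here (count, mean) are a small illustrative subset."""
    def __init__(self, count, mean, children=None):
        self.count, self.mean, self.children = count, mean, children or []

def relevant_bottom_cells(cell, is_relevant, hits):
    """Top-down query: test a cell against the query using its stored statistics,
    prune irrelevant cells, descend into relevant ones, and collect relevant
    cells at the bottom layer."""
    if not is_relevant(cell):
        return                          # irrelevant: drop the cell and its whole subtree
    if not cell.children:
        hits.append(cell)               # bottom layer reached
        return
    for child in cell.children:        # otherwise look at the next lower layer
        relevant_bottom_cells(child, is_relevant, hits)

# Two-level toy hierarchy; the query looks for regions whose mean attribute value exceeds 5.
root = Cell(count=100, mean=7.0, children=[
    Cell(count=60, mean=12.0, children=[Cell(30, 14.0), Cell(30, 10.5)]),
    Cell(count=40, mean=2.0,  children=[Cell(20, 2.5),  Cell(20, 1.5)]),
])
hits = []
relevant_bottom_cells(root, lambda c: c.mean > 5, hits)
print([(c.count, c.mean) for c in hits])   # [(30, 14.0), (30, 10.5)]
```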
STING: A Statistical Information Grid Approach (3)
- Advantages:
  - Query-independent, easy to parallelize, incremental update
  - O(K), where K is the number of grid cells at the lowest level
- Disadvantages:
  - All the cluster boundaries are either horizontal or vertical; no diagonal boundary is detected

CLIQUE (Clustering In QUEst)
- Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
- Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
- CLIQUE can be considered as both density-based and grid-based:
  - It partitions each dimension into the same number of equal-length intervals
  - It partitions an m-dimensional data space into non-overlapping rectangular units
  - A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
  - A cluster is a maximal set of connected dense units within a subspace

CLIQUE: The Major Steps
1. Partition the data space and find the number of points that lie inside each cell of the partition
2. Identify the subspaces that contain clusters, using the Apriori principle
3. Identify clusters:
   - Determine dense units in all subspaces of interest
   - Determine connected dense units in all subspaces of interest
4. Generate a minimal description for the clusters:
   - Determine the maximal regions that cover a cluster of connected dense units, for each cluster
   - Determine the minimal cover for each cluster

(figure: dense units in the salary-age and vacation-age planes, and the corresponding region in the 3-D salary-vacation-age space)

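An illustrative Python sketch of the first two steps for a 2-D data set (the points, the number of intervals, and the density threshold tau are invented): it finds dense 1-D units and then, in the spirit of the Apriori principle, only counts 2-D candidate units whose 1-D projections are both dense. The connected-unit and minimal-description steps are omitted.

```python
from collections import Counter
from itertools import combinations

def dense_units(points, n_intervals, tau):
    """Dense units per subspace, built bottom-up: 1-D units first, then 2-D
    candidates pruned to those whose 1-D projections are already dense."""
    n, dims = len(points), len(points[0])
    lows  = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    width = [(highs[d] - lows[d]) / n_intervals or 1.0 for d in range(dims)]

    def interval(p, d):
        # Index of the equal-length interval of dimension d that p falls into.
        return min(int((p[d] - lows[d]) / width[d]), n_intervals - 1)

    dense = {}
    for d in range(dims):                     # 1-D units
        counts = Counter(interval(p, d) for p in points)
        dense[(d,)] = {(i,) for i, c in counts.items() if c / n > tau}

    for d1, d2 in combinations(range(dims), 2):   # 2-D units, Apriori-pruned
        counts = Counter((interval(p, d1), interval(p, d2)) for p in points)
        dense[(d1, d2)] = {u for u, c in counts.items()
                           if c / n > tau
                           and (u[0],) in dense[(d1,)] and (u[1],) in dense[(d2,)]}
    return dense

pts = [(25, 3), (27, 3.5), (26, 4), (55, 3.2), (57, 3.6), (56, 4.1), (40, 1.0)]
print(dense_units(pts, n_intervals=4, tau=0.25))   # two dense 2-D units, one per group
```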
Strength and Weakness of CLIQUE
- Strength:
  - It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
  - It is insensitive to the order of records in the input and does not presume some canonical data distribution
  - It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
- Weakness:
  - The accuracy of the clustering result may be degraded at the expense of the simplicity of the method

Chapter 8. Cluster Analysis (outline)

Self-Organizing Feature Maps (SOMs)
- Clustering is also performed by having several units compete for the current object
- The unit whose weight vector is closest to the current object wins
- The winner and its neighbors learn by having their weights adjusted
- SOMs are believed to resemble processing that can occur in the brain
- Useful for visualizing high-dimensional data in 2-D or 3-D space

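A compact NumPy sketch of that competitive-learning loop; the grid size, learning rate, and neighbourhood radius are invented, and the usual decay of the learning rate and radius over time is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid_shape=(5, 5), epochs=20, lr=0.5, radius=1.5):
    """Toy SOM training: for each object, the unit whose weight vector is closest
    wins, and the winner and its grid neighbours move towards the object."""
    rows, cols = grid_shape
    weights = rng.random((rows, cols, data.shape[1]))
    coords = np.dstack(np.mgrid[0:rows, 0:cols]).astype(float)  # grid position of each unit
    for _ in range(epochs):
        for x in data:
            # Competition: best matching unit = unit with the closest weight vector.
            dists = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(dists.argmin(), dists.shape)
            # Cooperation: Gaussian neighbourhood on the 2-D grid around the winner.
            grid_dist = np.linalg.norm(coords - np.array(bmu, dtype=float), axis=2)
            h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            # Adaptation: the winner and its neighbours have their weights adjusted.
            weights += lr * h[..., None] * (x - weights)
    return weights

data = rng.random((100, 3))     # e.g. 100 points in 3-D mapped onto a 5x5 grid
som = train_som(data)
print(som.shape)                # (5, 5, 3): one weight vector per map unit
```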
Chapter 8. Cluster Analysis (outline)

What Is Outlier Discovery?
- What are outliers?
  - A set of objects that are considerably dissimilar from the remainder of the data
  - Example: sports figures such as Michael Jordan, Wayne Gretzky, ...
- Problem: find the top n outlier points
- Applications:
  - Credit card fraud detection
  - Telecom fraud detection
  - Customer segmentation
  - Medical analysis

Outlier Discovery: Statistical Approaches
- Assume a model of the underlying distribution that generates the data set (e.g. a normal distribution)
- Use discordancy tests, which depend on:
  - the data distribution
  - the distribution parameters (e.g., mean, variance)
  - the number of expected outliers
- Drawbacks:
  - Most tests are for a single attribute
  - In many cases, the data distribution may not be known

Outlier Discovery: Distance-Based Approach
- Introduced to counter the main limitations imposed by statistical methods: we need multi-dimensional analysis without knowing the data distribution
- Distance-based outlier: a DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lies at a distance greater than D from O
- Algorithms for mining distance-based outliers (see textbook):
  - Index-based algorithm
  - Nested-loop algorithm
  - Cell-based algorithm

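A brute-force version of the nested-loop idea in Python (the data, p, and D below are invented); the index-based and cell-based algorithms exist precisely to avoid this quadratic scan.

```python
from math import dist  # Python 3.8+

def db_outliers(data, p, D):
    """Nested-loop test for DB(p, D)-outliers: O is reported as an outlier if at
    least a fraction p of the other objects lies farther than D from O."""
    n = len(data)
    outliers = []
    for i, o in enumerate(data):
        far = sum(1 for j, q in enumerate(data) if j != i and dist(o, q) > D)
        if far / (n - 1) >= p:
            outliers.append(o)
    return outliers

pts = [(1, 1), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2), (0.9, 0.8), (9, 9)]
print(db_outliers(pts, p=0.95, D=3.0))   # only the isolated point (9, 9)
```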
Chapter 8. Cluster Analysis (outline)

Problems and Challenges
- Considerable progress has been made in scalable clustering methods:
  - Partitioning/representative-based: k-means, k-medoids, CLARANS, EM
  - Hierarchical: BIRCH, CURE
  - Density-based: DBSCAN, DENCLUE, CLIQUE, OPTICS
  - Grid-based: STING, CLIQUE
  - Model-based: AutoClass, COBWEB, SOM
- Current clustering techniques do not address all the requirements adequately

References
- R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.
- M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
- M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
- P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
- M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
- D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
- D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98.
- S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.
- A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
- L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
- E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
- G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988.
- P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
- R. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. VLDB'94.
- E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
- G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
- W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
- T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.

