中國科技論文在線

Implementing Text Clustering by Compound Evolutionary Algorithm

QIAO Yingying, SONG Wei*
(School of Internet of Things Engineering, Jiangnan University)

Foundations: National Natural Science Foundation of China (61103129); Natural Science Foundation of Jiangsu Province (SBK201122266); the Specialized Research Fund for the Doctoral Program of Higher Education (20100093120004)
Brief author introduction: QIAO Yingying (1987-), female, graduate student, text clustering
Correspondence author: SONG Wei (1981-), male, associate professor, machine learning, information retrieval

Abstract: To solve the premature convergence and the optimization limitations of a single clustering algorithm, this paper proposes a new clustering algorithm combining GA and QPSO for high-dimensional text clustering. The robust global optimization performance of GA is used to initialize the particles in QPSO, so that the searching efficiency and clustering performance of the QPSO algorithm are improved by giving full play to the advantages of the two algorithms. We conducted experiments on the 20Newsgroup dataset; the results show that our method outperforms the sole GA or QPSO method and achieves high text clustering accuracy and effectiveness, measured by precision, recall and F-measure, with tolerable time consumption.

Key words: text clustering; GA clustering; QPSO clustering; PSO clustering

0 Introduction

Clustering of text documents plays a vital role in efficient
document organization, summarization, topic extraction and information retrieval [1]. It organizes text sets according to the nature of the text content and partitions the entire text collection into several classes through the corresponding algorithms; as a result, documents belonging to the same class are as similar to each other as possible, and vice versa. Because it needs no training process and no manual annotation of objects, text clustering has a certain flexibility and a high ability of automatic processing.

Clustering high-dimensional datasets is a contemporary challenge [2]. With the development of science and technology and the broadening scope of engineering problems, text datasets are usually composed of tens of thousands of keywords. The problem of high-dimensional text data thus makes existing clustering algorithms fail to satisfy the requirements of information retrieval systems. Moreover, the improvement of a single algorithm has its own limitations. In particular, premature convergence and poor local optimization ability are the common faults of almost all stochastic optimization algorithms. Combining the advantages of different swarm intelligence algorithms to improve optimization performance is therefore a new research hotspot in the field of optimization.

Genetic algorithm (GA), a typical random search and optimization algorithm, was first proposed by Professor J. H. Holland in the 1970s [3]. The algorithm imitates the process of biological evolution and inheritance, following Darwin's competition principle of "survival of the fittest, elimination of the inferior" [4]. It realizes optimization through the reproductive evolution of the advantaged individuals in a population. However, its local search ability is weak and it tends to converge prematurely. Quantum-behaved particle swarm optimization (QPSO) is a swarm intelligence optimization algorithm proposed by Sun J. et al. [5]. The inspiration for the QPSO algorithm came from quantum mechanics and from trajectory analysis of individual particle behavior in the particle swarm optimization (PSO) algorithm. Since PSO has been proved not to be a globally convergent algorithm, QPSO was proposed as a variant of PSO to improve the global search ability of classical PSO.

In this paper, we propose a compound method which combines GA and QPSO (GA-QPSO) for text clustering. In such a compound method
the robust global search ability of GA is utilized to initialize the particles of QPSO, which improves the robustness of the QPSO algorithm.

The remainder of this paper is organized as follows: Section 1 provides a general overview of the GA and QPSO clustering algorithms. The GA-QPSO clustering algorithm is described in Section 2. Section 3 presents the method used to represent each document for clustering and the way the similarity between documents is computed. Section 4 provides the detailed experimental setup and the comparison between our approach and other clustering algorithms, i.e. GA, PSO and QPSO. The conclusion is in Section 5.

1 Clustering algorithm

1.1 GA clustering

Genetic algorithm (GA) was presented by Professor J. H. Holland in the 1970s. It is an effective optimization method based on the principles of natural selection and genetics. In GAs, the parameters of the search space are encoded in the form of strings, called chromosomes. A collection of chromosomes is called a population, and a randomly distributed population is created first, like different points in the search space [6]. The process of solving the problem is represented as a chromosome evolution process. From the initial population, a number of individuals are selected with the selection strategy, according to the fitness function associated with each chromosome. Then the selection, crossover and mutation operators are applied to produce the next generation of the population. The evolution of generations goes on until the termination condition is satisfied. In the text clustering problem, a chromosome can be represented as:

X_i = (M_{i1}, \ldots, M_{ij}, \ldots, M_{ik})    (1)

where M_{ij} refers to the j-th cluster centroid vector of the i-th chromosome.

1.2 QPSO clustering

Quantum-behaved particle swarm optimization (QPSO), inspired by concepts from quantum mechanics and PSO, is a probabilistic global optimization algorithm. The particle in QPSO is assumed to be in a bound state and attracted by a quantum potential well centered on its local attractor, thus having a new stochastic update equation for its position [7]. It
has been proved that this iterative equation makes QPSO globally convergent [8], and the particles move according to the following iterative equations:

p_{id} = \varphi P_{id} + (1 - \varphi) P_{gd},  \varphi \sim U(0, 1)    (2)

X_{id} = p_{id} \pm \alpha \, |mbest_d - X_{id}| \, \ln(1/u),  u \sim U(0, 1)    (3)

where mbest is the mean best position among the personal best positions of all the particles in the whole population, that is:

mbest = \frac{1}{M} \sum_{i=1}^{M} P_i = \left( \frac{1}{M} \sum_{i=1}^{M} P_{i1}, \frac{1}{M} \sum_{i=1}^{M} P_{i2}, \ldots, \frac{1}{M} \sum_{i=1}^{M} P_{in} \right)    (4)

The local attractor p_{id}, a stochastic point between P_{id} and P_{gd}, lies on the d-th dimension of the i-th particle; \varphi is a random number distributed uniformly in (0, 1); u is another uniformly distributed random number in (0, 1); and \alpha, called the Contraction-Expansion Coefficient, is the only parameter that needs to be adjusted to control the search speed of QPSO.

2 Compound GA-QPSO clustering algorithm

The development of GA has been very mature, and GA has robust global search ability and better clustering performance compared to other clustering algorithms. In our method, GA as a global optimization algorithm starts with a fast rough clustering and yields initial clustering partitions. The particles in QPSO are then initialized by the documents assigned to those partitions, and each corresponding clustering center of all particles is initialized from the same category respectively, which also contributes to the work of calculating the mean best position of QPSO. The method is effective, and the global optimization performance of the QPSO algorithm is stronger than that obtained with random initialization.

Like chromosomes in GA, particles in PSO also refer to a number of potential solutions to the optimization problem. The particles are initialized by the clustering results of GA. The representation of particles takes the form of equation (1), which is the same as that of chromosomes in GA. The fitness function for chromosomes or particles is measured as the quantization:

Fitness(i) = \sum_{j=1}^{K} \sum_{d_i \in C_{ij}} \cos(d_i, M_{ij})    (5)

where d_i is a document vector belonging to cluster C_{ij}, M_{ij} refers to the j-th cluster centroid vector of the i-th chromosome for cluster C_{ij}, and \cos(d_i, M_{ij}) is the cosine measure between d_i and M_{ij}:

M_{ij} = (m_1, m_2, \ldots, m_n)    (6)

where n is the number of terms in the document d_i.

Some minor adjustments are also made to the parameters of the iterative formula according to the actual analysis of the text in our method.
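As a concrete illustration of the fitness function in equation (5), the following minimal Python sketch sums the cosine measure between each document vector and the centroid of its assigned cluster; the variable names and toy data are ours, not the paper's:

```python
import math

def cosine(a, b):
    # Cosine measure cos(d, M) between a document vector and a centroid.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def fitness(centroids, docs, assignment):
    # Equation (5): sum over clusters j and documents d_i in C_ij of
    # cos(d_i, M_ij); a larger value means documents lie closer to
    # their assigned centroids.
    return sum(cosine(d, centroids[j]) for d, j in zip(docs, assignment))

# Toy example: four documents and two centroids in a three-term space.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
centroids = [[1.0, 0.05, 0.0], [0.0, 1.0, 0.05]]
assignment = [0, 0, 1, 1]
print(round(fitness(centroids, docs, assignment), 2))  # → 3.99
```

Because every document here nearly coincides with its centroid, the fitness approaches its maximum of 4.0 (one per document), which is the behavior the optimizer exploits.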
Different from update equation (3), the adjusted iterative equation is as follows:

X_{id} = p_{id} \pm \alpha \, |mbest_d - X_{id}| \, u    (7)

where \alpha is a random number distributed uniformly in (0, 1) and u is another random constant in (0, 0.5], which limits the search space of the particles so as to prevent them from overstepping the boundary.

The GA-QPSO algorithm can be summarized as:

Step 1: GA clustering: the clustering process is the same as in Section 1.1.
Step 2: Particle initialization: the corresponding clustering centers of all particles are initialized by the same clustering partition yielded by GA.
Step 3: (a) Assign each document vector in the text collection to the closest centroid vector.
(b) Calculate the fitness value of the particles based on equation (5).
(c) Determine each particle's previous best position and the current global best position among the particles' best positions.
(d) Calculate the mean best position among the particles by equation (4) and get the stochastic point between P_{id} and P_{gd} by equation (2) for each dimension of the particle.
(e) Use equation (7) to generate the next solutions.
Step 4: Repeat Step 3 until one of the following termination conditions is satisfied.
(a) The change in centroid vectors is less than a predefined value.
(b) The pre-specified number of iterations is exceeded.
Step 5: Output the final solution obtained by the best particle.

3 Text clustering

3.1 Representation of documents

To apply any clustering algorithm to a dataset, documents must first be represented as vectors in the term space. Documents are represented by the widely used vector-space model (VSM); most methods for text clustering and information retrieval are based on it. The VSM is implemented by creating a term-document matrix and a vector for each document. In this model, each document is represented as a vector whose components are the occurrence frequencies of terms in the text. We represent each text as a vector d = (w_1, w_2, \ldots, w_n), where w_i is the term weight of the i-th indexing term in the text. Each dimension in the vector stands for a distinct term in the term space of the text collection. The term weight value denotes the significance of the term in a document.

3.2 Similarity metric

To use a clustering algorithm we need to judge the similarity between two documents represented by term vectors.
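The term vectors just mentioned are the frequency vectors of Section 3.1: one dimension per distinct term of the collection, weighted by occurrence count. A minimal sketch of that construction follows; the whitespace tokenization is illustrative only, not the paper's preprocessing:

```python
from collections import Counter

def build_vectors(texts):
    # Vector-space model: build the term space from all distinct terms
    # in the collection, then represent each text as a vector whose
    # components are the occurrence frequencies of the terms.
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({term for doc in tokenized for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for doc in tokenized:
        v = [0.0] * len(vocab)
        for term, count in Counter(doc).items():
            v[index[term]] = float(count)
        vectors.append(v)
    return vocab, vectors

vocab, vectors = build_vectors(["space shuttle launch", "hockey game hockey"])
print(len(vocab), vectors[1])  # 5 terms; 'hockey' gets weight 2.0
```

Every document vector shares the same dimensionality, which is what allows the angle-based comparison described next.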
The traditional method of determining the closeness of two vectors is to use the size of the angle between them. This angle is computed by use of the inner product. However, it is not necessary to use the actual angle; often the expression "similarity coefficient" is used instead of an angle. We used the cosine distance, which is represented as

\cos(d_i, d_j) = \frac{d_i \cdot d_j}{|d_i| \, |d_j|}    (8)

where d_i \cdot d_j = w_{i1} w_{j1} + w_{i2} w_{j2} + \ldots + w_{in} w_{jn}.

4 Experiments

In order to demonstrate the performance of GA-QPSO on real text data, we used the 20Newsgroup dataset. There are 500 documents in dataset1, from five categories: comp.os.ms-windows.misc, soc.religion.christian, misc.forsale, rec.sport.hockey and sci.space. In
dataset2, we used 800 documents, chosen from eight categories: sci.space, misc.forsale, comp.os.ms-windows.misc, rec.sport.hockey, soc.religion.christian, talk.politics.guns, comp.sys.mac.hardware and comp.graphics. A summary of the datasets used for the experiments is shown in Tab. 1.

Tab. 1 The dataset used for experiments
Dataset    Number of documents    Number of clusters    Number of all terms
tng500     500                    5                     9971

On the dataset, we ran each algorithm for twenty trials and recorded the average and maximum F-measure values that each algorithm achieved, as shown in Tab. 2 and Tab. 3.

Tab. 2 The average F-measure values produced by the various algorithms on the dataset
PSO         QPSO        GA          GA-QPSO
0.775688    0.753269    0.806737    0.886567

Tab. 3 The maximum F-measure values produced by the various algorithms on the dataset
PSO         QPSO        GA          GA-QPSO
0.850036    0.902578    0.880165    0.929634

From Tab. 2 and Tab. 3, GA-QPSO generates the best clustering results in terms of both average and maximum F-measure. From Tab. 3, the maximum F-measure of QPSO is higher than that of PSO, which indicates that the global search ability of QPSO exceeds that of PSO, although the average F-measure of QPSO is a little lower than PSO.
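The F-measure compared above is the harmonic mean of precision and recall for a cluster-class pair. The paper does not spell out how pairwise values are aggregated over all clusters, so only the standard pairwise form is sketched here:

```python
def f_measure(n_overlap, cluster_size, class_size):
    # Precision: fraction of the cluster that belongs to the class;
    # recall: fraction of the class captured by the cluster.
    precision = n_overlap / cluster_size
    recall = n_overlap / class_size
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: 40 of a class's 50 documents land in a cluster of size 60.
print(round(f_measure(40, 60, 50), 3))  # → 0.727
```

A perfect clustering (each cluster exactly one class) yields an F-measure of 1.0, which is why values closer to 1 in Tab. 2 and Tab. 3 indicate better clusterings.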
From Tab. 2 and Tab. 3, the average and maximum F-measure values of GA are always higher than those of PSO, which demonstrates that the clustering effect of GA is better than that of PSO.

Moreover, we use the fitness value to reflect the true variation in each generation. Fig. 1 shows the best, the worst and the common situations over 20 runs of the GA-QPSO algorithm; for comparison, the best of 20 runs of each of the other algorithms is also shown.

Fig. 1 Fitness values of the different algorithms on each generation for the dataset

From Fig. 1, we can see that GA-QPSO(Best) achieves the best performance, with a fitness value of 126.876. The results of GA-QPSO(Common) are inferior to GA-QPSO(Best) and superior to the other algorithms. That is to say, the searching performance of GA-QPSO is stable compared to sole GA or QPSO. Moreover, although its worst results are not as good as the best results of PSO and QPSO, it still performs better than GA, which converged quickly.

5 Conclusion

In this paper, a compound QPSO and GA algorithm for text clustering is proposed. The method enhances the global search ability of the QPSO algorithm and has very high searching efficiency. In order to verify the validity of the proposed clustering algorithm, experiments were conducted on the 20Newsgroup dataset. The experimental results showed that the GA-QPSO algorithm surpasses QPSO and GA alone, and has a better clustering effect in terms of F-measure. That is to say, the new compound clustering algorithm combining QPSO and GA is effective and feasible.

References
[1] Kowalski G. Information retrieval systems: theory and implementation [J]. Journal of the American Society for Informatio