The 4th "Certification Cup" Mathematics China International Mathematical Modeling Contest

Letter of Commitment

We have carefully read the rules of the 4th "Certification Cup" Mathematics China International Mathematical Modeling Contest.

We fully understand that, once the contest has begun, team members may not research or discuss matters related to the contest problem with anyone outside the team (including the advisor) by any means (including telephone, e-mail and online consultation).

We know that plagiarizing the work of others violates the contest rules. Any work of others or other publicly available material (including material found online) that we cite must be listed explicitly, in the prescribed reference format, both at the point of citation in the text and in the references.

We solemnly pledge to abide strictly by the contest rules in order to keep the contest fair and impartial. If we violate the contest rules, we will be dealt with seriously.

We allow the Mathematics China website to publish the paper for exchange and study among its users; the website does not need our prior consent to share the paper for non-commercial purposes.

Our team number: 1235
Problem chosen: Problem C
Team members (signatures):
Member 1: 王東全
Member 2: 吳卓其
Member 3: 周洋
Team coach (signature): 楊劍波

Team # 1235

The 4th "Certification Cup" Mathematics China International Mathematical Modeling Contest

Numbering Page

Team number (to be filled in by each team in advance): 1235
Unified contest number (assigned by the organizing committee before submission to the judging panel):
Contest review number (assigned by the judging panel before review):

Using Data Mining Techniques for Detecting Terror-Related Activities on the Web

Abstract:

The number of terror attacks is increasing year by year. On November 13, 2015, the terrorist attack that took place in Paris caused hundreds of casualties. The dangers of cyber terrorism are becoming increasingly serious. The USA has enacted a number of laws aimed at the prevention of cyber terrorism, such as the "USA PATRIOT Act". It is necessary to establish a model that prevents the spread of terrorism over networks and that monitors and finds people with a tendency to terrorism. The Internet behavior analysis and risk assessment model (IBARA) was established to assess the Internet behavior of the people who are monitored. In this paper, based on IBARA, we not only study the relationship between people's Internet behavior and their possible terrorist tendency, but also analyze and discuss the relative quantitative risk index of individual terrorism tendency and the relevant strategies to prevent terrorist attacks.

Firstly, the Internet behavior was divided into two parts: Web text and image. A vector-space word-frequency analysis algorithm was adopted to establish the personal tendency of terrorism risk index sub module (PTTRISM), which can predict people's tendency to terrorism. In PTTRISM, this paper analyzes the behavior of an individual's Web text using keyword extraction and frequency analysis techniques. According to the analysis results, the paper assigns a value to the individual's terrorism risk index. Using PTTRISM to analyze the data sample, we concluded that most people who have accessed terrorism-related information are unlikely to become potential terrorists. PTTRISM can calculate people's risk index for the tendency to terrorism by analyzing their Internet behavior.

Secondly, the object of network monitoring is in fact not one person but a large number of people, which makes the monitoring data too large and complex. In order to facilitate the rapid and efficient classification and analysis of big data, a big data clustering statistics sub module (MDCSSM) is established based on density-based clustering. At the same time, in order to shorten the computing time of the MDCSSM, this paper adopts the standard particle swarm optimization (PSO) with the weight-shrink factor. It achieves effective, fast and automatic clustering analysis of datasets. Validation of the sub-model on data shows that the model can be used to analyze a large amount of data. Due to the scarcity of monitoring data, we use some frequently tested public datasets, "Iris", "Glass", "Wine" and "Aggregation", in place of the monitoring data to verify the clustering algorithm. The clustering results demonstrate that the clustering algorithm can categorize the monitoring datasets in an effective, fast and automatic manner.

Finally, based on IBARA, we propose the following suggestions to President Obama about fighting terrorism:

1. Put more resources into counter-terrorism on the network. A User Online Behavior and Psychology Monitoring System could be built to monitor and assess the behavior of the public.
2. Establish an information security evaluation system to weaken and even prevent terrorist propaganda through the network.
3. Strengthen public anti-terrorism education and raise public awareness of anti-terrorism.

Due to time constraints, the model still has some defects that need to be improved. In the PTTRI sub module, factors of voice and
image files are not considered. In the MDCS sub module, the selection of the adaptive (fitness) function in the clustering analysis could be further improved. With further improvement of the model, we will get more accurate results.

Key words: PSO, word frequency analysis algorithm, density-based clustering, terrorism, Internet behavior

Contents

I. Introduction
II. The Description of the Problem
2.1 Our Approach to the Whole Course of Data Mining for Terrorists on Websites
2.2 The Differences in Weights and Sizes of Available Data
III. IBARA
3.1 PTTRISM
3.1.1 Terms, Definitions and Symbols in PTTRISM
3.1.2 Assumptions in PTTRISM
3.1.3 The Model of Terrorism-Related Website Browsing and Vector Space Models of Lexical Meaning
3.1.4 The Model of Risk Index
3.1.5 Solutions and Results for PTTRISM
3.1.6 Strength and Weakness in PTTRISM
3.2 MDCSSM
3.2.1 Extra Symbols
3.2.2 Additional Assumptions
3.2.3 The Foundation of MDCSSM to Categorize Big Data
3.2.4 The Results of MDCSSM
3.2.5 Strength and Weakness
IV. Conclusions
4.1 Conclusions of the Problems
4.2 Methods Used in our Models
4.3 Applications of our Models
V. Proposal to Fighting Terrorism
VI. References

I. Introduction

In order to indicate the origin of web-related terrorism problems, the following background is worth mentioning.

Terrorist cells are using the Internet infrastructure to exchange information and recruit new members and supporters [1, 2] (Lemos 2002; Kelley 2002). For example, high-speed Internet connections were used intensively by members of the infamous Hamburg Cell that was largely responsible for the preparation of the September 11 attacks against the United States [3] (Corbin 2002). This is one reason for the major effort made by law enforcement agencies around the world in gathering information from the Web about terror-related activities. It is believed that the detection of terrorists on the Web might prevent further terrorist attacks [2] (Kelley 2002). One way to detect terrorist activity on the Web is to eavesdrop on all traffic of Web sites associated with terrorist organizations in order to detect the accessing users based on their IP address. Unfortunately, it is difficult to monitor terrorist sites such as Azzam Publications [3] (Corbin 2002), since they do not use fixed IP addresses and URLs. The geographical locations of Web servers hosting those sites also change frequently in order to prevent successful eavesdropping. To overcome this problem, law enforcement agencies are trying to detect terrorists by monitoring all ISPs' traffic [4] (Ingram 2001), though the privacy issues raised still prevent relevant laws from being enforced.

Figure 1: The annual number of terrorist attacks from 1968 to 2009

II. The Description of the Problem

2.1 Our Approach to the Whole Course of Data Mining for Terrorists on Websites

1) How often the monitored Internet user visits websites that contain terrorism-related information and terrorist propaganda.
2) The lexical meaning of the contents of their emails, chats, post views and downloaded text files.
3) For other file formats, such as videos, images and audio, image description and voice recognition techniques are used as tools to detect terrorists.
4) For categorizing the monitoring data, clustering techniques are adopted to partition the data in an effective, fast and automatic manner.
5) Present some useful suggestions to President Obama for fighting terrorism.

2.2 The Differences in Weights and Sizes of Available Data

Because the collected datasets differ from one another, it is necessary to preprocess the available data, such as text datasets, numerical datasets, image datasets and even voice datasets.

1) The Preprocess of Text Data: remove non-alphabetical characters from the text dataset and put the text into MATLAB cell structures.
2) The Preprocess of Image Data: remove non-imagery information from the image datasets and convert the RGB images into gray-value images. If the image datasets are polluted by noise, it is necessary to denoise the images before analyzing the relevant information.
3) The Preprocess of Voice: if the audio datasets are polluted by noise, audio-denoising steps need to be implemented before digging out the auditory information.
4) The Preprocess of Numerical Dataset: because the data samples differ in units and magnitudes, the numerical dataset needs to be normalized and standardized.

III. IBARA

3.1 PTTRISM

In this paper a new
methodology is proposed to detect users accessing terrorism-related information, based on frequency-analysis techniques, vector space models of lexical meaning [5], image description [6], voice recognition [7] and data clustering.

3.1.1 Terms, Definitions and Symbols in PTTRISM

The symbols and definitions below mostly arise from the models in this paper.

$R$ is the risk index, which denotes the degree of risk that the Internet user may pose.
$P_t^{\text{text}}$ is the degree to which the text contents that the Internet user is involved with are related to terrorism during the time interval $t$.
$P_t^{\text{image}}$ is the degree to which the images that the Internet user browses and downloads are related to terrorism during the time interval $t$.
$P_t^{\text{audio}}$ is the degree to which the audio that the Internet user listens to and downloads is related to terrorism during the time interval $t$.
$w_{i,j}$ is the weight factor of the vector space $q$.

3.1.2 Assumptions in PTTRISM

The main design criteria for the proposed methodology are:

1) Training the detection algorithm should be based on the content of existing terrorist sites and known terrorist traffic on the Web.
2) Detection should be carried out in real time. This goal can be achieved only if terrorist information interests are represented in a compact manner for efficient processing.
3) The detection sensitivity should be controlled by user-defined parameters to enable calibration of the desired detection performance.
4) No information related to terrorism is encrypted by cipher algorithms such as RSA.
5) All information that can be monitored is presented as images, audio and text.
6) The social attributes of the monitored person are neglected; only the network properties are considered.

3.1.3 The Model of Terrorism-Related Website Browsing and Vector Space Models of Lexical Meaning

One major issue in this model is the representation of the textual content of Web pages. More specifically, there is a need to represent the content of terror-related pages as against the content of a currently accessed page in order to efficiently compute the similarity between them [9]. This study uses the vector-space model commonly used in Information Retrieval applications to represent terrorists' interests and each accessed Web page. In the vector-space model, the weight $w_{i,j}$ associated with a pair $(k_i, d_j)$ is positive and non-binary. Further, the index terms in the query $q$ are also weighted. Let $w_{i,q}$ be the weight associated with the pair $(k_i, q)$, where $w_{i,q} \ge 0$. Then the query vector is defined as $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$, where $t$ is the total number of index terms. The vector for a document $d_j$ is represented by $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$. The vector model proposes to evaluate the degree of similarity of the document $d_j$ with regard to the
query $q$ as the correlation between the vectors $\vec{d}_j$ and $\vec{q}$. This correlation can be measured by the cosine of the angle between these two vectors:

$$sim = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j|\,|\vec{q}|} \tag{3-1-1}$$

where $|\vec{d}_j|$ and $|\vec{q}|$ are the norms of the document and query vectors. In the vector space model, the frequency of a term $k_i$ inside a document $d_j$ is

$$freq_{i,j} = \frac{n_i}{N_j} \tag{3-1-2}$$

The normalized frequency of term $k_i$ inside a document $d_j$ is given by

$$f_{i,j} = \frac{freq_{i,j}}{\max(freq_{i,j})} \tag{3-1-3}$$

The best-known term-weighting schemes use weights which are given by

$$w_{i,j} = -f_{i,j}\,\log(freq_{i,j}) \tag{3-1-4}$$

In this paper each Web page is considered as a document and is represented as a vector. The terrorists' interests are represented by several vectors, where each vector relates to a different topic of interest. The query of the methodology defines and represents the typical behavior of terrorist users based on the content of their Web activities. The query is based on a set of Web pages that were downloaded from terrorist-related sites and is the main input of the detection algorithm. It is assumed that it is possible to collect Web pages from terror-related sites. The content of the collected pages is the input to the
Vector Generator module that converts the pages into vectors of weighted terms [10] (each page is converted to one vector).

In order to define the degree to which the Internet user browses terrorism-related websites during the time interval $t$, the function $b(m)$, which measures how much the Internet user behaves like a potential terrorist when browsing the website $m$, is defined as follows:

$$b(m) = sim \cdot \chi(sim) \tag{3-1-5}$$

where

$$\chi(x) = \begin{cases} 1, & x \ge threshold \\ 0, & x < threshold \end{cases}$$

$$P_t^{\text{text}} = \frac{\sum b(m)}{M} \tag{3-1-6}$$

In this paper, we adopt 0.5 as the value of threshold. The query in this paper is listed in the table below.

ID    Details of Queries
1     Bomb Suicide
2     Gunfire
3     Kidnap
4     Massacre
5     Attack to Civilians
6     Islamic State of Iraq and al Shams
7     Qaeda
8     al-Shabaab
9     Islamic State
10    hijack
11    Assassination

3.1.4 The Model of Risk Index

Here we report the remarkable finding that identical patterns of violence are currently emerging within these different international arenas. Not only have the wars in Iraq and Colombia evolved to yield the same power-law behavior, but this behavior is currently of the same quantitative form as the war in Afghanistan and global terrorism in non-G7 countries. Not only is the model's power-law behavior in excellent agreement with the data from Iraq, Colombia and non-G7 terrorism, it is also consistent with data obtained from the recent war in Afghanistan. Power-law distributions are known to arise in a large number of physical, biological, economic and social systems. In the present context, a power-law distribution means that the probability that an event will occur with behavior $P$ is given by [12]

$$R(P) = C P^{-\alpha} \tag{3-1-7}$$

where $P \in (0, 1]$, $P = P_t^{\text{text}}$, and $C$ and $\alpha$ are positive coefficients [13, 14]. Previous studies have shown that the distribution obtained from past terrorist attacks exhibits a power law with [15] $\alpha = 1.809$.

Since we cannot obtain the coefficient $C$ effectively, we define a relative risk index $r$ among a group of people who are monitored during the specific time interval as follows:

$$r = \frac{R_\alpha(P)}{\sum R_\alpha(P)} \tag{3-1-8}$$

3.1.5 Solutions and Results for PTTRISM

1) The Solution Steps to PTTRISM

1. Generation of Term-Frequency matrix

This is the term-frequency matrix of all unique terms in the documents $d_j$, $j = 1, 2, \ldots, N$. The term-document matrix $Freq$ is an $M \times N$ matrix, with unique terms $t_i$, $i = 1, 2, \ldots, M$, in the dictionary and $N$ documents. The elements of $Freq$ are denoted $freq_{i,j}$, where each element indicates the frequency of the $i$-th term in the $j$-th document.

The Cranfield data collection is preprocessed and converted into 1398 individual text files. Non-embedding special characters and numerals have also been removed from these files. In total, 79,728 words have been collected, which are then processed to find the frequency of unique words in each document. The dictionary of unique words contains 7805 words. Thus the term-frequency matrix is of size $7805 \times 1398$.

2. Generation of Query matrix and Term-weight calculations and result

After stop-list words and non-embedding special characters are removed, each document title is used as a query, which contributes to the
set of 1398 unique queries represented as $q$. Here, we have taken the titles of the documents as queries instead of the dataset queries so as to judge the relevancy more profoundly. The generated matrix for the 1398 queries is $Q_{1398 \times 7805}$. The term-frequency matrix is processed to get the term weights according to the term-weighting schemes.

2) The Results of PTTRISM

Figure 2: Index terms in a dictionary

Figure 2 shows the distribution of index terms in the dictionary for individual documents. The dictionary consists of 7805 unique terms.

Figure 3: Frequency count of each unique term among data collection

Figure 3 shows the frequency count of each unique term in the dictionary across the complete dataset. Some of the unique terms with high frequency across all documents are shown, such as (ISIS, 2059), (Qaeda, 1245), (hijack, 1076) and (Assassination, 897).

Figure 4: The distribution of the P value among the monitored persons

Figure 5: The distribution of the r value among the monitored persons

In Figure 5, we define 0.1 as the threshold of the risk index. If a person's risk index exceeds 0.1, he or she may become a potential terrorist; otherwise he or she is more likely to be an ordinary person.

From the 1398 individual text files obtained from 1398 individuals, we can easily conclude that most people who have accessed terrorism-related information are not likely to become potential terrorists. Only 12 of the monitored persons are likely to become potential terrorists; besides, all their risk indexes are beyon
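
The scoring chain of Section 3.1, from cosine similarity (3-1-1) through the browsing score (3-1-5) and text degree (3-1-6) to the relative risk index (3-1-7, 3-1-8), can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation (the paper mentions MATLAB for preprocessing). The function names, the toy vectors and the per-user similarity lists are hypothetical. It is also assumed here that $M$ in (3-1-6) is the number of pages the user visited during the interval, which the text does not state explicitly.

```python
import math


def cosine_similarity(d, q):
    """Eq. (3-1-1): cosine of the angle between a page vector d and the query vector q."""
    dot = sum(x * y for x, y in zip(d, q))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(y * y for y in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # assumption: an all-zero vector counts as unrelated
    return dot / (norm_d * norm_q)


def browsing_score(sim, threshold=0.5):
    """Eq. (3-1-5): b(m) = sim * chi(sim), with chi = 1 if sim >= threshold, else 0."""
    return sim if sim >= threshold else 0.0


def p_text(similarities, threshold=0.5):
    """Eq. (3-1-6): sum of b(m) divided by M (assumed: number of visited pages)."""
    if not similarities:
        return 0.0
    total = sum(browsing_score(s, threshold) for s in similarities)
    return total / len(similarities)


def relative_risk(p_values, alpha=1.809):
    """Eqs. (3-1-7) and (3-1-8): R(P) = C * P**(-alpha), r = R / sum(R).

    The unknown constant C cancels in the ratio, so it is omitted.
    """
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("every P must lie in (0, 1], as required by Eq. (3-1-7)")
    raw = [p ** (-alpha) for p in p_values]
    total = sum(raw)
    return [x / total for x in raw]


# Hypothetical example: similarity of each visited page to the query, per user.
users = {
    "u1": [0.9, 0.7, 0.1],
    "u2": [0.6, 0.2],
    "u3": [0.55],
}
p_by_user = {name: p_text(sims) for name, sims in users.items()}
r_by_user = dict(zip(p_by_user, relative_risk(list(p_by_user.values()))))
for name in users:
    print(name, round(p_by_user[name], 3), round(r_by_user[name], 3))
```

The relative indexes sum to 1 over the monitored group, so a fixed cut-off such as 0.1 depends on the group size. Note also that, with the exponent exactly as printed in (3-1-7), smaller values of $P$ yield larger values of $R$; the sketch follows the formula as written.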
