版權(quán)說(shuō)明:本文檔由用戶(hù)提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡(jiǎn)介
聚類(lèi)與樹(shù)Clustering2024/9/91主要內(nèi)容Microarrays(微陣列)HierarchicalClustering(層次聚類(lèi)或系統(tǒng)聚類(lèi))K-MeansClustering(K-均值聚類(lèi))2024/9/92ApplicationsofClusteringViewingandanalyzingvastamountsofbiologicaldataasawholesetcanbeperplexingItiseasiertointerpretthedataiftheyarepartitionedintoclusterscombiningsimilardatapoints.2024/9/93InferringGeneFunctionalityResearcherswanttoknowthefunctionsofnewlysequencedgenesSimplycomparingthenewgenesequencestoknownDNAsequencesoftendoesnotgiveawaythefunctionofgeneFor40%ofsequencedgenes,functionalitycannotbeascertainedbyonlycomparingtosequencesofotherknowngenesMicroarraysallowbiologiststoinfergenefunctionevenwhensequencesimilarityaloneisinsufficienttoinferfunction.2024/9/94MicroarraysandExpressionAnalysisMicroarraysmeasuretheactivity(expressionlevel)ofthegenesundervaryingconditions/timepointsExpressionlevelisestimatedbymeasuringtheamountofmRNAforthatparticulargeneAgeneisactiveifitisbeingtranscribedMoremRNAusuallyindicatesmoregeneactivity2024/9/95MicroarrayExperimentsProducecDNAfrommRNA(DNAismorestable)AttachphosphortocDNAtoseewhenaparticulargeneisexpressedDifferentcolorphosphorsareavailabletocomparemanysamplesatonceHybridizecDNAoverthemicroarrayScanthemicroarraywithaphosphor-illuminatinglaserIlluminationrevealstranscribedgenesScanmicroarraymultipletimesforthedifferentcolorphosphor’s2024/9/96MicroarrayExperiments(con’t)PhosphorscanbeaddedhereinsteadTheninsteadofstaining,laserilluminationcanbeused2024/9/97UsingMicroarraysEachboxrepresentsonegene’sexpressionovertime
TrackthesampleoveraperiodoftimetoseegeneexpressionovertimeTracktwodifferentsamplesunderthesameconditionstoseethedifferenceingeneexpressions2024/9/98UsingMicroarrays(cont’d)Green:expressedonlyfromcontrolRed:expressedonlyfromexperimentalcellYellow:equallyexpressedinbothsamplesBlack:NOTexpressedineithercontrolorexperimentalcells2024/9/99MicroarrayDataMicroarraydataareusuallytransformedintoanintensitymatrix(below)Theintensitymatrixallowsbiologiststomakecorrelationsbetweendiferentgenes(eveniftheyaredissimilar)andtounderstandhowgenesfunctionsmightberelatedTime:TimeXTimeYTimeZGene110810Gene21009Gene348.63Gene4783Gene5123Intensity(expressionlevel)ofgeneatmeasuredtime2024/9/910MicroarrayData-REVISION-showinthematrixwhichgenesaresimilarandwhicharenot.Microarraydataareusuallytransformedintoanintensitymatrix(below)Theintensitymatrixallowsbiologiststomakecorrelationsbetweendiferentgenes(eveniftheyaredissimilar)andtounderstandhowgenesfunctionsmightberelatedClusteringcomesintoplayTime:TimeXTimeYTimeZGene110810Gene21009Gene348.63Gene4783Gene5123Intensity(expressionlevel)ofgeneatmeasuredtime2024/9/911ClusteringofMicroarrayDataPloteachdatumasapointinN-dimensionalspaceMakeadistancematrixforthedistancebetweeneverytwogenepointsintheN-dimensionalspaceGeneswithasmalldistancesharethesameexpressioncharacteristicsandmightbefunctionallyrelatedorsimilar.Clusteringrevealgroupsoffunctionallyrelatedgenes2024/9/912ClusteringofMicroarrayData(cont’d)Clusters2024/9/913HomogeneityandSeparationPrinciplesHomogeneity:ElementswithinaclusterareclosetoeachotherSeparation:Elementsindifferentclustersarefurtherapartfromeachother…clusteringisnotaneasytask!Giventhesepointsaclusteringalgorithmmightmaketwodistinctclustersasfollows2024/9/914BadClusteringThisclusteringviolatesbothHomogeneityandSeparationprinciplesClosedistancesfrompointsinseparateclustersFardistancesfrompointsinthesamecluster2024/9/915GoodClusteringThisclusteringsatisfiesboth
HomogeneityandSeparationprinciples2024/9/916ClusteringTechniquesAgglomerative:Startwitheveryelementinitsowncluster,anditerativelyjoinclusterstogetherDivisive:StartwithoneclusteranditerativelydivideitintosmallerclustersHierarchical:Organizeelementsintoatree,leavesrepresentgenesandthelengthofthepathesbetweenleavesrepresentsthedistancesbetweengenes.Similargenesliewithinthesamesubtrees2024/9/917HierarchicalClustering2024/9/918HierarchicalClustering:Example2024/9/919HierarchicalClustering:Example2024/9/920HierarchicalClustering:Example2024/9/921HierarchicalClustering:Example2024/9/922HierarchicalClustering:Example2024/9/923HierarchicalClustering(cont’d)HierarchicalClusteringisoftenusedtorevealevolutionaryhistory2024/9/924HierarchicalClusteringAlgorithmHierarchicalClustering(d
,n)FormnclusterseachwithoneelementConstructagraphTbyassigningonevertextoeachclusterwhilethereismorethanoneclusterFindthetwoclosestclustersC1andC2
MergeC1andC2intonewclusterCwith|C1|+|C2|elementsComputedistancefromCtoallotherclustersAddanewvertexCtoTandconnecttoverticesC1andC2RemoverowsandcolumnsofdcorrespondingtoC1andC2Addarowandcolumntod
corrspondingtothenewclusterC
returnTThealgorithmtakesanxndistancematrixdofpairwisedistancesbetweenpointsasaninput.2024/9/925HierarchicalClusteringAlgorithmHierarchicalClustering(d
,n)FormnclusterseachwithoneelementConstructagraphTbyassigningonevertextoeachclusterwhilethereismorethanoneclusterFindthetwoclosestclustersC1andC2
MergeC1andC2intonewclusterCwith|C1|+|C2|elements
ComputedistancefromCtoallotherclustersAddanewvertexCtoTandconnecttoverticesC1andC2RemoverowsandcolumnsofdcorrespondingtoC1andC2Addarowandcolumntod
corrspondingtothenewclusterC
returnTDifferentwaystodefinedistancesbetweenclustersmayleadtodifferentclusterings2024/9/926HierarchicalClustering:RecomputingDistances
dmin(C,C*)=mind(x,y)
forallelementsxinCandyinC*Distancebetweentwoclustersisthesmallestdistancebetweenanypairoftheirelements
davg(C,C*)=(1/|C*||C|)∑d(x,y)
forallelementsxinCandyinC*Distancebetweentwoclustersistheaveragedistancebetweenallpairsoftheirelements2024/9/927系統(tǒng)聚類(lèi)例:微陣列數(shù)據(jù)2024/9/928評(píng)估表達(dá)模式的相似性?xún)尚袛?shù)據(jù)之間的相似性或者距離如何量化。歐幾里德距離。采用pearson相關(guān)系數(shù)r(-1,1)。如果兩個(gè)基因之間r為1,說(shuō)明兩個(gè)數(shù)據(jù)表達(dá)模式吻合得很好如果兩個(gè)基因之間r為-1,也說(shuō)明兩個(gè)數(shù)據(jù)表達(dá)模式吻合得很好(一上升,一下降)r=0,則說(shuō)明表達(dá)模式之間沒(méi)什么相關(guān)性2024/9/929數(shù)據(jù)標(biāo)準(zhǔn)化計(jì)算第2個(gè)和第10個(gè)基因的平均值和標(biāo)準(zhǔn)方差減去平均值,然后除以標(biāo)準(zhǔn)方差,得到每行的標(biāo)準(zhǔn)化數(shù)據(jù)2024/9/930求pearson相關(guān)系數(shù)求經(jīng)過(guò)標(biāo)準(zhǔn)化以后的兩向量的內(nèi)積,再除以元素個(gè)數(shù)2024/9/931分析基因2,11與基因6,10之間表達(dá)比值正好各自相反,因此相關(guān)系數(shù)r(2,11),r(6,10)應(yīng)該是-1。2024/9/932數(shù)據(jù)標(biāo)準(zhǔn)化以后基因兩兩之間的相關(guān)系數(shù)2024/9/933根據(jù)相關(guān)系數(shù)進(jìn)行聚類(lèi)(層次聚類(lèi)法)1,計(jì)算所有元素兩兩之間的距離(相關(guān)系數(shù)),創(chuàng)建一個(gè)距離矩陣。每個(gè)元素就是一個(gè)類(lèi),僅僅包含它自己。2,尋找距離最小的兩個(gè)類(lèi)(相關(guān)系數(shù)最大)。3,將這兩個(gè)類(lèi)合并為一個(gè)新的類(lèi)。新的類(lèi)替換這兩個(gè)類(lèi),重新計(jì)算所有的距離,修改相似性矩陣。4,重復(fù)2,3步驟直到所有的類(lèi)聚集為一個(gè)類(lèi)。2024/9/934迭代過(guò)程首先會(huì)發(fā)現(xiàn)r(5,10)=1,然后把基因5和10歸為一類(lèi),然后需要重新計(jì)算距離矩陣。2024/9/935聚類(lèi)圖2024/9/936主要內(nèi)容Microarrays(微陣列)HierarchicalClustering(層次聚類(lèi)或系統(tǒng)聚類(lèi))K-MeansClustering(K-均值聚類(lèi))2024/9/937SquaredErrorDistortion(平方誤差失真)Givenadatapoint
vandasetofpointsX,definethedistancefromvtoX
d(v,X)asthe(Eucledian)distancefromvtotheclosestpointfromX.Givenasetofndatapoints
V={v1…vn}andasetofkpointsX,definetheSquaredErrorDistortion
d(V,X)=∑d(vi,X)2/n1<
i
<
n
2024/9/938K-MeansClusteringProblem:FormulationInput:Aset,V,consistingofnpointsandaparameterkOutput:AsetXconsistingofkpoints(clustercenters)thatminimizesthesquarederrordistortiond(V,X)overallpossiblechoicesofX。2024/9/9391-MeansClusteringProblem:anEasyCaseInput:Aset,V,consistingofnpointsOutput:Asinglepointsx(clustercenter)thatminimizesthesquarederrordistortiond(V,x)overallpossiblechoicesofx。
2024/9/9401-MeansClusteringProblem:anEasyCaseInput:Aset,V,consistingofnpointsOutput:Asinglepointsx(clustercenter)thatminimizesthesquarederrordistortiond(V,x)overallpossiblechoicesofx
1-MeansClusteringproblemiseasy.However,itbecomesverydifficult(NP-complete)formorethanonecenter.AnefficientheuristicmethodforK-MeansclusteringistheLloydalgorithm
2024/9/941K-MeansClustering:LloydAlgorithmLloydAlgorithmArbitrarilyassignthekclustercenterswhiletheclustercenterskeepchangingAssigneachdatapointtotheclusterCi correspondingtotheclosestcluster representative(center)(1≤i≤k)Aftertheassignmentofalldatapoints, computenewclusterrepresentatives
accordingtothecenterofgravityofeach cluster,thatis,thenewclusterrepresentativeisforallvinCforeveryclusterC
*Thismayleadtomerelyalocallyoptimalclustering.2024/9/942x1x2x32024/9/943x1x2x32024/9/944x1x2x32024/9/945x1x2x32024/9/946ConservativeK-MeansAlgorithmLloydalgorithmisfastbutineachiterationitmovesmanydatapoints,notnecessarilycausingbetterconvergence.AmoreconservativemethodwouldbetomoveonepointatatimeonlyifitimprovestheoverallclusteringcostThesmallertheclusteringcostof
溫馨提示
- 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶(hù)所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶(hù)上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶(hù)上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶(hù)因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。
最新文檔
- 二零二五年度定制房地產(chǎn)項(xiàng)目投資合同2篇
- 2025年人教五四新版九年級(jí)化學(xué)上冊(cè)月考試卷
- 二零二五年度農(nóng)業(yè)現(xiàn)代化公對(duì)公轉(zhuǎn)賬貸款協(xié)議3篇
- 2025年人教版一年級(jí)英語(yǔ)上冊(cè)月考試卷含答案
- 二零二五年度安徽裝配式建筑工程施工合同正規(guī)范本
- 二零二五年度房產(chǎn)抵押經(jīng)營(yíng)性租賃經(jīng)營(yíng)權(quán)租賃權(quán)轉(zhuǎn)讓合同范本2篇
- 2025年蘇教版選擇性必修2歷史下冊(cè)階段測(cè)試試卷含答案
- 二零二五年度醫(yī)療健康保險(xiǎn)產(chǎn)品設(shè)計(jì)與管理合同3篇
- 2025年湘教新版高二數(shù)學(xué)下冊(cè)月考試卷
- 2025年滬科新版八年級(jí)科學(xué)下冊(cè)月考試卷
- 窺見(jiàn)中華文明之光- 高中語(yǔ)文統(tǒng)編版(2022)必修下冊(cè)第一單元整體教學(xué)設(shè)計(jì)
- 2024年工程部年終總結(jié)
- 七年級(jí)上冊(cè)道德與法治2023-2024期末試題附答案系列
- 內(nèi)科護(hù)理學(xué)重點(diǎn)總結(jié)
- 2019年海南省公務(wù)員考試申論真題(甲類(lèi))
- 事業(yè)部制改革方案
- 2025屆廣東省揭陽(yáng)市高一生物第一學(xué)期期末統(tǒng)考模擬試題含解析
- CSR報(bào)告與可持續(xù)發(fā)展
- 蘇教版五年級(jí)上冊(cè)小數(shù)混合運(yùn)算300道及答案
- 第一章問(wèn)題解決策略:分類(lèi)討論 教案 2024-2025學(xué)年 魯教版(五四制)六年級(jí)數(shù)學(xué)上冊(cè)
- 期末復(fù)習(xí)知識(shí)點(diǎn)-2024-2025學(xué)年統(tǒng)編版道德與法治九年級(jí)上冊(cè)
評(píng)論
0/150
提交評(píng)論