WGCNA新手入門筆記2(含代碼和數(shù)據(jù))

上傳人：m*** IP屬地：貴州上傳時間：2020-05-17 格式：DOC 頁數(shù)：13 大小：30.50KB 積分：20 舉報 版權(quán)申訴

已閱讀5頁，還剩8頁未讀，繼續(xù)免費閱讀

版權(quán)說明：本文檔由用戶提供并上傳，收益歸屬內(nèi)容提供方，若內(nèi)容存在侵權(quán)，請進行舉報或認領(lǐng)

文檔簡介

WGCNA新手入門筆記2（含代碼和數(shù)據(jù)）上次我們介紹了WGCNA的入門（WGCNA新手入門筆記（含代碼和數(shù)據(jù)），大家在安裝WGCNA包的時候，可能會遇到GO.db這個包安裝不了的問題。主要問題應(yīng)該是出在電腦的防火墻，安裝時請關(guān)閉防火墻。如果還有問題，請先單獨安裝AnnotationDbi這個包，biocLite(AnnotationDbi)再安裝GO.db，并嘗試從本地文件安裝該包。如果還有問題，請使用管理員身份運行R語言，嘗試上述步驟。另外如果大家問題解決了請在留言處留個言，告知大家是在哪一步解決了問題，謝謝！因為本人沒有進行單因素實驗，不知道到底是哪個因素改變了實驗結(jié)果。今天給大家過一遍代碼。網(wǎng)盤中有代碼和數(shù)據(jù)。鏈接：/s/1bpvu9Dt密碼：w7g4#導(dǎo)入數(shù)據(jù)#library(WGCNA)options(stringsAsFactors = FALSE)enableWGCNAThreads() enableWGCNAThreads()是指允許R語言程序最大線程運行，像我這電腦是4核CPU的，那么就能用上3核：當然如果當前電腦沒別的事，也可以滿負荷運作samples=read.csv( Sam_info.txt,sep = t,s = 1)expro=read.csv( ExpData.txt,sep = t,s = 1)dim(expro) 這部分代碼是為了讓R語言讀取外部數(shù)據(jù)。當然了在讀取數(shù)據(jù)之前首先改變一下工作目錄，這一點在周二的文章中提過了。R語言讀取外部數(shù)據(jù)的方式常用的有read.table和read.csv，這里用的是read.csv，想要查看某一函數(shù)的具體參數(shù)，可以用？函數(shù)名查看，比如：大家可以注意到read.table和read.csv中header參數(shù)的默認值是不同的，header=true表示第一行是標題，第二行才是數(shù)據(jù)，header=false則表示第一行就是數(shù)據(jù)，沒有標題。#篩選方差前25%的基因#m.vars=apply(expro, 1,var)expro.upper=exprowhich(m.varsquantile(m.vars, probs = seq( 0, 1, 0.25) 4),dim(expro.upper)datExpr= as.data.frame(t(expro.upper);nGenes = ncol(datExpr)nSamples = nrow(datExpr) 這一步是為了減少運算量，因為一個測序數(shù)據(jù)可能會有好幾萬個探針，而可能其中很多基因在各個樣本中的表達情況并沒有什么太大變化，為了減少運算量，這里我們篩選方差前25%的基因。當然了，以下幾種情況，你可以忽略此步，轉(zhuǎn)而執(zhí)行下列代碼datExpr= as.data.frame(t(expro);nGenes = ncol(datExpr)nSamples = nrow(datExpr) 情況1：所用的數(shù)據(jù)是比較老的芯片數(shù)據(jù)，探針數(shù)量較少；情況2：你的電腦足夠強大，不必減少運算；情況3：你第一步導(dǎo)入的數(shù)據(jù)是差異基因的數(shù)據(jù)，已經(jīng)作過初篩，比如這一文章的套路（PMID：）（個人不建議這么做）。#樣本聚類檢查離群值#gsg = goodSamplesGenes(datExpr, verbose = 3);gsg$allOKsampleTree = hclust(dist(datExpr), method = average)plot(sampleTree, main = Sample clustering to detect outliers, sub= , xlab= )save(datExpr, file = FPKM-01-dataInput.RData) 執(zhí)行這一段代碼我們會得到下面這張圖：從結(jié)果上來，我們分析的樣本沒啥離群值，所以代碼里說不作處理。在網(wǎng)上的一個案例中，離群的樣本就比較明顯了。（https:/www.shengxin.ren/article/88）如果需要去除離群樣本，則執(zhí)行下列代碼，其中cutHeight=多少就看你自己了。clust = cutreeStatic(sampleTree, cutHeight = 20000, minSize = 10)table(clust)keepSamples = (clust= 1)datExpr = datExprkeepSamples, nGenes = ncol(datExpr)nSamples = nrow(datExpr)save(datExpr, file = FPKM-01-dataInput.RData) 執(zhí)行上述代碼的話，就會去掉8個樣本#軟閾值篩選#powers = c(c( 1: 10), seq( from= 12, to= 20, by= 2)sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)par(mfrow = c( 1, 2);cex1 = 0.9;plot(sft$fitIndices, 1, -sign(sft$fitIndices, 3)*sft$fitIndices, 2, xlab= Soft Threshold (power),ylab= Scale Free Topology Model Fit,signed R2,type= n, main = paste( Scale independence);text(sft$fitIndices, 1, -sign(sft$fitIndices, 3)*sft$fitIndices, 2, labels=powers,cex=cex1,col= red);abline(h= 0.90,col= red)plot(sft$fitIndices, 1, sft$fitIndices, 5, xlab= Soft Threshold (power),ylab= Mean Connectivity, type= n, main = paste( Mean connectivity)text(sft$fitIndices, 1, sft$fitIndices, 5, labels=powers, cex=cex1,col= red) 軟閾值是WGCNA的算法中非常重要的一個環(huán)節(jié)，簡單的說硬閾值是一種一刀切的算法，比如高考分數(shù)500分能上一本，低于500就不行，軟閾值的話切起來比較柔和一些，會考慮你學校怎么樣，平時成績怎么樣之類。網(wǎng)盤中的數(shù)據(jù)跑起來其實是不太好的，沒有合適的軟閾值，這根線是要劃到0.9的?；蛘哂疫呥@張圖趨近于0像下面這個就是比較正常的。（/2535.html）這一步是為了算powers的值。一般來說，powers取紅線（0.9）左右的數(shù)字都是可以的。如果你天秤座特征比較明顯，你也可以運行下列代碼，讓程序推薦你一個值（本案例中返回值是NA，所以后面為了讓程序能夠進行下去，選了powers=14）。sft$powerEstimate 這個powers的值在后面的代碼中會一直用到，所以你在跑別的數(shù)據(jù)的時候一定要更改powers的數(shù)值。#一步法網(wǎng)絡(luò)構(gòu)建：One-step network construction and module detection#net = blockwiseModules(datExpr, power = 14, maxBlockSize = 6000, TOMType = unsigned, minModuleSize = 30, reassignThreshold = 0, mergeCutHeight = 0.25, numericLabels = TRUE, pamRespectsDendro = FALSE, saveTOMs = TRUE, saveTOMFileBase = AS-green-FPKM-TOM, verbose = 3)table(net$colors) 這一步就比較燒電腦了，關(guān)于網(wǎng)絡(luò)構(gòu)建的方法，分為一步法和多步法，一步法比較無腦，多步法能設(shè)置更多參數(shù)。想要了解多步法的請參考此文http:/tiramisutes.github.io/2016/09/14/WGCNA.html繼續(xù)說上面的代碼，結(jié)果跑出來如下圖結(jié)果是每個模塊中包含的基因數(shù)量。一般來說，結(jié)果包含十幾個到二十幾個模塊是比較正常的，此外一個模塊中的基因數(shù)量不宜過多，像我們這個結(jié)果里模塊1的基因數(shù)量達到了2651，這個就有點太多了，主要是因為我們powers=14，軟閾值太低了導(dǎo)致的。所以說上述軟閾值的篩選可以對我們的模塊分析起到微調(diào)的作用。#繪畫結(jié)果展示# open a graphics window#sizeGrWindow(12, 9)# Convert labels to colors for plottingmergedColors = labels2colors(net$colors) # Plot the dendrogram and the module colors underneathplotDendroAndColors(net$dendrograms 1, mergedColorsnet$blockGenes 1, Module colors, dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05) 由于我們的軟閾值比較低，所以這一結(jié)果中幾乎沒有g(shù)rey模塊，grey模塊中的基因是共表達分析時沒有被接受的基因，可以理解為一群散兵游勇。當然如果分析結(jié)果中g(shù)rey模塊中的基因數(shù)量比較多也是不太好的，表示樣本中的基因共表達趨勢不明顯，不同特征的樣本之間差異性不大，或者組內(nèi)基因表達一致性比較差。#結(jié)果保存#moduleLabels = net$colorsmoduleColors = labels2colors(net$colors)table(moduleColors)MEs = net$MEs;geneTree = net$dendrograms 1;save(MEs, moduleLabels, moduleColors, geneTree, file = AS-green-FPKM-02-networkConstruction-auto.RData) 這一步就是保存上面跑出來的結(jié)果了，同時哪個模塊有多少基因一目了然。#表型與模塊相關(guān)性#moduleLabelsAutomatic = net$colorsmoduleColorsAutomatic = labels2colors(moduleLabelsAutomatic)moduleColorsWW = moduleColorsAutomaticMEs0 = moduleEigengenes(datExpr, moduleColorsWW)$eigengenesMEsWW = orderMEs(MEs0)modTraitCor = cor(MEsWW, samples, use = p)colnames(MEsWW)modlues=MEsWWmodTraitP = corPvalueStudent(modTraitCor, nSamples)textMatrix = paste(signif(modTraitCor, 2), n(, signif(modTraitP, 1), ), sep = )dim(textMatrix) = dim(modTraitCor)labeledHeatmap(Matrix = modTraitCor, xLabels = colnames(samples), yLabels = names(MEsWW), cex.lab = 0.9, yColorWidth= 0.01, xColorWidth = 0.03, ySymbols = colnames(modlues), colorLabels = FALSE, colors = blueWhiteRed( 50), textMatrix = textMatrix, setStdMargins = FALSE, cex.text = 0.5, zlim = c(- 1, 1) , main = paste( Module-trait relationships) 這一步和網(wǎng)上的很多代碼會有一些區(qū)別，有些代碼會在這一環(huán)節(jié)構(gòu)建樣本特征的矩陣，而我們在最開始導(dǎo)入數(shù)據(jù)的時候，就是導(dǎo)入完整的樣本特征矩陣，在這里直接讀取就好了。（本人技術(shù)比較菜，所以在可視化的情況構(gòu)建好矩陣，用代碼憑空做一個矩陣，腦子不太夠用）cex.lab可以更改X軸Y軸label字體的大小，cex.text可以更改熱圖中字體的大小，colors可以改變顏色。如果出現(xiàn)下述問題：請將繪圖窗口最大化，然后最小化收起來：樣本特征和共表達模塊的相關(guān)性熱圖中，grey模塊中的相關(guān)性應(yīng)該很小，如果你與樣本特征相關(guān)性最顯著的模塊是grey模塊，那肯定是有問題的，畢竟grey模塊中的基因是一群散兵游勇，它們的表達在各個樣本中雜亂無章，根本說明不了問題。#導(dǎo)出網(wǎng)絡(luò)到Cytoscape# Recalculate topological overlap if neededTOM = TOMsimilarityFromExpr(datExpr, power = 14); # Read in the annotation file# annot = read.csv(file = GeneAnnotation.csv);# Select modules需要修改，選擇需要導(dǎo)出的模塊顏色modules = c( lightgreen); # Select module probes選擇模塊探測probes = names(datExpr)inModule = is.finite(match(moduleColors, modules);modProbes = probesinModule; #modGenes = annot$gene_symbolmatch(modProbes, annot$substanceBXH);# Select the corresponding Topological OverlapmodTOM = TOMinModule, inModule;dimnames(modTOM) = list(modProbes, modProbes) # Export the network into edge and node list files Cytoscape can readcyt = exportNetworkToCytoscape(modTOM, edgeFile = paste( AS-green-FPKM-One-step-CytoscapeInput-edges-, paste(modules, collapse= -), .txt, sep= ), nodeFile = paste( AS-green-FPKM-One-step-CytoscapeInput-nodes-, paste(modules, collapse= -), .txt, sep= ), weighted = TRUE, threshold = 0.02, nodeNames = modProbes, #altNodeNames = modGenes,nodeAttr = moduleColorsinModule); 這一步就是把選定的模塊中的基因?qū)С鰜恚Y(jié)果包含edges和nodes的信息。導(dǎo)出不同模塊的基因只需要改變modules = c(模塊顏色名)即可，輸出多個模塊的信息時，從該行代碼運行即可，前面一行的運算量很大。edges文件很大，試想一個模塊中有500個基因，幾乎兩兩基因之間都有關(guān)系，那就有上萬條信息，構(gòu)建出來的網(wǎng)絡(luò)肯定密密麻麻的用不了。這里處理辦法有兩種：1、取Weight值前多少的作用關(guān)系；2、選定seed基因，比如某個lncRNA或者已知與表型具有密切關(guān)聯(lián)的基因，構(gòu)建與該基因有關(guān)的共表達網(wǎng)絡(luò)(Time-Course Analysis of Brain Regional Expression Network Responses to Chronic Intermittent Ethanol and Withdrawal: Implications for Mechanisms Underlying Excessive Ethanol Consumption)# 可視化基因網(wǎng)絡(luò)# # Calculate topological overlap anew: this could be done more efficiently by saving the TOM# calculated during module detection, but let us do it again here.dissTOM = 1-TOMsimilarityFromExpr(datExpr, power = 14); # Transform dissTOM with a power to make moderately strong connections more visible in the heatmapplotTOM = dissTOM 7; # Set diagonal to NA for a nicer plotdiag(plotTOM) = NA; # Call the plot function#sizeGrWindow(9,9)TOMplot(plotTOM, geneTree, moduleColors, main = Network heatmap plot, all genes) #隨便選取1000個基因來可視化nSelect = 1000# For reproducibility, we set the random seedset.seed( 10);select = sample(nGenes, size = nSelect);selectTOM = dissTOMselect, select; # Theres no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.selectTree = hclust( as.dist(selectTOM), method = average)selectColors = moduleColorsselect; # Ope

人人文庫> 全部分類> 應(yīng)用文書 > 事務(wù)文書

溫馨提示

1. 本站所有資源如無特殊說明，都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
2. 本站的文檔不包含任何第三方提供的附件圖紙等，如果需要附件，請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
3. 本站RAR壓縮包中若帶圖紙，網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽，若沒有圖紙預(yù)覽就沒有圖紙。
4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
5. 人人文庫網(wǎng)僅提供信息存儲空間，僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理，對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯，并不能對任何下載內(nèi)容負責。
6. 下載文件中如有侵權(quán)或不適當內(nèi)容，請與我們聯(lián)系，我們立即糾正。
7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

WGCNA新手入門筆記2(含代碼和數(shù)據(jù))

文檔簡介

溫馨提示

最新文檔

評論

WGCNA新手入門筆記2(含代碼和數(shù)據(jù))

文檔簡介

溫馨提示

最新文檔

評論

相關(guān)文檔