經(jīng)典的轉(zhuǎn)錄組差異分析通常會使用到三個(gè)工具limma/voom, edgeR和DESeq2。今天我們就通過一個(gè)小規(guī)模的轉(zhuǎn)錄組測序數(shù)據(jù)來演示DESeq2的簡單流程。

? 文中使用的數(shù)據(jù)來自Standford 大學(xué)的一個(gè)擬南芥的small RNA-seq數(shù)據(jù)(https://bios221.stanford.edu/data/mobData.RData)
該數(shù)據(jù)包含6個(gè)樣本:SL236,SL260,SL237,SL238,SL239,SL240, 并分成了三組,分別是:? MM="triple mutatnt shoot grafted onto triple mutant root"
? WM="wild-type shoot grafted onto triple mutant root"
? WW="wild-type shoot grafted onto wild-type root"
簡而言之,
WW組可以認(rèn)為是實(shí)驗(yàn)的對照組,而MM和WM則是兩個(gè)處理組。
對于DESeq2的分析流程而言,我們需要輸入的數(shù)據(jù)包括:
- 表達(dá)矩陣(
countData) - 分組信息(
colData) - 擬合信息(
design):指明如何根據(jù)樣本的分組進(jìn)行建模
? 下面就以mobData 中的數(shù)據(jù)為例簡單介紹DESeq2 的分析流程
載入數(shù)據(jù)及生成DESeqDataSet
? 由于mobData 中的行名沒有提供基因的ID,我們也不是為了探究生物學(xué)問題,就以mobData 的行數(shù)作為其ID
library(DESeq2)
library(dplyr)
load("data/mobData.RData")
head(mobData)
## SL236 SL260 SL237 SL238 SL239 SL240
## [1,] 21 52 4 4 86 68
## [2,] 18 21 1 5 1 1
## [3,] 1 2 2 3 0 0
## [4,] 68 87 270 184 396 368
## [5,] 68 87 270 183 396 368
## [6,] 1 0 6 10 6 12
dim(mobData)
## 3000 6
row.names(mobData) <- as.character(c(1:dim(mobData)[1]))
mobGroups <- c("MM", "MM", "WM", "WM", "WW", "WW")
colData <- data.frame(row.names=colnames(mobData),
group_list=mobGroups)
dds <- DESeqDataSetFromMatrix(countData = mobData,
colData = colData,
design = ~group_list,
tidy = F)
dds
## class: DESeqDataSet
## dim: 3000 6
## metadata(1): version
## assays(1): counts
## rownames(3000): 1 2 ... 2999 3000
## rowData names(0):
## colnames(6): SL236 SL260 ... SL239 SL240
## colData names(1): group_list
DESeqDataSet 是DESeq2流程中儲存read counts和中間統(tǒng)計(jì)分析數(shù)據(jù)的對象,之后的分析都建立在該對象之上進(jìn)行。
DESeqDataSet可以通過以下四種方法產(chǎn)生:
- From transcript abundance files and tximport ——
DESeqDataSetFromTximport- From a count matrix ——
DESeqDataSetFromMatrix- From htseq-count files ——
DESeqDataSetFromHTseqCount- From a SummarizedExperiment object ——
DESeqDataSet
預(yù)處理
? 在進(jìn)行差異分析之前,需要對樣本數(shù)據(jù)的表達(dá)矩陣進(jìn)行預(yù)處理,包括:
- 去除低表達(dá)量基因
- 探索樣本分組信息 -- 有助于挖掘潛在的差異樣本
table(rowSums(counts(dds)==0))
##
## 0 1 2 3 4 5 6
## 1833 442 324 170 92 84 55
keep <- rowSums(counts(dds)==0) < 6
dds.keep <- dds[keep, ]
dim(dds.keep)
## [1] 2945 6
library(factoextra)
library(FactoMineR)
count.pca <- counts(dds.keep) %>% as.matrix %>% t %>% PCA(.,graph = F)
fviz_pca_ind(count.pca,
col.ind = mobGroups,
addEllipses = F,
mean.point = F,
legend.title = "Groups")

colnames(dds.keep) <- c(paste(colnames(dds.keep),mobGroups, sep ="."))
counts(dds.keep) %>% as.matrix %>% t %>% dist %>% hclust %>% plot
colnames(dds.keep) <- colnames(dds)

? 通過PCA結(jié)果來看各組樣本分組情況還是不錯的,但hclust的聚類結(jié)果反映的分組就略微有點(diǎn)混雜了,可能要聚類計(jì)算的距離函數(shù)選用不當(dāng)有關(guān)。
差異分析
使用DESeq()函數(shù)進(jìn)行差異分析時(shí),該函數(shù)干了以下三件事:
- 計(jì)算文庫大小,預(yù)測用于標(biāo)準(zhǔn)化的size factor
- 預(yù)測離散度(dispersions)
- 使用負(fù)二項(xiàng)分布的廣義線性回歸模型(GLM)進(jìn)行擬合,以及進(jìn)行Wald test
The Wald test (also called the Wald Chi-Squared Test) is a way to find out if explanatory variables in a model are significant.
degobj <- DESeq(dds.keep);degobj
## class: DESeqDataSet
## dim: 2945 6
## metadata(1): version
## assays(4): counts mu H cooks
## rownames(2945): 1 2 ... 2999 3000
## rowData names(26): baseMean baseVar ... deviance maxCooks
## colnames(6): SL236 SL260 ... SL239 SL240
## colData names(2): group_list sizeFactor
countdds <- counts(degobj, normalized = T)
deg.res1 <- results(degobj, contrast=c("group_list","MM","WW"), pAdjustMethod = 'fdr')
? counts() 可以提取DESeq object中的表達(dá)矩陣,而results() 可以提取差異分析的結(jié)果,其中包括了:
樣本間的均值, log2 fold changes, standard errors, test statistics, p-values and adjusted p-values.
? 使用results()函數(shù)時(shí)需要指明進(jìn)行比較的樣本,這里用 contrast=c("group_list","MM","WW") 提取MM組和WW組進(jìn)行差異分析的結(jié)果。如果想要比較WM組和WW組,只要改變contrast=c("group_list","WM","WW") 即可。
# result default
results(object, contrast, name, lfcThreshold = 0,
altHypothesis = c("greaterAbs", "lessAbs", "greater", "less"),
listValues = c(1, -1), cooksCutoff, independentFiltering = TRUE,
alpha = 0.1, filter, theta, pAdjustMethod = "BH", filterFun,
format = c("DataFrame", "GRanges", "GRangesList"), test,
addMLE = FALSE, tidy = FALSE, parallel = FALSE,
BPPARAM = bpparam(), minmu = 0.5)
? 檢查結(jié)果中是否包含NA值
dim(deg.res1)
## [1] 2945 6
apply(deg.res1,2, function(x) sum(is.na(x)))
## baseMean log2FoldChange lfcSE stat pvalue
## 0 0 0 0 0
## padj
## 1142
? 這里padj中有1142個(gè)NA值是因?yàn)槭褂?code>results()提取差異分析結(jié)果時(shí),大于alpha值(這里是0.1)的矯正后p-value都會被當(dāng)做是NA。因此,我們將這些padj值都設(shè)為1
deg.res1$padj[is.na(deg.res1$padj)] <- 1
table(is.na(deg.res1$padj))
##
## FALSE
## 2945
summary(deg.res1)
##
## out of 2945 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 102, 3.5%
## LFC < 0 (down) : 215, 7.3%
## outliers [1] : 0, 0%
## low counts [2] : 0, 0%
## (mean count < 4)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
plotMA(deg.res1)

排序后以log2FoldChange絕對值大于1, padj小于0.05為條件篩選顯著的差異表達(dá)基因
deg.ord1 <- deg.res1[order(deg.res1$padj, decreasing = F),]
deg.sig1 <- subset(deg.ord1,abs(deg.ord1$log2FoldChange)>1 & deg.ord1$padj<0.05)
deg.sig1
## log2 fold change (MLE): group_list MM vs WW
## Wald test p-value: group_list MM vs WW
## DataFrame with 217 rows and 6 columns
## baseMean log2FoldChange lfcSE stat
## <numeric> <numeric> <numeric> <numeric>
## 2500 159.774533763595 -4.11069945121082 0.432943287486228 -9.49477580557611
## 1212 112.711030407211 3.96456274705605 0.449008108800994 8.82960166942819
## 1917 303.58162963147 3.45479344900962 0.406885893259752 8.49081648255649
## 490 71.687159921068 -6.18926587678444 0.753506435554632 -8.21395224345857
## 1317 222.718984861955 -3.03292643435534 0.37190983516799 -8.15500464779421
## ... ... ... ... ...
## 1792 5.98829479907705 -4.93244157414254 1.77616534224116 -2.77701712607386
## 98 6699.69196582045 1.19245883058489 0.429894831042278 2.7738384936934
## 295 6699.58197403621 1.19259109051591 0.429948214187822 2.77380170718637
## 662 5.15806783175131 -5.09298667756621 1.83903942690847 -2.76937329512713
## 899 18.1185589826749 -3.43948745771197 1.24400277511213 -2.76485513257955
## pvalue padj
## <numeric> <numeric>
## 2500 2.20685744092375e-21 3.97896396598552e-18
## 1212 1.05049159823894e-18 9.47018175812401e-16
## 1917 2.05190369014182e-17 1.23319411777524e-14
## 490 2.14024730609951e-16 9.64716473224354e-14
## 1317 3.4916644951535e-16 1.25909421695235e-13
## ... ... ...
## 1792 0.00548602882496584 0.046221074632773
## 98 0.00553991737467495 0.0462481504691228
## 295 0.0055405438166004 0.0462481504691228
## 662 0.00561642447454876 0.0466654992055826
## 899 0.00569480805304419 0.0468846526010898
? 至此,便篩選出了217個(gè)在MM組和WW組之間的顯著差異表達(dá)基因。至于后續(xù)的可視化分析則是因課題而異了,等以后有空了再補(bǔ)坑吧!
? 有同學(xué)可能注意到,雖然我們的樣本有多個(gè)組,但在差異分析時(shí)進(jìn)行的還是pairwise的分析,為什么我們不可以三個(gè)組一起分析呢?學(xué)過ANOVA的同學(xué)應(yīng)該都知道,ANOVA就是可以應(yīng)對這種多組差異分析的情況。但要注意的是,ANOVA只可以告訴我們對于某個(gè)基因在這三組中是否存在差異,想要找出是哪一組有其他組別有差異還是需要進(jìn)行pairwise t-test之類的分析。所以,在這里我們兩組兩組地進(jìn)行分析正是出于這個(gè)考慮,并且有更方便我們解釋差異分析的結(jié)果,說明在A組基因的表達(dá)量相對于B組的是上調(diào)還是下調(diào)。另外,本文的差異分析還是處于單因子水平(只有一個(gè)變量),至于多因子的差異分析以后研究透了再和大家進(jìn)行分享。
一點(diǎn)想法:
? 最近想要整理這些轉(zhuǎn)錄組常規(guī)分析流程,一來是因?yàn)橹皩W(xué)習(xí)的時(shí)候很零碎,想借此機(jī)會系統(tǒng)性地整理這方面的知識。另一方面也是因?yàn)槲腋杏X平時(shí)做濕實(shí)驗(yàn)的時(shí)候都講究個(gè)protocol,一些基本的實(shí)驗(yàn),像PCR這種我們只要根據(jù)具體實(shí)驗(yàn)調(diào)整一些參數(shù)就都可以跟著protocol來做了。所以,一些經(jīng)典的生信分析也可以有一個(gè)這樣的“protocol”,要用的時(shí)候調(diào)整參數(shù)就可以了。
參考文章:
- https://web.stanford.edu/class/bios221/labs/rnaseq/lab_4_rnaseq.html
- https://www.researchgate.net/post/How_to_interpret_results_of_DESeq2_with_more_than_two_experimental_groups
- https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#htseq
- https://bioconductor.org/packages/release/bioc/manuals/DESeq2/man/DESeq2.pdf
完。
