「用一個(gè)更復(fù)雜的例子,來(lái)深入學(xué)習(xí)DESeq2差異表達(dá)分析后的小分析」

這篇文章,對(duì)Griffith Lab的DESeq2分析流程做一個(gè)解讀。

理解數(shù)據(jù)

Griffith Lab所使用的基因表達(dá)量矩陣總共包含了54個(gè)sample,這些sample可以劃分為1)normal,2)primary tumor以及3)colorectal cancer metastatic in the liver

從差異分析之后開始

獲取差異表達(dá)分析的結(jié)果

在使用DESeq()函數(shù)完成差異表達(dá)分析之后(此處還是DESeq對(duì)象),獲取其分析結(jié)果,需要用到函數(shù)results()。

同時(shí),想要提取對(duì)應(yīng)組合差異表達(dá)分析的結(jié)果,需要用到contrast=c()參數(shù),

Note:

  • contrast()的輸入為3個(gè)字符串向量,1)colData中包含樣本分類信息的列,2)分組1(log2FoldChange的分子),3)分組2(log2FoldChange的分母)。

  • 可以使用summary()對(duì)創(chuàng)建的DESeq對(duì)象進(jìn)行一些參數(shù)統(tǒng)計(jì),e.g.

    上調(diào)基因的數(shù)量

    下調(diào)基因的數(shù)量

    離群gene占比

    被判定為low counts的gene占比

MA plot怎么畫?

以下面這張圖為例,

  • x軸為,gene標(biāo)準(zhǔn)化后對(duì)應(yīng)的read counts

  • y軸為,2組實(shí)驗(yàn)條件下gene表達(dá)量的倍數(shù),以log_{2}形式呈現(xiàn)

    log2FoldChange>0,代表在分組1中為上調(diào),分組2中為下調(diào)

    log2FoldChange<0,代表在分組2中為下調(diào),分組2中為上調(diào)

這張MA plot中的red dot代表表達(dá)量呈現(xiàn)顯著差異的gene,grey dot代表表達(dá)量沒有呈現(xiàn)顯著差異的gene。

Note:log_{2}(正常形式的倍數(shù))=this \,\,value

用DESeq2自帶代碼畫MA plot

繪圖代碼如下,

# 直接畫就行
plotMA(deseq2Results)

用ggplot2畫MA plot

1、使用ggplot2仿著plotMA

繪圖結(jié)果如下,

繪圖代碼如下,

library(ggplot2)
library(scales) # needed for oob parameter
library(viridis)

# 1.將DESeq結(jié)果轉(zhuǎn)換為dataframe
deseq2ResDF <- as.data.frame(deseq2Results)
# 2. 對(duì)gene添加“是否呈現(xiàn)顯著差異表達(dá)”的標(biāo)簽
deseq2ResDF$significant <- ifelse(deseq2ResDF$padj < .1, "Significant", NA)
# 以baseMean作為x,log2FoldChange作為y
# 以significant作為分類變量
ggplot(deseq2ResDF, aes(baseMean, log2FoldChange, colour=significant)) + 
    geom_point(size=1) + 
    scale_y_continuous(limits=c(-3, 3), oob=squish) + 
    scale_x_log10() + 
    geom_hline(yintercept = 0, colour="tomato1", size=2) + 
    labs(x="mean of normalized counts", y="log fold change") + 
    scale_colour_manual(name="q-value", values=("Significant"="red"), na.value="grey50") +
    theme_bw()

在繪圖細(xì)節(jié)上,有幾個(gè)需要注意的地方,

  • 導(dǎo)入library(scales)之后,在scale_y_continuous()處添加的oob=squish參數(shù),不會(huì)將超出所設(shè)置的y軸范圍之外的點(diǎn)排除(即轉(zhuǎn)變?yōu)镹A),而是將它們“擠”在邊緣
  • 設(shè)置scale_x_log10(),將baseMean轉(zhuǎn)換為以10為底的對(duì)數(shù)
2、ggplot2高級(jí)版

繪圖結(jié)果如下,


繪圖代碼如下,

library(viridis)
ggplot(deseq2ResDF, aes(baseMean, log2FoldChange, colour=padj)) + 
    geom_point(size=1) + 
    scale_y_continuous(limits=c(-3, 3), oob=squish) + 
    scale_x_log10() + 
    geom_hline(yintercept = 0, colour="darkorchid4", size=1, linetype="longdash") + 
    labs(x="mean of normalized counts", y="log fold change") + 
    scale_colour_viridis(direction=-1, trans='sqrt') + 
    theme_bw() + 
    geom_density_2d(colour="black", size=2)

增添了2個(gè)函數(shù),

  • scale_colour_viridis(),針對(duì)的是映射函數(shù)中設(shè)置的colour=選項(xiàng)。

    此處將padj,即FDR矯正過(guò)后的p-value,呈現(xiàn)為一個(gè)梯度變化的可視化結(jié)果(從紫到黃,即代表了從非常不顯著到極顯著)

    direction=設(shè)置顏色的梯度走向,若設(shè)置為1,與上述結(jié)果呈現(xiàn)相反的走向

單個(gè)基因的表達(dá)量變化怎么畫?

繪圖部分

先使用plotCounts()函數(shù),將對(duì)應(yīng)gene的表達(dá)量數(shù)據(jù)提取出來(lái),

otop2Counts <- plotCounts(deseq2Data, 
                          gene="ENSG00000183034", 
                          intgroup=c("tissueType", "individualID"), 
                          returnData=TRUE)

需要給定intgroup=c()指定參數(shù),

  • 第一個(gè)參數(shù)為colData中的樣本分類信息列名(e.g. 此處的為tissueType,還有很多其他的例子,使用的是condition)
  • 第二個(gè)參數(shù)為樣本 ID信息

繪圖代碼如下,

colourPallette <- c("#7145cd","#bbcfc4","#90de4a","#cd46c1","#77dd8e","#592b79","#d7c847","#6378c9","#619a3c","#d44473","#63cfb6","#dd5d36","#5db2ce","#8d3b28","#b1a4cb","#af8439","#c679c0","#4e703f","#753148","#cac88e","#352b48","#cd8d88","#463d25","#556f73")

ggplot(otop2Counts, aes(x=tissueType, y=count, colour=individualID, group=individualID)) + 
    geom_point() + 
    geom_line() + 
    theme_bw() + 
    theme(axis.text.x=element_text(angle=15, hjust=1)) + 
    scale_colour_manual(values=colourPallette) +  # 將已經(jīng)調(diào)好的顏色傳遞給scale_colour_manual()
    guides(colour=guide_legend(ncol=3)) + 
    ggtitle("OTOP2")

繪圖結(jié)果如下,

「這個(gè)圖好看的點(diǎn),我自己覺得有兩個(gè),設(shè)置了2個(gè)分組,讓結(jié)果非常清晰」

驗(yàn)證read counts部分

deseq2ResDF["ENSG00000183034",]
rawCounts["ENSG00000183034",]
normals=row.names(sampleData[sampleData[,"tissueType"]=="normal-looking surrounding colonic epithelium",])
primaries=row.names(sampleData[sampleData[,"tissueType"]=="primary colorectal cancer",])
# 在normal組織中的raw counts
rawCounts["ENSG00000183034",normals]
# 在primary tumor中的raw counts
rawCounts["ENSG00000183034",primaries]

熱圖怎么畫?

基礎(chǔ)熱圖

繪制代碼如下,

# 將DESeq對(duì)象進(jìn)行variance stablilizing transform
deseq2VST <- vst(deseq2Data)

# 再將結(jié)果轉(zhuǎn)換為dataframe
deseq2VST <- assay(deseq2VST)
deseq2VST <- as.data.frame(deseq2VST)
# 獲取gene ID
deseq2VST$Gene <- rownames(deseq2VST)
# head(deseq2VST)

Note:assay()的作用?這邊就不展開將了,先給自己挖一個(gè)坑,以后會(huì)寫一篇關(guān)于這個(gè)的實(shí)戰(zhàn)。

學(xué)習(xí)資料先放在這里:https://bioconductor.org/help/course-materials/2019/BSS2019/04_Practical_CoreApproachesInBioconductor.html

# 只保留log2FoldChaneg > 3的差異表達(dá)基因
sigGenes <- rownames(deseq2ResDF[deseq2ResDF$padj <= .05 & abs(deseq2ResDF$log2FoldChange) > 3,])
deseq2VST <- deseq2VST[deseq2VST$Gene %in% sigGenes,]

# 數(shù)據(jù)類型轉(zhuǎn)換:width to long
library(reshape2)
deseq2VST_wide <- deseq2VST
deseq2VST_long <- melt(deseq2VST, id.vars=c("Gene"))

# 覆蓋原始VST數(shù)據(jù)
deseq2VST <- melt(deseq2VST, id.vars=c("Gene"))

# 繪制熱圖
# 以樣本作為x,gene ID作為y,填充值映射到fill
heatmap <- ggplot(deseq2VST, aes(x=variable, y=Gene, fill=value)) + 
    geom_raster() + 
    scale_fill_viridis(trans="sqrt") +      
    theme(axis.text.x=element_text(angle=65, hjust=1), 
          axis.text.y=element_blank(),
          axis.ticks.y=element_blank())
# heatmap

使用的函數(shù)為

添加聚類樹的熱圖

# 將長(zhǎng)數(shù)據(jù)類型轉(zhuǎn)換為寬數(shù)據(jù)類型
deseq2VSTMatrix <- dcast(deseq2VST, Gene ~ variable)
rownames(deseq2VSTMatrix) <- deseq2VSTMatrix$Gene
deseq2VSTMatrix$Gene <- NULL

# 計(jì)算距離,沒印象的看看多元統(tǒng)計(jì)
# 以下分別基于gene和sample進(jìn)行了距離計(jì)算
distanceGene <- dist(deseq2VSTMatrix)
distanceSample <- dist(t(deseq2VSTMatrix))  # 轉(zhuǎn)置后再計(jì)算

# 根據(jù)上一步計(jì)算得到的距離進(jìn)行聚類分析
clusterGene <- hclust(distanceGene, method="average")
clusterSample <- hclust(distanceSample, method="average")

# 構(gòu)建基于sample的聚類樹
# install.packages("ggdendro")
library(ggdendro)
sampleModel <- as.dendrogram(clusterSample)
sampleDendrogramData <- segment(dendro_data(sampleModel, 
                                            type = "rectangle"))
sampleDendrogram <- ggplot(sampleDendrogramData) + 
    geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) + 
    theme_dendro()

# 將sample聚類結(jié)果添加到原始數(shù)據(jù)集中
deseq2VST$variable <- factor(deseq2VST$variable, 
                  levels=clusterSample$labels[clusterSample$order])

# 繪制聚類樹
library(gridExtra)
grid.arrange(sampleDendrogram, heatmap, ncol=1, heights=c(1,5))

需要額外用到的幾個(gè)R packages,

  • ggdendro
  • gridExtra

繪圖結(jié)果如下,


但是上述添加到原始圖層上的樹,和heatmap中sample所對(duì)應(yīng)的位置實(shí)際上是不一樣的,

精修代碼如下,

library(gtable)
library(grid)

# Modify the ggplot objects
sampleDendrogram_1 <- sampleDendrogram + scale_x_continuous(expand=c(.0085, .0085)) + scale_y_continuous(expand=c(0, 0))
heatmap_1 <- heatmap + scale_x_discrete(expand=c(0, 0)) + scale_y_discrete(expand=c(0, 0))

# Convert both grid based objects to grobs
sampleDendrogramGrob <- ggplotGrob(sampleDendrogram_1)
heatmapGrob <- ggplotGrob(heatmap_1)

# Check the widths of each grob
sampleDendrogramGrob$widths
heatmapGrob$widths

# Add in the missing columns
sampleDendrogramGrob <- gtable_add_cols(sampleDendrogramGrob, heatmapGrob$widths[7], 6)
sampleDendrogramGrob <- gtable_add_cols(sampleDendrogramGrob, heatmapGrob$widths[8], 7)

# Make sure every width between the two grobs is the same
maxWidth <- unit.pmax(sampleDendrogramGrob$widths, heatmapGrob$widths)
sampleDendrogramGrob$widths <- as.list(maxWidth)
heatmapGrob$widths <- as.list(maxWidth)

# Arrange the grobs into a plot
finalGrob <- arrangeGrob(sampleDendrogramGrob, heatmapGrob, ncol=1, heights=c(2,5))

# Draw the plot
grid.draw(finalGrob)



# Re-order the sample data to match the clustering we did
sampleData_v2$Run <- factor(sampleData_v2$Run, levels=clusterSample$labels[clusterSample$order])

# Construct a plot to show the clinical data
colours <- c("#743B8B", "#8B743B", "#8B3B52")
sampleClinical <- ggplot(sampleData_v2, aes(x=Run, y=1, fill=Sample.Characteristic.biopsy.site.)) + 
    geom_tile() + 
    scale_x_discrete(expand=c(0, 0)) + 
    scale_y_discrete(expand=c(0, 0)) + 
    scale_fill_manual(name="Tissue", values=colours) +
    theme_void()

# Convert the clinical plot to a grob
sampleClinicalGrob <- ggplotGrob(sampleClinical)

# Make sure every width between all grobs is the same
maxWidth <- unit.pmax(sampleDendrogramGrob$widths, heatmapGrob$widths, sampleClinicalGrob$widths)
sampleDendrogramGrob$widths <- as.list(maxWidth)
heatmapGrob$widths <- as.list(maxWidth)
sampleClinicalGrob$widths <- as.list(maxWidth)

# Arrange and output the final plot
finalGrob <- arrangeGrob(sampleDendrogramGrob, sampleClinicalGrob, heatmapGrob, ncol=1, heights=c(2,1,5))
grid.draw(finalGrob)

參考資料

[1] https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容