《生物信息學(xué)生R入門教程》讀書筆記 Chapter 7

這一章來(lái)介紹NGS的下游分析

富集分析

簡(jiǎn)單的講,富集分析就是一種統(tǒng)計(jì)分析的手段,用來(lái)篩選功能相類的一組基因是否富集中差異表達(dá)的基因中。當(dāng)人們拿到了差異表達(dá)的基因時(shí),很多時(shí)候因?yàn)椴町惐磉_(dá)的基因數(shù)量很多,面對(duì)這么多的基因,人們不知道如何找到合適的突破口進(jìn)行下游的驗(yàn)證實(shí)驗(yàn)。于是從一堆差異表達(dá)的基因中找出有意義的基因進(jìn)行RT-PCR驗(yàn)證以及基因敲除驗(yàn)證就需要使用到富集分析了

選擇正確的庫(kù)(library, 或者說(shuō)是universe gene id)。正確的庫(kù)的大小的選擇決定著分析的結(jié)果是否可靠

1.GO富集

示例代碼:

library(clusterProfiler)
#加載人類的庫(kù)
library(org.Hs.eg.db)
data(geneList, package="DOSE")
#篩選genelist
gene <- names(geneList)[abs(geneList) > 2]
#轉(zhuǎn)換geneID
gene.df <- bitr(gene, fromType = "ENTREZID",
        toType = c("ENSEMBL", "SYMBOL"),
        OrgDb = org.Hs.eg.db)
#功能富集
ego <- enrichGO(gene          = gene,
                OrgDb         = org.Hs.eg.db,
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                readable      = TRUE)

2.KEGG

#kegg富集
kk <- enrichKEGG(gene         = gene,
                 organism     = 'hsa',
                 pvalueCutoff = 0.05)

熱圖分析

熱圖(heatmap)。熱圖是一種三維數(shù)據(jù)轉(zhuǎn)二維的表達(dá)手段,它通過(guò)使用不同的顏色來(lái)表達(dá)本應(yīng)該在第三維上顯示的數(shù)據(jù)。熱圖的方便的于,它可以在一張二維圖像上快速的通過(guò)顏色的差異來(lái)了解組內(nèi)的一致性及差異組間的差別
熱圖的軟件包有很多,常用的有pheatmap, ggplot2以及ComplexHeatmap

以ComplexHeatmap:

## 首先, 我們先生成一個(gè)隨機(jī)數(shù)據(jù)。
set.seed(1)
mat <- matrix(runif(36), nrow=6, 
              dimnames=list(paste("row", letters[1:6], sep="_"), 
                            paste("col", LETTERS[1:6], sep="_")))
library(ComplexHeatmap)

當(dāng)然ComplexHeatmap還有一些高級(jí)用法
生成兩個(gè)熱圖,并對(duì)行列進(jìn)行注釋:

ha_column <- 
  HeatmapAnnotation(
    df = data.frame(## data.frame, 行數(shù)與mat的列數(shù)一致。
      type1 = sample(c("a", "b"), 
                     ncol(mat), 
                     replace = TRUE)),
    col = list(type1 = c("a" = "red", "b" = "blue")) 
    ## 顏色list, 名字與df中的列字一致,
    ##           元素名與df列中的出現(xiàn)的元素一致。
)
ha_row <- rowAnnotation( ## 對(duì)行進(jìn)行注釋
  df = data.frame(type2 = sample(c("A", "B"),
                                 nrow(mat), 
                                 replace = TRUE)),
  col = list(type2 = c("A" = "green", "B" = "cyan")),
  width = unit(1, "cm") ##指定寬度,必須是unit,參見grid::unit
)

## 生成兩個(gè)Heatmap實(shí)例。這里我們使用了相同的數(shù)據(jù)。
## 在實(shí)際應(yīng)用中,它應(yīng)該是行數(shù)一致的不同數(shù)據(jù)。
ht1 <- Heatmap(mat, name = "ht1", #name將出現(xiàn)的圖例中
               column_title = "Heatmap 1", #title將出現(xiàn)在標(biāo)題中
               top_annotation = ha_column) #ha_column將出現(xiàn)在熱圖上方(也可以是下方)
ht2 <- Heatmap(mat, name = "ht2", 
               column_title = "Heatmap 2")
ht_list <- ht1 + ht2 + ha_row
draw(ht_list)
#把圖例調(diào)來(lái)左邊
draw(ht_list, heatmap_legend_side = "left", 
     show_annotation_legend = FALSE)

當(dāng)然利用HeatmapAnnotation()函數(shù)來(lái)進(jìn)行加工也可以,可以在熱圖的基礎(chǔ)上加點(diǎn)圖,條形圖
1.row_anno_points():點(diǎn)圖
2.row_anno_barplot():條形圖
3.row_anno_boxplot():箱線圖
4.row_anno_histogram():直方圖
5.row_anno_density():密度圖

ha_mix_top <- HeatmapAnnotation(
  points = anno_points(runif(ncol(mat))), ## points
  barplot = anno_barplot(runif(ncol(mat)), axis=TRUE), ## barplot
  histogram = anno_histogram(mat, 
                             gp = gpar(fill = 1:6)), ## histogram
  density_line = anno_density(mat, type = "line", 
                              gp = gpar(col = 1:6)), ## density
  violin = anno_density(mat, type = "violin", gp = gpar(fill = 6:1)), ## violin
  box = anno_boxplot(mat), ## boxplot
  heatmap = anno_density(mat, type = "heatmap"), ## heatmap
  annotation_height = unit(8, "cm") ## annotation height
)
Heatmap(mat, name = "mat", 
        top_annotation = ha_mix_top)

如果想只對(duì)個(gè)別的行或者列標(biāo)示名稱,其它的不標(biāo)注

mat = matrix(rnorm(10000), nr = 1000)
rownames(mat) = sprintf("%.2f", rowMeans(mat))
subset = sample(1000, 20)
labels = rownames(mat)[subset]
Heatmap(mat, show_row_names = FALSE, show_row_dend = FALSE, show_column_dend = FALSE) + 
rowAnnotation(link = row_anno_link(at = subset, labels = labels),
  width = unit(1, "cm") + max_text_width(labels))

突變分析

常用于TCGA分析數(shù)據(jù)
我們先構(gòu)建對(duì)象

mat <- read.table(textConnection(
",s1,s2,s3
g1,snv1;indel,snv1;snv2,indel
g2,,snv2;indel,snv1
g3,snv1,,indel;snv1;snv2"), 
row.names = 1, header = TRUE, 
sep = ",", stringsAsFactors = FALSE)
mat <- as.matrix(mat)
mat

數(shù)據(jù)結(jié)構(gòu)長(zhǎng)這樣


畫熱圖來(lái)查看各種突變的占比:

library(ComplexHeatmap)
col <- c(snv1 = "#008000", snv2 = "#008000", indel = "blue")
oncoPrint(mat, get_type = function(x) strsplit(x, ";")[[1]],
    alter_fun = list(
      indel = function(x, y, w, h) 
        grid.rect(x, y, w*0.9, h*0.9, 
                  gp = gpar(fill = col["indel"], 
                            col = NA)) , #must be draw first
        snv1 = function(x, y, w, h) 
          grid.rect(x, y-.15*h, w*0.9, h*0.2, 
                    gp = gpar(fill = col["snv1"], 
                              col = NA)),
        snv2 = function(x, y, w, h) 
          grid.rect(x, y+.15*h, w*0.9, h*0.2, 
                    gp = gpar(fill = col["snv2"], 
                              col = NA))
    ), col = col)

上面的例子有兩個(gè)重要的概念,一個(gè)是get_type,一個(gè)是alter_fun。兩者都接受一個(gè)函數(shù),其中g(shù)et_type是如何將mat中的每個(gè)單元如何分解成突變類型,alter_fun是如何對(duì)每一種突變類型進(jìn)行繪圖。 get_type接受一個(gè)參數(shù)x,這個(gè)x將會(huì)是mat中的每一個(gè)值。比如這里的mat[g1, s1]就是"snv1;indel"。而alter_fun將會(huì)是一個(gè)list of function,list中的每個(gè)元素名稱都將是可能出現(xiàn)在get_type得到的值中。list中的每一個(gè)函數(shù)有四個(gè)參數(shù),x,y,w,h,它們分別指heatmap中對(duì)應(yīng)mat中的每個(gè)單元格的x, y的坐標(biāo)以及單元格的高度和寬度。

接下來(lái)我們看下如何畫瀑布圖:

#建立對(duì)象
mat <- read.delim(file.path(system.file("extdata", "tcga_lung_adenocarcinoma_provisional_ras_raf_mek_jnk_signalling.txt",
                                       package = "ComplexHeatmap")), 
                 stringsAsFactors=FALSE, row.names=1)
mat[is.na(mat)] <- "  "
mat <-  mat[, -ncol(mat)]
mat = t(as.matrix(mat))

##假設(shè)有三種突變類型“MUT”,“AMP”,“HOMDEL”
#寫出類型
alter_fun <- list(
    background = function(x, y, w, h) {
        grid.rect(x, y, w-unit(0.5, "mm"), h-unit(0.5, "mm"), gp = gpar(fill = "#CCCCCC", col = NA))
    },
    HOMDEL = function(x, y, w, h) {
        grid.rect(x, y, w-unit(0.5, "mm"), h-unit(0.5, "mm"), gp = gpar(fill = "blue", col = NA))
    },
    AMP = function(x, y, w, h) {
        grid.rect(x, y, w-unit(0.5, "mm"), h-unit(0.5, "mm"), gp = gpar(fill = "red", col = NA))
    },
    MUT = function(x, y, w, h) {
        grid.rect(x, y, w-unit(0.5, "mm"), h*0.33, gp = gpar(fill = "#008000", col = NA))
    }
)
#定義顏色
col = c("MUT" = "#008000", "AMP" = "red", "HOMDEL" = "blue")
#讀取行名信息
sample_order <- scan(system.file("extdata", "sample_order.txt",
                                package = "ComplexHeatmap"), 
                    what = "character")
#構(gòu)建對(duì)象
ht <- oncoPrint(mat, get_type = function(x) strsplit(x, ";")[[1]],
                alter_fun = alter_fun, col = col, 
                row_order = NULL, column_order = sample_order,
                remove_empty_columns = TRUE,
                column_title = "OncoPrint for TCGA Lung Adenocarcinoma, genes in Ras Raf MEK JNK signalling",
                heatmap_legend_param = 
                  list(title = "Alternations", 
                       at = c("AMP", "HOMDEL", "MUT"), 
                       labels = c("Amplification", 
                                  "Deep deletion", 
                                  "Mutation"),
                       nrow = 1, title_position = "leftcenter"))
#畫圖
draw(ht, heatmap_legend_side = "bottom")


這里有另外一位大佬寫的,挺不錯(cuò)的,歡迎大家參閱
http://www.itdecent.cn/p/deb17d0bcd58

基因網(wǎng)絡(luò)

在實(shí)驗(yàn)中,我們常常會(huì)將RNAseq的結(jié)果和ChIPseq的結(jié)果結(jié)合起來(lái),以利于提高下游的驗(yàn)證率,在這個(gè)過(guò)程中,可以使用GeneNetworkBuilder來(lái)查找高驗(yàn)證率的目標(biāo)基因。GeneNetworkBuilder需要兩個(gè)實(shí)驗(yàn)輸入,一個(gè)是全基因 組的表達(dá)圖譜,一個(gè)是目標(biāo)蛋白的目標(biāo)基因。GeneNetworkbuilder還需要該物種的調(diào)控網(wǎng)絡(luò)數(shù)據(jù),該數(shù)據(jù)是為了聯(lián)接實(shí)驗(yàn)數(shù)據(jù)迷失的目標(biāo)基因。

library(GeneNetworkBuilder)
library(simpIntLists)
i <- findInteractionList("human", "Official")
i <- lapply(i, function(.ele) cbind(from=as.character(.ele$name), to=as.character(.ele$interactors)))
i <- do.call(rbind, i)
set.seed(123)
## generate a random ChIP-seq binding table
rootgene <- sample(i[, 1], 1)
TFbindingTable <- i[i[, 1] == rootgene, ]
interactionmap <- i
# build network
sifNetwork<-buildNetwork(TFbindingTable=TFbindingTable, 
                        interactionmap=interactionmap, level=2)
ID=unique(as.character(sifNetwork))
## create a random expression data
expressionData <- data.frame(ID=ID,
                             logFC=sample(-3:3, length(ID), replace=TRUE),
                             P.Value=runif(n=length(ID), max=0.25))
## filter network
cifNetwork<-filterNetwork(rootgene=rootgene, sifNetwork=sifNetwork, 
                    exprsData=expressionData, mergeBy="ID",
                    miRNAlist=character(0), 
                    tolerance=1, cutoffPVal=0.01, cutoffLFC=1)
## polish network
gR<-polishNetwork(cifNetwork)
## browse network
# browseNetwork(gR) # interactive htmlwidget; 
## or plot by Rgraphviz
library(Rgraphviz)
plotNetwork<-function(gR, layouttype="dot", ...){
    if(!is(gR,"graphNEL")) stop("gR must be a graphNEL object")
    if(!(GeneNetworkBuilder:::inList(layouttype, c("dot", "neato", "twopi", "circo", "fdp")))){
        stop("layouttype must be dot, neato, twopi, circo or fdp")
    }
    g1<-Rgraphviz::layoutGraph(gR, layoutType=layouttype, ...)
    nodeRenderInfo(g1)$col <- nodeRenderInfo(gR)$col
    nodeRenderInfo(g1)$fill <- nodeRenderInfo(gR)$fill
    renderGraph(g1)
}
plotNetwork(gR)

這個(gè)包的官方文檔:
http://www.bioconductor.org/packages/release/bioc/vignettes/GeneNetworkBuilder/inst/doc/GeneNetworkBuilder_vignettes.html

查看Chip-seq信號(hào)強(qiáng)度

當(dāng)拿到ChIPseq的結(jié)果后,我們可以使用眾多手段來(lái)查看reads的真實(shí)情況,比如說(shuō)使用IGV, UCSC genome browser等。但是,當(dāng)大家需要把這個(gè)track生成圖片發(fā)表時(shí),這些工具提供的圖片輸出有時(shí)候無(wú)法達(dá)到發(fā)表的要求,于是很多軟件包就因此而生。最常使用到的軟件包是Gviz。然而,我想大家也看出來(lái)了,本文夾私的現(xiàn)象非常嚴(yán)重,所以,我重點(diǎn)介紹一下trackViewer軟件包。
trackViewer軟件包不但可以輸出高品質(zhì)的track圖,還可以生成高顏值的lollipop plot。甚至,還在推廣一種名為Dandelion plot的表示SNP以及Indel數(shù)據(jù)的可視化圖形。

library(trackViewer)
gr <- parse2GRanges("chr3:108,476,000-108,485,000")
gr # interesting genomic locations

library(GenomicFeatures)# load GenomicFeatures can create TxDb from UCSC
if(interactive()){
    mm8KG <- makeTxDbFromUCSC(genome="mm8", tablename="knownGene")
    saveDb(mm8KG, "mm8KG.sqlite")
}else{## mm8KG was saved as sqlite file
    mm8KG <- loadDb("data/mm8KG.sqlite")
}
library(org.Mm.eg.db) # load annotation database

## create the gene model tracks information
trs <- geneModelFromTxdb(mm8KG, org.Mm.eg.db, gr=gr)
## import data from bedGraph/bigWig/BED ... files, see ?importScore for details
CLIP <- importScore("data/CLIP.bedGraph", format="bedGraph", ranges=gr)
#bedGraph文件格式
control <- importScore("data/control.bedGraph", format="bedGraph", ranges=gr)
knockdown <- importScore("data/knockdown.bedGraph", format="bedGraph", ranges=gr)
## create styles by preset theme
optSty <- optimizeStyle(trackList(trs, knockdown, control, CLIP), theme="col")
trackList <- optSty$tracks
viewerStyle <- optSty$style
## adjust the styles for this track
### rename the trackList for each track
names(trackList)[1:2] <- paste0("Sort1: ", names(trackList)[1:2])
names(trackList)[3] <- "RNA-seq TDP-43 KD"
names(trackList)[4] <- "RNA-seq control"
### change the lab positions for gene model track to bottomleft
setTrackStyleParam(trackList[[1]], "ylabpos", "bottomleft")
setTrackStyleParam(trackList[[2]], "ylabpos", "bottomleft")
### change the color of gene model track
setTrackStyleParam(trackList[[1]], "ylabgp", list(cex=1, col="red"))
setTrackStyleParam(trackList[[2]], "ylabgp", list(cex=1, col="green"))
### remove the xaxis
setTrackViewerStyleParam(viewerStyle, "xaxis", FALSE)
### add a scale bar in CLIP track
setTrackXscaleParam(trackList[[5]], "draw", TRUE)
## plot the tracks
vp <- viewTracks(trackList, gr=gr, viewerStyle=viewerStyle)
### add guide lines to show the range of CLIP-seq signal
addGuideLine(c(108481252, 108481887), vp=vp)
### add arrow mark to show the alternative splicing event
addArrowMark(list(x=c(108483570, 108483570), 
                  y=c(3, 4)), 
             label=c("Inclusive\nexon", ""), 
             col=c("blue", "cyan"), 
             vp=vp, quadrant=1)

trackViewer官方文檔:
http://www.bioconductor.org/packages/release/bioc/vignettes/trackViewer/inst/doc/trackViewer.html

motif分析

motif一般利用MEME Suite或者Homer來(lái)進(jìn)行motif搜索,然后利用motifStack包進(jìn)行可視化
motifStack官方文檔:
http://www.bioconductor.org/packages/release/bioc/vignettes/motifStack/inst/doc/motifStack_HTML.html

library(motifStack)
pcm <- read.table(file.path(find.package("motifStack"), 
                            "extdata", "bin_SOLEXA.pcm"))
pcm <- pcm[,3:ncol(pcm)]
rownames(pcm) <- c("A","C","G","T")
motif <- new("pcm", mat=as.matrix(pcm), name="bin_SOLEXA")
plot(motif)
## plot a RNA sequence logo
rna <- pcm
rownames(rna)[4] <- "U"
motif <- new("pcm", mat=as.matrix(rna), name="RNA_motif")
plot(motif)

其中,這個(gè)包:https://www.plob.org/article/15953.html也有提到用法

或者我之前推送的:http://www.itdecent.cn/p/e7983c461635也有部分講解

其中pcm結(jié)尾的文件是個(gè)打分矩陣,可在JASPAR數(shù)據(jù)庫(kù) (http://jaspar.genereg.net/) 下載

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容