
當(dāng)我第一眼看到時(shí),驚艷到了?。?!如同登徒子見了久違的天人。與ComplexHeatmap包開發(fā)者為同一人,都是顧大神的杰作。
將原文摘錄如下:
此圖展現(xiàn)的是來(lái)自Roadmap數(shù)據(jù)集的表觀遺傳信號(hào)的整體關(guān)聯(lián)。以下是這些復(fù)雜關(guān)系的方法和配置,此文重點(diǎn)放在可視化部分,數(shù)據(jù)處理簡(jiǎn)要介紹。
(1)加載R包和預(yù)處理的R對(duì)象
library(EnrichedHeatmap)
library(GetoptLong)
library(circlize)
library(RColorBrewer)
download.file("https://jokergoo.github.io/supplementary/EnrichedHeatmap-supplementary/roadmap_normalized_matrices.RData",
destfile = "roadmap_normalized_matrices.RData")
load("roadmap_normalized_matrices.RData")
file.remove("roadmap_normalized_matrices.RData")
(2) 數(shù)據(jù)概況
Roadmap數(shù)據(jù)集(http://egg2.wustl.edu/roadmap/web_portal/) 涵蓋了不同的人類細(xì)胞類型和組織,并已統(tǒng)一處理過(guò)。Roadmap包含DNA甲基化、基因表達(dá)的RNA測(cè)序和不同組蛋白修飾的ChIP測(cè)序數(shù)據(jù)。
如我們所見,相比于組蛋白修飾,基因表達(dá)與DNA甲基化表現(xiàn)出強(qiáng)的關(guān)聯(lián),因而整體的整合分析以基因表達(dá)和甲基化為中心,關(guān)聯(lián)其它的組蛋白修飾信號(hào)。也就是先看基因表達(dá)和甲基化關(guān)聯(lián)的區(qū)域,然后再在這些關(guān)聯(lián)的區(qū)域檢測(cè)組蛋白修飾信號(hào)是關(guān)聯(lián)還是不關(guān)聯(lián)。
我們僅用了Roadmap中的27個(gè)樣本,它們高質(zhì)量的匹配了表達(dá)和甲基化數(shù)據(jù),并在表達(dá)和甲基化數(shù)據(jù)上都有一致的亞群。
SAMPLE包含了樣本的注釋信息,而COLOR表示注釋的對(duì)應(yīng)的顏色。
SAMPLE
## id group sample_type subgroup
## E016 E016 ESC PrimaryCulture subgroup1
## E003 E003 ESC PrimaryCulture subgroup1
## E024 E024 ESC PrimaryCulture subgroup1
## E007 E007 ES-deriv ESCDerived subgroup1
## E013 E013 ES-deriv ESCDerived subgroup1
## E012 E012 ES-deriv ESCDerived subgroup1
## E011 E011 ES-deriv ESCDerived subgroup1
## E004 E004 ES-deriv ESCDerived subgroup1
## E005 E005 ES-deriv ESCDerived subgroup1
## E006 E006 ES-deriv ESCDerived subgroup1
## E050 E050 HSC_&_B-cell PrimaryCell subgroup2
## E112 E112 Thymus PrimaryTissue subgroup2
## E071 E071 Brain PrimaryTissue subgroup2
## E100 E100 Muscle PrimaryTissue subgroup2
## E104 E104 Heart PrimaryTissue subgroup2
## E095 E095 Heart PrimaryTissue subgroup2
## E105 E105 Heart PrimaryTissue subgroup2
## E065 E065 Heart PrimaryTissue subgroup2
## E109 E109 Digestive PrimaryTissue subgroup2
## E106 E106 Digestive PrimaryTissue subgroup2
## E079 E079 Digestive PrimaryTissue subgroup2
## E094 E094 Digestive PrimaryTissue subgroup2
## E097 E097 Other PrimaryTissue subgroup2
## E066 E066 Other PrimaryTissue subgroup2
## E098 E098 Other PrimaryTissue subgroup2
## E096 E096 Other PrimaryTissue subgroup2
## E113 E113 Other PrimaryTissue subgroup2
COLOR
## $group
## ESC ES-deriv HSC_&_B-cell Thymus Brain Muscle Heart
## "#924965" "#4178AE" "#678C69" "#DAB92E" "#C5912B" "#C2655D" "#D56F80"
## Digestive Other
## "#C58DAA" "#999999"
##
## $sample_type
## PrimaryCulture ESCDerived PrimaryCell PrimaryTissue
## "#8cd2c7" "#bfb9db" "#faf7b4" "#f57f73"
##
## $subgroup
## subgroup1 subgroup2
## "#A6CEE3" "#1F78B4"
27個(gè)樣本被分成了兩個(gè)亞群,一個(gè)是胚胎干細(xì)胞,另一個(gè)是其它(包括原代組織和成熟細(xì)胞)。這兩個(gè)亞群的樣本有截然不同的表達(dá)和甲基化數(shù)據(jù)集。
在此,我們?cè)诨騎SS上游5kb和下游10kb附近富集的不同的表觀基因組信息進(jìn)行可視化。所有的表觀基因組信號(hào)都已經(jīng)歸一化,儲(chǔ)存在R對(duì)象中。這是一些對(duì)象的定義:
- mat_neg_cr: a normalized matrix for correlated regions (CRs, The definition of CR is introduced in following sections) showing significant negative correlation between methylation and gene expression. The value in the matrix is whether a window is covered by negative CRs (values are 0 or 1).
- meth_mat_corr: a normalized matrix for all CRs. The value in the matrix is the mean correlation for CRs overlapped to a window.
- meth_mat_mean: a normalized matrix for mean methylation across all samples.
- meth_mat_diff: a normalized matrix for mean methylation difference between two subgroups.
- mat_cgi: a normalized matrix for CpG islands. The value in the matrix is whether a window is covered by CGIs.
- hist_mat_corr_list: a list of normalized matrices for the correlation between histone modification signals and gene expression. Each matrix corresponds to one type of histone modification.
- hist_mat_mean_list: a list of normalized matrices for the mean histone modification signals across all samples.
- hist_mat_diff_list: a list of normalized matrices for the histone modification signal difference between two subgroups.
Other R objects are
- gene: A GRanges object which contains positions of genes.
- tss: A GRanges object which contains positions of gene TSS.
- gene_symbol a mapping between Ensembl IDs and gene symbols.
- expr: gene expression matrix (Values are measured by log2(RPKM + 1)).
(3)對(duì)基因進(jìn)行排序和分類
先根據(jù)表達(dá)量分成高低兩類:
expr_mean = rowMeans(expr[, SAMPLE$subgroup == "subgroup1"]) -
rowMeans(expr[, SAMPLE$subgroup == "subgroup2"])
expr_split = ifelse(expr_mean > 0, "high", "low")
expr_split = factor(expr_split, levels = c("high", "low"))
然后根據(jù)甲基化提取TSS上游20%和下游40%的兩類:
set.seed(123)
upstream_index = length(attr(meth_mat_mean, "upstream_index"))
meth_split = kmeans(meth_mat_mean[, seq(round(upstream_index*0.8), round(upstream_index*1.4))],
centers = 2)$cluster
x = tapply(rowMeans(meth_mat_mean[, seq(round(upstream_index*0.8), round(upstream_index*1.4))]),
meth_split, mean)
od = structure(order(x), names = names(x))
meth_split = paste0("cluster", od[as.character(meth_split)])
把表達(dá)和甲基化的基因信息放到一起:
combined_split = paste(meth_split, expr_split, sep = "|")
tb = table(combined_split)
tb
## combined_split
## cluster1|high cluster1|low cluster2|high cluster2|low
## 306 940 25 561
tb["cluster2|high"]/sum(tb)
## cluster2|high
## 0.01364629
去掉cluster2|high 組
l = combined_split != "cluster2|high"
tss = tss[l]
expr = expr[l, ]
hist_mat_corr_list = lapply(hist_mat_corr_list, function(x) x[l, ])
hist_mat_mean_list = lapply(hist_mat_mean_list, function(x) x[l, ])
hist_mat_diff_list = lapply(hist_mat_diff_list, function(x) x[l, ])
mat_neg_cr = mat_neg_cr[l, ]
mat_cgi = mat_cgi[l, ]
meth_mat_corr = meth_mat_corr[l, ]
meth_mat_mean = meth_mat_mean[l, ]
meth_mat_diff = meth_mat_diff[l, ]
expr_split = expr_split[l]
meth_split = meth_split[l]
combined_split = combined_split[l]
n_row_cluster = length(unique(combined_split))
計(jì)算基因的行序:
merge_row_order = function(l_list) {
do.call("c", lapply(l_list, function(l) {
if(sum(l) == 0) return(integer(0))
if(sum(l) == 1) return(which(l))
dend1 = as.dendrogram(hclust(dist_by_closeness(mat_neg_cr[l, ])))
dend1 = reorder(dend1, -enriched_score(mat_neg_cr[l, ]))
od = order.dendrogram(dend1)
which(l)[od]
}))
}
row_order = merge_row_order(list(
combined_split == "cluster1|high",
combined_split == "cluster1|low",
combined_split == "cluster2|low"
))
組織熱圖
表達(dá)熱圖
dend1 = as.dendrogram(hclust(dist(t(expr[, SAMPLE$subgroup == "subgroup1"]))))
hc1 = as.hclust(reorder(dend1, colMeans(expr[, SAMPLE$subgroup == "subgroup1"])))
expr_col_od1 = hc1$order
dend2 = as.dendrogram(hclust(dist(t(expr[, SAMPLE$subgroup == "subgroup2"]))))
hc2 = as.hclust(reorder(dend2, colMeans(expr[, SAMPLE$subgroup == "subgroup2"])))
expr_col_od2 = hc2$order
expr_col_od = c(which(SAMPLE$subgroup == "subgroup1")[expr_col_od1],
which(SAMPLE$subgroup == "subgroup2")[expr_col_od2])
把樣本注釋放到頂端:
ht_list = Heatmap(expr, name = "expr", show_row_names = FALSE,
show_column_names = FALSE, width = unit(4, "cm"), show_column_dend = FALSE,
cluster_columns = FALSE, column_order = expr_col_od,
top_annotation = HeatmapAnnotation(df = SAMPLE[, -1], col = COLOR,
annotation_name_side = "left"),
column_title = "Expression", column_title_gp = gpar(fontsize = 12),
show_row_dend = FALSE, use_raster = TRUE)
提取用T檢驗(yàn)最顯著前20個(gè)基因
library(genefilter)
df = rowttests(expr, factor(SAMPLE$subgroup))
top_genes = rownames(df[order(df$p.value)[1:20], ])
為這些基因添加文本注釋
index = which(rownames(expr) %in% top_genes)
labels = gene_symbol[rownames(expr)[index]]
ht_list = rowAnnotation(sig_gene = anno_mark(at = index, labels = labels,
side = "left", labels_gp = gpar(fontsize = 10),
extend = unit(c(1, 0), "cm"))) + ht_list
右邊添加基因長(zhǎng)度的行注釋
gl = width(gene[names(tss)])
gl[gl > quantile(gl, 0.95)] = quantile(gl, 0.95)
ht_list = ht_list + rowAnnotation(gene_len = anno_points(gl, size = unit(1, "mm"),
gp = gpar(col = "#00000040"),
axis_param = list(at = c(0, 1e5, 2e5), labels = c("0bp", "100bp", "200bp")),
width = unit(1.5, "cm")))
加入表明CGI到TSS富集的熱圖
axis_name = c("-5kb", "TSS", "10kb")
ht_list = ht_list + EnrichedHeatmap(mat_cgi, col = c("white", "darkorange"), name = "CGI",
column_title = "CGI", column_title_gp = gpar(fontsize = 12),
top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = "darkorange",
lty = 1:n_row_cluster), axis_param = list(side = "right", facing = "inside"))),
axis_name = axis_name, axis_name_gp = gpar(fontsize = 8), use_raster = TRUE)
加入甲基化和表達(dá)的熱圖
bg_col = brewer.pal(8, "Set2")
cor_col_fun = colorRamp2(c(-1, 0, 1), c("darkgreen", "white", "red"))
ht_list = ht_list + EnrichedHeatmap(meth_mat_corr, col = cor_col_fun, name = "meth_corr",
top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(pos_col = "red",
neg_col = "darkgreen", lty = 1:n_row_cluster),
axis_param = list(side = "right", facing = "inside"))),
column_title = "meth_corr", column_title_gp = gpar(fontsize = 12, fill = bg_col[1]),
axis_name = axis_name, axis_name_gp = gpar(fontsize = 8), use_raster = TRUE)
添加表明所有樣本平均甲基化的熱圖
meth_col_fun = colorRamp2(c(0, 0.5, 1), c("blue", "white", "red"))
ht_list = ht_list + EnrichedHeatmap(meth_mat_mean, col = meth_col_fun, name = "meth_mean",
column_title = "meth_mean", column_title_gp = gpar(fontsize = 12, fill = bg_col[1]),
top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = "red",
lty = 1:n_row_cluster), axis_param = list(side = "right", facing = "inside"))),
axis_name = axis_name, axis_name_gp = gpar(fontsize = 8), use_raster = TRUE)
定義函數(shù)為陽(yáng)性和陰性差異的亞群添加對(duì)稱的顏色
generate_diff_color_fun = function(x) {
q = quantile(x, c(0.05, 0.95))
max_q = max(abs(q))
colorRamp2(c(-max_q, 0, max_q), c("#3794bf", "#FFFFFF", "#df8640"))
}
ht_list = ht_list + EnrichedHeatmap(meth_mat_diff, name = "meth_diff",
col = generate_diff_color_fun(meth_mat_diff),
column_title = "meth_diff", column_title_gp = gpar(fontsize = 12, fill = bg_col[1]),
top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(pos_col = "#df8640",
neg_col = "#3794bf", lty = 1:n_row_cluster),
axis_param = list(side = "right", facing = "inside"))),
axis_name = axis_name, axis_name_gp = gpar(fontsize = 8), use_raster = TRUE)
17個(gè)熱圖太長(zhǎng)了,分成兩行。
ht_list_2 = NULL
ht_list_1 = NULL
mark_name = names(hist_mat_corr_list)
for(i in seq_along(hist_mat_corr_list)) {
# heatmaps for the 2nd, 3th and 4th histone modifications are assigned to a new `ht_list`
if(i == 2) {
ht_list_1 = ht_list
ht_list = NULL
}
ht_list = ht_list + EnrichedHeatmap(hist_mat_corr_list[[i]], col = cor_col_fun,
name = qq("@{mark_name[i]}_corr"), column_title = qq("@{mark_name[i]}_corr"),
column_title_gp = gpar(fontsize = 12, fill = bg_col[i+1]),
top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(pos_col = "red",
neg_col = "darkgreen", lty = 1:n_row_cluster),
axis_param = list(side = "right", facing = "inside"))),
axis_name = axis_name, axis_name_gp = gpar(fontsize = 8), use_raster = TRUE)
ht_list = ht_list + EnrichedHeatmap(hist_mat_mean_list[[i]],
col = colorRamp2(c(0, quantile(hist_mat_mean_list[[i]], 0.95)), c("white", "purple")),
name = qq("@{mark_name[i]}_mean"), column_title = qq("@{mark_name[i]}_mean"),
column_title_gp = gpar(fontsize = 12, fill = bg_col[i+1]),
top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = "purple",
lty = 1:n_row_cluster), axis_param = list(side = "right", facing = "inside"))),
axis_name = axis_name, axis_name_gp = gpar(fontsize = 8), use_raster = TRUE)
ht_list = ht_list + EnrichedHeatmap(hist_mat_diff_list[[i]],
col = generate_diff_color_fun(hist_mat_diff_list[[i]]),
name = qq("@{mark_name[i]}_diff"), column_title = qq("@{mark_name[i]}_diff"),
column_title_gp = gpar(fontsize = 12, fill = bg_col[i+1]),
top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(pos_col = "#df8640",
neg_col = "#3794bf", lty = 1:n_row_cluster),
axis_param = list(side = "right", facing = "inside"))),
axis_name = axis_name, axis_name_gp = gpar(fontsize = 8), use_raster = TRUE)
}
ht_list_2 = ht_list
為兩個(gè)熱圖定義同樣的分類和排序
split = as.vector(combined_split)
split[combined_split == "cluster1|high"] = "cluster1"
split[combined_split == "cluster1|low"] = "cluster2"
split[combined_split == "cluster2|low"] = "cluster3"
對(duì)每個(gè)熱圖,最左邊添加一列熱圖代表各亞群的差異表達(dá)。
ht_list_1 = Heatmap(expr_split, show_row_names = FALSE, name = "expr_diff",
col = c("high" = "red", "low" = "darkgreen"),
show_column_names = FALSE, width = unit(2, "mm")) + ht_list_1
繪制熱圖
ht_list_1 = draw(ht_list_1,
cluster_rows = FALSE, row_order = row_order, show_row_dend = FALSE,
row_split = split, heatmap_legend_side = "bottom", ht_gap = unit(2, "mm"))
add_boxplot_of_gene_length = function(ht_list) {
row_order_list = row_order(ht_list)
lt = lapply(row_order_list, function(ind) gl[ind])
bx = boxplot(lt, plot = FALSE)$stats
n = length(row_order_list)
decorate_annotation("gene_len", slice = 1, {
rg = range(bx)
rg[1] = rg[1] - (rg[2] - rg[1])*0.1
rg[2] = rg[2] + (rg[2] - rg[1])*0.1
pushViewport(viewport(y = unit(1, "npc") + unit(1, "mm"), just = "bottom",
height = unit(2, "cm"), yscale = rg, xscale = c(0.5, n + 0.5)))
grid.rect()
for(i in 1:n) {
grid.boxplot(pos = i, lt[[i]], gp = gpar(lty = i), outline = FALSE)
}
grid.text("Gene length", y = unit(1, "npc") + unit(2.5, "mm"),
gp = gpar(fontsize = 12), just = "bottom")
upViewport()
})
}
add_boxplot_of_gene_length(ht_list_1)
ht_list_2 = Heatmap(expr_split, show_row_names = FALSE, name = "expr_diff",
col = c("high" = "red", "low" = "darkgreen"),
show_column_names = FALSE, width = unit(2, "mm")) + ht_list_2
ht_list_2 = draw(ht_list_2,
cluster_rows = FALSE, row_order = row_order, show_row_dend = FALSE,
row_split = split, heatmap_legend_side = "bottom", ht_gap = unit(2, "mm"))
提取注釋圖形。
add_anno_enriched = function(ht_list, name, ri, ci) {
pushViewport(viewport(layout.pos.row = ri, layout.pos.col = ci))
extract_anno_enriched(ht_list, name, newpage = FALSE)
upViewport()
}
pushViewport(viewport(layout = grid.layout(nr = 3, nc = 6)))
add_anno_enriched(ht_list_1, "meth_corr", 1, 1)
add_anno_enriched(ht_list_1, "meth_mean", 1, 2)
add_anno_enriched(ht_list_1, "meth_diff", 1, 3)
add_anno_enriched(ht_list_1, "CGI", 1, 4)
add_anno_enriched(ht_list_1, "H3K4me3_corr", 2, 1)
add_anno_enriched(ht_list_1, "H3K4me3_mean", 2, 2)
add_anno_enriched(ht_list_1, "H3K4me3_diff", 2, 3)
add_anno_enriched(ht_list_2, "H3K4me1_corr", 2, 4)
add_anno_enriched(ht_list_2, "H3K4me1_mean", 2, 5)
add_anno_enriched(ht_list_2, "H3K4me1_diff", 2, 6)
add_anno_enriched(ht_list_2, "H3K27ac_corr", 3, 1)
add_anno_enriched(ht_list_2, "H3K27ac_mean", 3, 2)
add_anno_enriched(ht_list_2, "H3K27ac_diff", 3, 3)
add_anno_enriched(ht_list_2, "H3K27me3_corr", 3, 4)
add_anno_enriched(ht_list_2, "H3K27me3_mean", 3, 5)
add_anno_enriched(ht_list_2, "H3K27me3_diff", 3, 6)
upViewport()

圖形的解讀
Generally, genes in cluster 1 and 2 have high expression, long gene length (annotation “Gene length”) and low methylation over TSS (heatmap “meth_mean”) which correspond well with the enrichment of CpG islands over TSS (heatmap “CGI”), while genes in cluster 3 have low expression, short gene length, and intermediate mean methylation with almost none CGIs overlapping TSS. There is enrichment for significant negative CRs (negCRs) downstream of TSS in cluster 1 and cluster 2 (solid and dashed green lines in annotation of “meth_corr” heatmap, the peaks of the enrichment locate at approximately +2kb of TSS.) while for cluster 3 genes, the enrichment of negCRs is very close to TSS. By associating the heatmap “CGI”, “meth_corr”, “meth_mean” and “meth_diff” together, we can make the conclusion that for genes in cluster 1 and cluster 2, negCRs are enriched at the downstream border of CGI over TSS with high methylation variability, and even for cluster 3 genes, there is also a trend that the negCRs are enriched at close downstream of TSS. It might give hypotheses that when the transcription machine moves into the gene body from TSS, these exist some mechanism that blocks this process and reflects on the changes of methylations.
H3K4me3 is a histone mark which is enriched at active TSS or promoters. Heatmap “H3K4me3_mean” shows strong enrichment of the mean signal over TSS for cluster 1 and cluster 2 genes with high expression. Such enrichment corresponds very well to the low TSS-methylation. Interestingly, strong positive correlation to expression dominates in cluster 1 and the signals are significantly higher in embryonic cells (heatmap “H3K4me3_diff”). The peak for the enrichment of correlation signals in cluster 1 (solid red line in annotation of heatmap “H3K4me3_corr”) is broader than the mean signals while it is very similar as the enrichment peak for negCRs. For cluster 2 genes, the positive correlated regions are enriched at downstream border of H3K4me3 peaks while directly at the H3K4me3 peaks shows negative correlation although the correlation signals are weak and signal difference is small. Surprisingly, strong positive correlations dominate cluster 3 although the mean signals selves are very weak.
H3K4me1 is an active mark enriched at enhancers and promoter flanking regions. Nevertheless, it shows negative correlation at the TSS (solid and dashed green lines in annotation of heatmap “H3K4me1_corr”), especially strong for cluster 1. The peak for the negative correlation enrichment correlates well with CGI and low TSS-methylation, however the signals selves are low at TSS (heatmap “H3K4me1_mean”). Flanking TSS is dominated by positive correlations and the signal difference is comparably big in cluster 1 (solid brown line in annotation of heatmap “H3K4me1_diff”).
H3K27ac is also an active mark enriched in both active enhancers and promoters, and it generally shows positive correlations to expression in all three clusters (heatmap “H3K27ac_corr”). Interestingly the mean signals are the strongest in cluster 2 and mature cells have significantly higher signal intensity than embryonic cells (dashed blue line in annotation of heatmap “H3K27ac_diff”). The peak for the correlation signal enrichment is comparably broader than other marks.
H3K27me3 is a repressive mark and it generally shows negative correlation around TSS at relatively low level, excluding cluster 1 where there are no dominant correlation patterns (heatmap “H3K27me3_corr”). The signals selves are lower and sparser compared to other marks.