前言
Immugent最近在讀文獻(xiàn)時(shí)一直遇到各種利用deconvolution基于單細(xì)胞參考數(shù)據(jù)集進(jìn)行bulk細(xì)胞定量的文章,于是專門系統(tǒng)的學(xué)習(xí)了一下此類常用的幾個(gè)軟件,今天就做一個(gè)簡單的分享。
其實(shí)在之前IOBR:一個(gè)R包帶你走進(jìn)數(shù)據(jù)挖掘的殿堂一文中就已經(jīng)系統(tǒng)的介紹各種基于參考基因集進(jìn)行bulk細(xì)胞定量的算法,但是這些算法都是bulk對bulk,要么對基因的選擇要求很高,要么識別的細(xì)胞亞群有限。而近些年大量單細(xì)胞數(shù)據(jù)集的出現(xiàn),讓我們想到如果利用定義好的單個(gè)細(xì)胞水平的數(shù)據(jù)構(gòu)建reference,結(jié)合deconvolution對bulk數(shù)據(jù)進(jìn)行細(xì)胞定量,那豈不是美滋滋。
于是deconvolution的門面算法之一CIBERSORT為實(shí)現(xiàn)這個(gè)理念,就在2019年推出了新版CIBERSORTx,相應(yīng)文章發(fā)表在了Nature Biotec hnology (IF:54.9),這個(gè)雜志可謂生信甚至實(shí)驗(yàn)技術(shù)的天花板了,可見其價(jià)值之高。
就當(dāng)小編興致沖沖打算用CIBERSORTx玩一圈TCGA數(shù)據(jù)時(shí),出現(xiàn)了很多bug,因?yàn)槟壳斑@個(gè)軟件只有網(wǎng)頁版的工具,想要做分析必須將自己的數(shù)據(jù)傳上去,然鵝系統(tǒng)并不能識別新上傳的文件,就很難受了。雖然網(wǎng)頁友好的給出了鏡像文件,但是又搞了個(gè)什么需要用郵箱申請密鑰,于是小編等了好幾天都沒等到,干脆不等了。
于是小編開始尋求新的此類軟件,就發(fā)現(xiàn)了同年發(fā)表的另一個(gè)軟件MuSiC,雖然雜志沒有上一個(gè)好,但也算不錯(cuò)了,而且兩年內(nèi)被引用了幾百次了,可見這個(gè)算法還是值得信賴的。
最難能可貴的是這個(gè)軟件是一個(gè)非常容易安裝的R包,運(yùn)算也很快,非常的輕便。最給力的是它還可以對小鼠和大鼠的bulk數(shù)據(jù)進(jìn)行deconvolution,因此深得Immugent的喜愛,就走了一下分析流程。
主要流程
作者給出了以下幾套數(shù)據(jù)集(都是已經(jīng)發(fā)表的GE數(shù)據(jù)),以便于熟悉此軟件的用法。setwd("D:\\Music")
library(MuSiC)
# bulk data
GSE50244.bulk.eset = readRDS('GSE50244bulkeset.rds')
GSE50244.bulk.eset
# EMTAB single cell dataset
EMTAB.eset = readRDS('EMTABesethealthy.rds')
EMTAB.eset
# Download Xin et al. single cell dataset from Github
XinT2D.eset = readRDS('XinT2Deset.rds')
XinT2D.eset
# Estimate cell type proportions
Est.prop.GSE50244 = music_prop(bulk.eset = GSE50244.bulk.eset, sc.eset = EMTAB.eset, clusters = 'cellType',
samples = 'sampleID', select.ct = c('alpha', 'beta', 'delta', 'gamma',
'acinar', 'ductal'), verbose = F)
names(Est.prop.GSE50244)
#[1] "Est.prop.weighted" "Est.prop.allgene" "Weight.gene" "r.squared.full" "Var.prop"
# Jitter plot of estimated cell type proportions
jitter.fig = Jitter_Est(list(data.matrix(Est.prop.GSE50244$Est.prop.weighted),
data.matrix(Est.prop.GSE50244$Est.prop.allgene)),
method.name = c('MuSiC', 'NNLS'), title = 'Jitter plot of Est Proportions')
# A more sophisticated jitter plot is provided as below. We seperated the T2D subjects and normal
#subjects by their HbA1c levels.
library(reshape2)
m.prop.GSE50244 = rbind(melt(Est.prop.GSE50244$Est.prop.weighted),
melt(Est.prop.GSE50244$Est.prop.allgene))
colnames(m.prop.GSE50244) = c('Sub', 'CellType', 'Prop')
m.prop.GSE50244$CellType = factor(m.prop.GSE50244$CellType, levels = c('alpha', 'beta', 'delta', 'gamma', 'acinar', 'ductal'))
m.prop.GSE50244$Method = factor(rep(c('MuSiC', 'NNLS'), each = 89*6), levels = c('MuSiC', 'NNLS'))
m.prop.GSE50244$HbA1c = rep(GSE50244.bulk.eset$hba1c, 2*6)
m.prop.GSE50244 = m.prop.GSE50244[!is.na(m.prop.GSE50244$HbA1c), ]
m.prop.GSE50244$Disease = factor(c('Normal', 'T2D')[(m.prop.GSE50244$HbA1c > 6.5)+1], levels = c('Normal', 'T2D'))
m.prop.GSE50244$D = (m.prop.GSE50244$Disease == 'T2D')/5
m.prop.GSE50244 = rbind(subset(m.prop.GSE50244, Disease == 'Normal'), subset(m.prop.GSE50244, Disease != 'Normal'))
jitter.new = ggplot(m.prop.GSE50244, aes(Method, Prop)) +
geom_point(aes(fill = Method, color = Disease, stroke = D, shape = Disease),
size = 2, alpha = 0.7, position = position_jitter(width=0.25, height=0)) +
facet_wrap(~ CellType, scales = 'free') + scale_colour_manual( values = c('white', "gray20")) +
scale_shape_manual(values = c(21, 24))+ theme_minimal()
library(cowplot)
plot_grid(jitter.fig, jitter.new, labels = 'auto')
簡單一個(gè)函數(shù)即可實(shí)現(xiàn)基于單細(xì)胞數(shù)據(jù)對bulk進(jìn)行細(xì)胞定可謂非常輕便,除了使用自身算法進(jìn)行定量,MuSiC還嵌入了非負(fù)矩陣的算法來進(jìn)行對比,從上圖可以看出兩者間相似性非常高。
關(guān)聯(lián)分析
眾所周知,β細(xì)胞比例與T2D疾病狀態(tài)有關(guān)。在T2D過程中,β細(xì)胞數(shù)量減少。其中最重要的T2D檢測是糖化血紅蛋白(HbA1c)檢測。當(dāng)糖化血紅蛋白水平大于6.5%時(shí),診斷為T2D。我們可以看β細(xì)胞比例與糖化血紅蛋白水平的關(guān)系。
# Create dataframe for beta cell proportions and HbA1c levels
m.prop.ana = data.frame(pData(GSE50244.bulk.eset)[rep(1:89, 2), c('age', 'bmi', 'hba1c', 'gender')],
ct.prop = c(Est.prop.GSE50244$Est.prop.weighted[, 2],
Est.prop.GSE50244$Est.prop.allgene[, 2]),
Method = factor(rep(c('MuSiC', 'NNLS'), each = 89),
levels = c('MuSiC', 'NNLS')))
colnames(m.prop.ana)[1:4] = c('Age', 'BMI', 'HbA1c', 'Gender')
m.prop.ana = subset(m.prop.ana, !is.na(HbA1c))
m.prop.ana$Disease = factor( c('Normal', 'T2D')[(m.prop.ana$HbA1c > 6.5) + 1], c('Normal', 'T2D') )
m.prop.ana$D = (m.prop.ana$Disease == 'T2D')/5
ggplot(m.prop.ana, aes(HbA1c, ct.prop)) +
geom_smooth(method = 'lm', se = FALSE, col = 'black', lwd = 0.25) +
geom_point(aes(fill = Method, color = Disease, stroke = D, shape = Disease), size = 2, alpha = 0.7) + facet_wrap(~ Method) +
ggtitle('HbA1c vs Beta Cell Type Proportion') + theme_minimal() +
scale_colour_manual( values = c('white', "gray20")) +
scale_shape_manual(values = c(21, 24))
lm.beta.MuSiC = lm(ct.prop ~ HbA1c + Age + BMI + Gender, data = subset(m.prop.ana, Method == 'MuSiC'))
lm.beta.NNLS = lm(ct.prop ~ HbA1c + Age + BMI + Gender, data = subset(m.prop.ana, Method == 'NNLS'))
summary(lm.beta.MuSiC)
summary(lm.beta.NNLS)
從上圖我們可以看出β細(xì)胞比例與糖化血紅蛋白水平呈現(xiàn)明顯的負(fù)調(diào)節(jié)關(guān)系,而且MuSiC算出的系數(shù)比非負(fù)矩陣更顯著。
對小鼠bulk數(shù)據(jù)進(jìn)行分析
# Mouse bulk dataset
Mouse.bulk.eset = readRDS('Mousebulkeset.rds')
Mouse.bulk.eset
# Download EMTAB single cell dataset from Github
Mousesub.eset = readRDS('Mousesubeset.rds')
Mousesub.eset
levels(Mousesub.eset$cellType)
# [1] "Endo" "Podo" "PT" "LOH" "DCT" "CD-PC" "CD-IC" "CD-Trans" "Novel1"
#[10] "Fib" "Macro" "Neutro" "B lymph" "T lymph" "NK" "Novel2"
# Produce the first step information
Mousesub.basis = music_basis(Mousesub.eset, clusters = 'cellType', samples = 'sampleID',
select.ct = c('Endo', 'Podo', 'PT', 'LOH', 'DCT', 'CD-PC', 'CD-IC', 'Fib',
'Macro', 'Neutro','B lymph', 'T lymph', 'NK'))
# Plot the dendrogram of design matrix and cross-subject mean of realtive abundance
par(mfrow = c(1, 2))
d <- dist(t(log(Mousesub.basis$Disgn.mtx + 1e-6)), method = "euclidean")
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete" )
# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1, main = 'Cluster log(Design Matrix)')
d <- dist(t(log(Mousesub.basis$M.theta + 1e-8)), method = "euclidean")
# Hierarchical clustering using Complete Linkage
# hc2 <- hclust(d, method = "complete" )
hc2 <- hclust(d, method = "complete")
# Plot the obtained dendrogram
plot(hc2, cex = 0.6, hang = -1, main = 'Cluster log(Mean of RA)')
這個(gè)圖是對不同細(xì)胞進(jìn)行聚類,可以看出免疫細(xì)胞和非免疫細(xì)胞聚類不同。
clusters.type = list(C1 = 'Neutro', C2 = 'Podo', C3 = c('Endo', 'CD-PC', 'LOH', 'CD-IC', 'DCT', 'PT'), C4 = c('Macro', 'Fib', 'B lymph', 'NK', 'T lymph'))
cl.type = as.character(Mousesub.eset$cellType)
for(cl in 1:length(clusters.type)){
cl.type[cl.type %in% clusters.type[[cl]]] = names(clusters.type)[cl]
}
pData(Mousesub.eset)$clusterType = factor(cl.type, levels = c(names(clusters.type), 'CD-Trans', 'Novel1', 'Novel2'))
# 13 selected cell types
s.mouse = unlist(clusters.type)
s.mouse
load('IEmarkers.RData')
# This RData file provides two vectors of gene names Epith.marker and Immune.marker
# We now construct the list of group marker
IEmarkers = list(NULL, NULL, Epith.marker, Immune.marker)
names(IEmarkers) = c('C1', 'C2', 'C3', 'C4')
# The name of group markers should be the same as the cluster names
Est.mouse.bulk = music_prop.cluster(bulk.eset = Mouse.bulk.eset, sc.eset = Mousesub.eset, group.markers = IEmarkers,
clusters = 'cellType', groups = 'clusterType', samples = 'sampleID', clusters.type = clusters.type)
用已知細(xì)胞構(gòu)成對MuSiC定量結(jié)果進(jìn)行評估
XinT2D.construct.full = bulk_construct(XinT2D.eset, clusters = 'cellType', samples = 'SubjectName')
names(XinT2D.construct.full)
# [1] "Bulk.counts" "num.real"
XinT2D.construct.full$Bulk.counts
head(XinT2D.construct.full$num.real)
XinT2D.construct.full$prop.real = relative.ab(XinT2D.construct.full$num.real, by.col = FALSE)
head(XinT2D.construct.full$prop.real)
# Estimate cell type proportions of artificial bulk data
Est.prop.Xin = music_prop(bulk.eset = XinT2D.construct.full$Bulk.counts, sc.eset = EMTAB.eset,
clusters = 'cellType', samples = 'sampleID',
select.ct = c('alpha', 'beta', 'delta', 'gamma'))
# Estimation evaluation
Eval_multi(prop.real = data.matrix(XinT2D.construct.full$prop.real),
prop.est = list(data.matrix(Est.prop.Xin$Est.prop.weighted),
data.matrix(Est.prop.Xin$Est.prop.allgene)),
method.name = c('MuSiC', 'NNLS'))
prop.comp.fig = Prop_comp_multi(prop.real = data.matrix(XinT2D.construct.full$prop.real),
prop.est = list(data.matrix(Est.prop.Xin$Est.prop.weighted),
data.matrix(Est.prop.Xin$Est.prop.allgene)),
method.name = c('MuSiC', 'NNLS'),
title = 'Heatmap of Real and Est. Prop' )
abs.diff.fig = Abs_diff_multi(prop.real = data.matrix(XinT2D.construct.full$prop.real),
prop.est = list(data.matrix(Est.prop.Xin$Est.prop.weighted),
data.matrix(Est.prop.Xin$Est.prop.allgene)),
method.name = c('MuSiC', 'NNLS'),
title = 'Abs.Diff between Real and Est. Prop' )
plot_grid(prop.comp.fig, abs.diff.fig, labels = "auto", rel_widths = c(4,3))
從這個(gè)圖中我們可以看出基于MuSiC算出的各種細(xì)胞比例和實(shí)際細(xì)胞比例相似度很高,比非負(fù)矩陣的結(jié)果要好很多。在原文中作者還和 CIBERSORT (Newman et al. 2015) and bseq-sc ( Baron et al. 2016)都進(jìn)行了對比,感興趣的小伙伴可以去看看原文。
小結(jié)
雖然小編夸了MuSiC一路,但是還是要實(shí)事求是的看待這個(gè)軟件,作者上述的結(jié)果肯定都是選擇利于自身呈現(xiàn)軟件功能進(jìn)行挑選過的。在實(shí)際應(yīng)用過程中小編發(fā)現(xiàn)MuSiC有一些問題還需要理性對待,如它對很多比例很低的細(xì)胞群識別能力很差;而且出現(xiàn)要基于整個(gè)組織全部細(xì)胞進(jìn)行計(jì)算才比較準(zhǔn)確,無法自定義檢測某一些特殊的細(xì)胞亞群。但是好在這個(gè)軟件很輕便,我們可以結(jié)合實(shí)驗(yàn)或者其它算法來綜合使用。