利用 Dirichlet-multinomial regression 計(jì)算不同條件下亞群豐度變化

image.png

方法來源于上面這篇文章,不得不說,這篇文章運(yùn)用了非常多復(fù)雜的方法去闡述關(guān)注的科學(xué)問題,真不愧是出自Broad institute實(shí)驗(yàn)室的。我這里暫時(shí)只講下文章中一種比較新穎的比較不同條件下亞群豐度變化的方法。

首先我們先了解下Dirichlet-multinomial regression
讓我們從數(shù)學(xué)層面開始:
假設(shè)從正常組織取了sample i,正常組織本身包含了p種cell type,假設(shè)各種cell type出現(xiàn)的概率為

Cell probability
, sample i中各種cell type出現(xiàn)的數(shù)目為
Number of cell type

sample i中共有N個(gè)cell,全部細(xì)胞的總和就是,
Total number of cells

那么匯總下,這個(gè)sample的概率分布就為

其中

是 sample i 對(duì)應(yīng)的多項(xiàng)分布的參數(shù), 滿足

對(duì)于多項(xiàng)分布, 其均值和方差均可由該參數(shù)決定, 即設(shè) Xij 是相應(yīng)位置的隨機(jī)變量, 則有
引入Dirichlet概率分布

對(duì)于同一個(gè)tissue不同的sample,不同sample之間的成分參數(shù)

可能會(huì)相差很大,具有‘超散布性’ (overdispersion),也就是說觀測(cè)到的不同樣本中成分參數(shù)的方差會(huì)顯著大于多項(xiàng)分布模型下給出的方差。如何在建模中考慮這種超散布性,并且不給模型增加太多參數(shù)呢?我們假設(shè)不同樣本的成分參數(shù)來自同一隨機(jī)變量的不同實(shí)現(xiàn),那么我們可以使用Dirichlet分布來model 成分參數(shù)分布。



其中

是Dirichlet分布參數(shù),滿足

結(jié)合前面的multinomial distribution,我們可以寫出 DM 模型的概率分布為

DM 模型的另一種等價(jià)的參數(shù)化方法是令

則概率分布可寫為
這種參數(shù)化方式的好處是參數(shù)的含義更易解釋, 因?yàn)橛?div id="u0z1t8os" class="image-package">

因此,對(duì)于包含多個(gè)cell type的tissue sample data,可以通過fit DM 模型的方式對(duì)兩個(gè)sample中各個(gè)cell type abundance進(jìn)行統(tǒng)計(jì)檢驗(yàn)。
上面發(fā)表在Cell上的這篇文章就是用到的就是DirichletReg這個(gè)包,來檢驗(yàn)正常人組織 VS 病人組織non-inflamed sample間 哪些cell type abundance存在明顯差異,以及檢驗(yàn)正常人組織 VS 病人組織inflamed sample間 哪些cell type abundance存在明顯差異。

這里是示例代碼
# source code downloaded from:
# https://github.com/cssmillie/ulcerative_colitis
# there is a 'ulcerative_colitis-master' folder which contains a number of r code scripts

## load required libraries (analysis.r code includes all the libraries for running all analyses,
## but as here only shows the Cellular Composition difference test, only libraries listed below need to be loaded.
library(Seurat)
library(RColorBrewer) #for brewer.pal
library(Matrix) #for Matrix
library(DirichletReg)
library(data.table)
library(tidyverse)
library(cowplot)

## this function is extracted from analysis.r 
dirichlet_regression = function(counts, covariates, formula){  
  # Dirichlet multinomial regression to detect changes in cell frequencies
  # formula is not quoted, example: counts ~ condition
  # counts is a [samples x cell types] matrix
  # covariates holds additional data to use in the regression
  #
  # Example:
  # counts = do.call(cbind, tapply(seur@data.info$orig.ident, seur@ident, table))
  # covariates = data.frame(condition=gsub('[12].*', '', rownames(counts)))
  # res = dirichlet_regression(counts, covariates, counts ~ condition)
  
  #ep.pvals = dirichlet_regression(counts=ep.freq, covariates=ep.cov, formula=counts ~ condition)$pvals

  # Calculate regression
  counts = as.data.frame(counts)
  counts$counts = DR_data(counts)
  data = cbind(counts, covariates)
  fit = DirichReg(counts ~ condition, data)
  
  # Get p-values
  u = summary(fit)
  #compared with healthy condition, 15 vars. noninflame and inflame, 30pvalues
  pvals = u$coef.mat[grep('Intercept', rownames(u$coef.mat), invert=T), 4] 
  v = names(pvals)
  pvals = matrix(pvals, ncol=length(u$varnames))
  rownames(pvals) = gsub('condition', '', v[1:nrow(pvals)])
  colnames(pvals) = u$varnames
  fit$pvals = pvals
  
  fit
}



## not all **.r in analysis.r need to be 'source'.
## only these three below are required for cellular composition difference test.
source('mtx.r') #for sparse_cbind
source('plot.r') #for matrix_barplot
source('colors.r') #for set.colors

## Load metadata for discovery and validation cohorts
## data downloaded from: 
## https://portals.broadinstitute.org/single_cell/study/SCP259
meta = read.table('all.meta2.txt', sep='\t', header=T, row.names=1, stringsAsFactors=F)

# Read a list of cell subsets, including the group that each one belongs to
# Groups include: Epithelial, Endothelial, Fibroblasts, Glia, Myeloid, B, and T cells
cell_subsets = read.table('cell_subsets.txt', sep='\t', header=F, stringsAsFactors=F)

# load seurat objects
# data downloaded from:
# https://www.dropbox.com/sh/dn4gwdww8pmfebf/AACXYu8rda5LoLwuCZ8aZXfma?dl=0
   
epi.seur = readRDS('train.Epi.seur.rds')
fib.seur = readRDS('train.Fib.seur.rds')
imm.seur = readRDS('train.Imm.seur.rds')
  
# set counts matrices
epi.counts = epi.seur@assays[['RNA']]@counts
fib.counts = fib.seur@assays[['RNA']]@counts
imm.counts = imm.seur@assays[['RNA']]@counts
  

# Use dirichlet-multinomial regression to find significant changes in cell frequencies during disease
# ---------------------------------------------------------------------------------------------------

# Count each cell subset in every sample
epi.freq = as.matrix(as.data.frame.matrix(table(epi.seur@meta.data$Sample, epi.seur@meta.data$Cluster)))
fib.freq = as.matrix(as.data.frame.matrix(table(fib.seur@meta.data$Sample, fib.seur@meta.data$Cluster)))
imm.freq = as.matrix(as.data.frame.matrix(table(imm.seur@meta.data$Sample, imm.seur@meta.data$Cluster)))

# Combine counts into a single matrix
all.freq = sparse_cbind(list(epi.freq, fib.freq, imm.freq))

# For the validation cohort, we need to combine the replicate samples (because they are not independent)
# To construct a list of replicates, we remove "1", "2", "a", and "b" from the sample IDs
reps = gsub('[12]*[ab]*$', '', rownames(all.freq))
temp = as.matrix(data.frame(aggregate(as.matrix(all.freq), list(reps), sum), row.names=1))
colnames(temp) = colnames(all.freq)
all.freq = temp[,colSums(temp) > 0]

# Split matrix into "epithelial" and "lamina propria" cell subsets and samples
ep.ident = levels(epi.seur@active.ident)
lp.ident = c(levels(fib.seur@active.ident), levels(imm.seur@active.ident))
ep.freq = all.freq[grep('Epi', rownames(all.freq)), ep.ident]
lp.freq = all.freq[grep('LP', rownames(all.freq)), lp.ident]

# For the dirichlet-multinomial regression, we need to know the disease state for each sample
# We can get this from the metadata table as follows:
sample2health = data.frame(unique(data.frame(sample=gsub('[12]*[ab]*$', '', meta[,'Sample']), health=meta[,'Health'])), row.names=1)
ep.cov = data.frame(condition=factor(sample2health[rownames(ep.freq),1], levels=c('Healthy', 'Non-inflamed', 'Inflamed')), row.names=rownames(ep.freq))
lp.cov = data.frame(condition=factor(sample2health[rownames(lp.freq),1], levels=c('Healthy', 'Non-inflamed', 'Inflamed')), row.names=rownames(lp.freq))

# Calculate significant changes using dirichlet multinomial regression
# This returns a matrix of p-values for each cell type / disease state
# 3 condition
ep.pvals = dirichlet_regression(counts=ep.freq, covariates=ep.cov, formula=counts ~ condition)$pvals
colnames(ep.pvals) = colnames(ep.freq)
lp.pvals = dirichlet_regression(counts=lp.freq, covariates=lp.cov, formula=counts ~ condition)$pvals
colnames(lp.pvals) = colnames(lp.freq)

# Plot epithelial cell proportions
ep.pct = 100*ep.freq/rowSums(ep.freq)
p1 = matrix_barplot(ep.pct, group_by=ep.cov$condition, pvals=ep.pvals, colors=set.colors)
save_plot(p1, file='train.Fig2A.epi_freqs.pdf', nrow=1, ncol=2.5)

# Plot lamina propria cell proportions
lp.pct = 100*lp.freq/rowSums(lp.freq)
p2 = matrix_barplot(lp.pct, group_by=lp.cov$condition, pvals=lp.pvals, colors=set.colors)
save_plot(p2, file='train.Fig2A.lp_freqs.pdf', nrow=1, ncol=2.5)
用自己的數(shù)據(jù)計(jì)算下
My scRNA-seq data
總結(jié)下,這篇文章運(yùn)用了非常多計(jì)算方法,當(dāng)然自己原創(chuàng)的部分也占據(jù)了很大部分,有必要好好學(xué)習(xí)下。

參考1:https://zhuanlan.zhihu.com/p/341941329
參考2:https://www.stat-center.pku.edu.cn/docs/20190226150810864721.pdf
原文:https://pubmed.ncbi.nlm.nih.gov/31348891/

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容