幾個scRNA找高變異基因(HVGs)的方法

劉小澤寫于19.9.10
高變異基因就是highly variable features(HVGs),就是在細(xì)胞與細(xì)胞間進(jìn)行比較,選擇表達(dá)量差別最大的

1. Seurat

參考:https://satijalab.org/seurat/v3.0/pbmc3k_tutorial.html

利用FindVariableFeatures函數(shù),會計(jì)算一個mean-variance結(jié)果,也就是給出表達(dá)量均值和方差的關(guān)系并且得到top variable features

計(jì)算方法主要有三種:

  • vst(默認(rèn)):首先利用loess對 log(variance) 和log(mean) 擬合一條直線,然后利用觀測均值和期望方差對基因表達(dá)量進(jìn)行標(biāo)準(zhǔn)化,最后根據(jù)保留最大的標(biāo)準(zhǔn)化的表達(dá)量計(jì)算方差
  • mean.var.plot: 首先利用mean.function和 dispersion.function分別計(jì)算每個基因的平均表達(dá)量和離散情況,然后根據(jù)平均表達(dá)量將基因們分散到一定數(shù)量(默認(rèn)是20個)的小區(qū)間(bin)中,并且計(jì)算每個bin中z-score
  • dispersion(最直接):挑選最高離差值的基因

例如:使用Seurat 版本3

# V3 代碼來自官方教程https://satijalab.org/seurat/v3.0/pbmc3k_tutorial.html
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
top10 <- head(VariableFeatures(pbmc), 10)

# 分別繪制帶基因名和不帶基因名的
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
CombinePlots(plots = list(plot1, plot2))

使用Seurat版本2

# V2
pbmc <- FindVariableGenes(object = pbmc, 
                         mean.function = ExpMean, 
                         dispersion.function = LogVMR )
length( pbmc@var.genes) 
# 默認(rèn)值是:x.low.cutoff = 0.1, x.high.cutoff = 8, y.cutoff = 1,就是說取log后的平均表達(dá)量(x軸)介于0.1-8之間的;分散程度(y軸,即標(biāo)準(zhǔn)差)至少為1的
  • V3計(jì)算mean.functionFastLogVMR均采用了加速的FastExpMean、FastLogVMR模式
  • V3橫坐標(biāo)范圍設(shè)定參數(shù)改成:mean.cutoff,整合了原來V2的x.low.cutoff + x.high.cutoff;縱坐標(biāo)改成:dispersion.cutoff ,替代了原來V2的y.cutoff
  • V3默認(rèn)選擇2000個差異基因,檢查方法也不同(V3用VariableFeatures(sce)檢查,V2用sce@var.genes檢查)

2. Monocle

參考:https://cole-trapnell-lab.github.io/monocle-release/docs/#clustering-cells

# V2
cds <- estimateSizeFactors(cds)
cds <- estimateDispersions(cds)
disp_table <- dispersionTable(cds) 
> head(disp_table)
     gene_id mean_expression dispersion_fit dispersion_empirical
1 AL669831.5     0.011673004      42.669671             0.000000
2      NOC2L     0.140316168       5.221419             1.696712
3    PLEKHN1     0.016292004      31.089206             0.000000
4 AL645608.8     0.009537725      51.814220             0.000000
5       HES4     0.265523990       3.619078            28.119205
6      ISG15     0.793811626       2.424032            11.047583

unsup_clustering_genes <- subset(disp_table, mean_expression >= 0.1) 
cds <- setOrderingFilter(cds, unsup_clustering_genes$gene_id) 
plot_ordering_genes(cds) 

3. scran

參考:https://bioconductor.riken.jp/packages/3.7/workflows/vignettes/simpleSingleCell/inst/doc/work-5-mnn.html

利用函數(shù)trendVar()decomposeVar() 根據(jù)表達(dá)量計(jì)算highly variable genes (HVGs)

fit <- trendVar(sce, parametric=TRUE) 
dec <- decomposeVar(sce, fit)

如果有不感興趣的差異來源(例如plate、donor),為了確保后面鑒定HVGs過程不會擴(kuò)大真實(shí)的數(shù)據(jù)偏差,可以設(shè)置block

block <- paste0(sce$Plate, "_", sce$Donor)
fit <- trendVar(sce,block=block, parametric=TRUE) 
dec <- decomposeVar(sce, fit)

最后作圖

plot(dec$mean, dec$total, xlab="Mean log-expression", 
    ylab="Variance of log-expression", pch=16)
OBis.spike <- isSpike(sce)
points(dec$mean[is.spike], dec$total[is.spike], col="red", pch=16)
curve(fit$trend(x), col="dodgerblue", add=TRUE)

decomposeVar函數(shù)幫助文檔中有一句描述:Highly variable genes (HVGs) can be identified as those with large biological components. The biological component is determined by subtracting the technical component from the total variance.

HVGs能夠代表大部分的生物學(xué)差異,而這種差異是由總體差異減去技術(shù)因素差異得到的

dec$Symbol <- rowData(dec)$Symbol
dec <- dec[order(dec$bio, decreasing=TRUE),]

> head(dec,2)
DataFrame with 2 rows and 7 columns
                            mean            total              bio
                       <numeric>        <numeric>        <numeric>
ENSG00000254647 2.83712754306791 6.30184692631371 5.85904290864641
ENSG00000129965 1.88188510741958 5.96360144483475  5.5152391307155
                             tech   p.value       FDR      Symbol
                        <numeric> <numeric> <numeric> <character>
ENSG00000254647 0.442804017667299         0         0         INS
ENSG00000129965 0.448362314119254         0         0    INS-IGF2

4. M3Drop

Brennecke et al. (2013) Accounting for technical noise in single-cell RNA-seq experiments. Nature Methods 10, 1093-1095. doi:10.1038/nmeth.2645

library(M3DExampleData)
# 需要提供表達(dá)矩陣(expr_mat)=》normalized or raw (not log-transformed) 
HVG <- BrenneckeGetVariableGenes(expr_mat=M3DExampleData, spikes=NA, suppress.plot=FALSE, fdr=0.1, minBiolDisp=0.5, fitMeanQuantile=0.8)
HVG_spike <- BrenneckeGetVariableGenes(Mmus_example_list$data, spikes=5550:5600)

5. 自定義函數(shù)

Extract genes with a squared coefficient of variation >2 times the fit regression (Brennecke et al 2013 method)
實(shí)現(xiàn)了:Select the highly variable genes based on the squared coefficient of variation and the mean gene expression and return the RPKM matrix the the HVG

getMostVarGenes <- function(
  data=data,                # RPKM matrix
  fitThr=1.5,           # Threshold above the fit to select the HGV
  minMeanForFit=1           # Minimum mean gene expression level
){
  # data=females;fitThr=2;minMeanForFit=1   
  # Remove genes expressed in no cells
  data_no0 <- as.matrix(
    data[rowSums(data)>0,]
  )
  # Compute the mean expression of each genes
  meanGeneExp <- rowMeans(data_no0)
  names(meanGeneExp)<- rownames(data_no0)
  
  # Compute the squared coefficient of variation
  varGenes <- rowVars(data_no0)
  cv2 <- varGenes / meanGeneExp^2
  
  # Select the genes which the mean expression is above the expression threshold minMeanForFit
  useForFit <- meanGeneExp >= minMeanForFit
  
  # Compute the model of the CV2 as a function of the mean expression using GLMGAM
  fit <- glmgam.fit( cbind( a0 = 1, 
                            a1tilde = 1/meanGeneExp[useForFit] ), 
                     cv2[useForFit] )
  a0 <- unname( fit$coefficients["a0"] )
  a1 <- unname( fit$coefficients["a1tilde"])
  
  # Get the highly variable gene counts and names
  fit_genes <- names(meanGeneExp[useForFit])
  cv2_fit_genes <- cv2[useForFit]
  fitModel <- fit$fitted.values
  names(fitModel) <- fit_genes
  HVGenes <- fitModel[cv2_fit_genes>fitModel*fitThr]
  print(length(HVGenes))
  
  # Plot the result
  plot_meanGeneExp <- log10(meanGeneExp)
  plot_cv2 <- log10(cv2)
  plotData <-  data.frame(
    x=plot_meanGeneExp[useForFit],
    y=plot_cv2[useForFit],
    fit=log10(fit$fitted.values),
    HVGenes=log10((fit$fitted.values*fitThr))
  )
  p <- ggplot(plotData, aes(x,y)) +
    geom_point(size=0.1) +
    geom_line(aes(y=fit), color="red") +
    geom_line(aes(y=HVGenes), color="blue") +
    theme_bw() +
    labs(x = "Mean expression (log10)", y="CV2 (log10)")+
    ggtitle(paste(length(HVGenes), " selected genes", sep="")) +
    theme(
      axis.text=element_text(size=16),
      axis.title=element_text(size=16),
      legend.text = element_text(size =16),
      legend.title = element_text(size =16 ,face="bold"),
      legend.position= "none",
      plot.title = element_text(size=18, face="bold", hjust = 0.5),
      aspect.ratio=1
    )+
    scale_color_manual(
      values=c("#595959","#5a9ca9")
    )
  print(p)
  
  # Return the RPKM matrix containing only the HVG
  HVG <- data_no0[rownames(data_no0) %in% names(HVGenes),]
  return(HVG)
}

P.S.

參考:https://bioconductor.org/packages/release/workflows/vignettes/simpleSingleCell/inst/doc/var.html
https://bioconductor.org/packages/devel/bioc/vignettes/scran/inst/doc/scran.html

還有其他的一些函數(shù):

  • the coefficient of variation, using the technicalCV2() function (Brennecke et al. 2013) or the DM() function (Kim et al. 2015) in scran, which quantify expression variance based on the coefficient of variation of the (normalized) counts. 另外還有technicalCV2()的升級版improvedCV2()
  • the dispersion parameter in the negative binomial distribution, using the estimateDisp() function in edgeR (McCarthy, Chen, and Smyth 2012).
  • a proportion of total variability, using methods in the BASiCS package (Vallejos, Marioni, and Richardson 2015).

歡迎關(guān)注我們的公眾號~_~  
我們是兩個農(nóng)轉(zhuǎn)生信的小碩,打造生信星球,想讓它成為一個不拽術(shù)語、通俗易懂的生信知識平臺。需要幫助或提出意見請后臺留言或發(fā)送郵件到jieandze1314@gmail.com

Welcome to our bioinfoplanet!

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容