一、seurat執(zhí)行差異表達(dá)分析
根據(jù)單細(xì)胞數(shù)據(jù)預(yù)處理的方式不同(lognormalize和sctransform),執(zhí)行差異表達(dá)分析代碼有所不同,這個(gè)需要注意。我當(dāng)時(shí)請(qǐng)教過(guò)生信分析同行,也跟人交流過(guò),再此記錄一下。
1.1 lognormalize預(yù)處理
seurat官網(wǎng)給出的標(biāo)準(zhǔn)分析流程,通過(guò)lognormalize預(yù)處理,歸一化的數(shù)據(jù)儲(chǔ)存在seurat_obj[['RNA']]@data,我們用歸一化后的data數(shù)據(jù)進(jìn)行差異基因分析。對(duì)于多樣本整合分析,同樣是在RNA assay上做差異表達(dá)分析。
# Normalizing the data
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
# Identification of highly variable features
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Scaling the data
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
pbmc <- RunUMAP(pbmc, dims = 1:10)
# find markers for every cluster compared to all remaining cells, report only the positive
DefaultAssay(pbmc) <- "RNA"
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>%
group_by(cluster) %>%
slice_max(n = 2, order_by = avg_log2FC)
1.2 sctransform預(yù)處理
在前面sctransform的文章中,通過(guò)sctransform預(yù)處理,標(biāo)準(zhǔn)化的結(jié)果保存在一個(gè)新的assay中(默認(rèn)命名為 SCT),包含:
- seurat_obj[['SCT']]@counts:矯正后的counts;
- seurat_obj[['SCT']]@data:矯正后的counts通過(guò) log1p(counts)處理后的data;
- seurat_obj[['SCT']]@scale.data:pearson殘差;
問(wèn)題:通過(guò)sctransform預(yù)處理,我們究竟怎么選擇assay進(jìn)行差異基因分析?
如果依舊應(yīng)用seurat_obj[['RNA']]@data進(jìn)行分析,顯然行不通,在seurat官網(wǎng)sctransform教程中沒(méi)有明確的指導(dǎo),這也是讓我當(dāng)時(shí)感到很困惑的一點(diǎn)。
在sctransform的教材中,作者提到
You can use the corrected log-normalized counts for differential expression and integration. However, in principle, it would be most optimal to perform these calculations directly on the residuals (stored in the scale.data slot) themselves. This is not currently supported in Seurat v3, but will be soon.
我們可以這樣理解,校正的log-normalized歸一化計(jì)數(shù)(seurat_obj[['SCT']]@data)可以用于差異表達(dá)分析。但是,原則上,最好直接對(duì)殘差(seurat_obj[['SCT']]@scale.data)本身執(zhí)行這些計(jì)算。目前,Seurat v3 不支持此功能,但很快就會(huì)支持。
但是在Seurat v4,我們依舊沒(méi)有等到seurat官網(wǎng)對(duì)此分析流程的官宣。
但這也說(shuō)明了seurat_obj[['SCT']]@data是可行的。
DefaultAssay(seurat_obj) <- "SCT"
all.markers.SCT <- FindAllMarkers(seurat_obj, only.pos = T, min.pct = 0.5, logfc.threshold = 0.5)
作者在discussions中也給出了一段說(shuō)明:
We had anticipated extending Seurat to actively support DE using the pearson residuals of sctransform, but have decided not to do so. In some cases, Pearson residuals may not be directly comparable across different datasets, particularly if there are batch effects that are unrelated to sequencing depth. While it is possible to correct these differences using the SCTransform-based integration workflow for the purposes of visualization/clustering/etc., we do not recommend running differential expression directly on Pearson residuals. Instead, we recommend running DE on the standard RNA assay.
我們?cè)谕褂胹ctransform的皮爾遜殘差擴(kuò)展 Seurat 以積極支持DE,但我們決定不這樣做。 在某些情況下,Pearson 殘差可能無(wú)法在不同數(shù)據(jù)集之間直接進(jìn)行比較,尤其是當(dāng)存在與測(cè)序深度無(wú)關(guān)的批次效應(yīng)時(shí)。 雖然可以使用基于SCTransform 的分析工作流來(lái)矯正這些實(shí)驗(yàn)差異以實(shí)現(xiàn)可視化/聚類(lèi)等目的,但我們不建議直接在Pearson殘差上運(yùn)行差異表達(dá)分享。 相反,我們建議在標(biāo)準(zhǔn)RNA assay中運(yùn)行DE分析。
我們可以這樣理解,Satija團(tuán)隊(duì)本來(lái)期望用sctransform的皮爾遜殘差進(jìn)行DE分析,但是他們發(fā)現(xiàn),Pearson殘差可能無(wú)法在不同數(shù)據(jù)集之間直接進(jìn)行比較,決定不這么做。基于SCTransform 的分析工作流可以剔除實(shí)驗(yàn)差異以實(shí)現(xiàn)可基因可視化和聚類(lèi)等功能。但是他們建議在標(biāo)準(zhǔn)RNA assay中運(yùn)行DE分析。
我們用一個(gè)具體的案例來(lái)實(shí)踐一下。
step1:下載一個(gè)10X單細(xì)胞數(shù)據(jù),進(jìn)行基本的數(shù)據(jù)質(zhì)控;
library(Seurat)
library(ggplot2)
library(SingleR)
library(dplyr)
library(celldex)
library(RColorBrewer)
library(SingleCellExperiment)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# part1: Basic quality control and filtering
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
adj.matrix <- Read10X("data/update/soupX_pbmc10k_filt")
srat <- CreateSeuratObject(adj.matrix,project = "pbmc10k")
head(meta)
# orig.ident nCount_RNA nFeature_RNA
# AAACCCACATAACTCG-1 pbmc10k 22196 4734
# AAACCCACATGTAACC-1 pbmc10k 7630 2191
# AAACCCAGTGAGTCAG-1 pbmc10k 21358 4246
# AAACCCAGTGCTTATG-1 pbmc10k 857 342
# AAACGAACAGTCAGTT-1 pbmc10k 15007 4075
# AAACGAACATTCGGGC-1 pbmc10k 9855 2285
srat[["percent.mt"]] <- PercentageFeatureSet(srat, pattern = "^MT-")
srat[["percent.rb"]] <- PercentageFeatureSet(srat, pattern = "^RP[SL]")
# 去除雙細(xì)胞
doublets <- read.table("data/update/scrublet_calls.tsv",header = F,row.names = 1)
colnames(doublets) <- c("Doublet_score","Is_doublet")
srat <- AddMetaData(srat,doublets)
head(srat[[]])
srat[['QC']] <- ifelse(srat@meta.data$Is_doublet == 'True','Doublet','Pass')
srat[['QC']] <- ifelse(srat@meta.data$nFeature_RNA < 500 & srat@meta.data$QC == 'Pass','Low_nFeature',srat@meta.data$QC)
srat[['QC']] <- ifelse(srat@meta.data$nFeature_RNA < 500 & srat@meta.data$QC != 'Pass' & srat@meta.data$QC != 'Low_nFeature',paste('Low_nFeature',srat@meta.data$QC,sep = ','),srat@meta.data$QC)
srat[['QC']] <- ifelse(srat@meta.data$percent.mt > 15 & srat@meta.data$QC == 'Pass','High_MT',srat@meta.data$QC)
srat[['QC']] <- ifelse(srat@meta.data$nFeature_RNA < 500 & srat@meta.data$QC != 'Pass' & srat@meta.data$QC != 'High_MT',paste('High_MT',srat@meta.data$QC,sep = ','),srat@meta.data$QC)
table(srat[['QC']])
# Doublet High_MT High_MT,Low_nFeature High_MT,Low_nFeature,Doublet Pass
# 546 548 275 1 8824
step2:數(shù)據(jù)進(jìn)行SCTransform預(yù)處理
srat <- subset(srat, subset = QC == 'Pass')
dim(srat)
# [1] 36601 8824
srat <- SCTransform(srat, method = "glmGamPoi", ncells = 8824,
vars.to.regress = c("percent.mt"), verbose = F)
srat
接下來(lái),我們比較下,RNA assay和SCT assay矩陣數(shù)據(jù)的一致性。
# RNA/SCT counts
gene_exp_matrix_RNA_counts <- srat@assays$RNA@counts
gene_exp_matrix_RNA_counts[1:20,1:20]
gene_exp_matrix_SCT_counts <- srat@assays$SCT@counts
gene_exp_matrix_SCT_counts[1:20,1:20]
testthat::expect_equal(gene_exp_matrix_RNA_counts,gene_exp_matrix_SCT_counts)
# Error: `gene_exp_matrix_RNA_counts` not equal to `gene_exp_matrix_SCT_counts`
# RNA/SCT data
gene_exp_matrix_RNA_data <- srat@assays$RNA@data
gene_exp_matrix_RNA_data[1:20,1:20]
gene_exp_matrix_SCT_data <- srat@assays$SCT@data
gene_exp_matrix_SCT_data[1:20,1:20]
testthat::expect_equal(gene_exp_matrix_RNA_data,gene_exp_matrix_SCT_data)
# Error: `gene_exp_matrix_RNA_data` not equal to `gene_exp_matrix_SCT_data`.
testthat::expect_equal(gene_exp_matrix_RNA_counts,gene_exp_matrix_RNA_data)
RNA assay中的data和SCT assay中的data矩陣數(shù)據(jù)不相同,但是RNA assay中的data等于RNA assay中的counts,也就是說(shuō)通過(guò)SCTransform預(yù)處理,seurat_obj[['RNA']]@data為原始矩陣數(shù)據(jù),是不能進(jìn)行下游分析的。

step3:完成后面的降維聚類(lèi)
# clustering
srat <- RunPCA(srat, verbose = F)
srat <- RunUMAP(srat, dims = 1:30, verbose = F)
srat <- FindNeighbors(srat, dims = 1:30, verbose = F)
srat <- FindClusters(srat, verbose = F)
table(srat[[]]$seurat_clusters)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# 1720 1503 1214 1090 955 326 324 315 271 250 246 239 135 104 97 35
step4:進(jìn)行差異基因分析
推薦使用在RNA assay上做差異表達(dá)分析,而不是 SCTransform。
切換到RNA assay,對(duì)原始的基因counts矩陣進(jìn)行NormalizeData和ScaleData,獲取歸一化的seurat_obj[['RNA']]@data后,進(jìn)行差異基因分析。
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# part3: Differential expression and marker selection
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
DefaultAssay(srat) <- "RNA"
srat <- NormalizeData(srat)
srat <- FindVariableFeatures(srat, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(srat)
srat <- ScaleData(srat, features = all.genes)
all.markers.RNA <- FindAllMarkers(srat, only.pos = T, min.pct = 0.5, logfc.threshold = 0.5)
dim(all.markers.RNA)
# [1] 4026 7
table(all.markers.RNA$cluster)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# 541 136 129 75 531 448 65 182 261 155 139 142 250 524 121 327
另外,我們使用SCT assay進(jìn)行差異表達(dá)分析,與上面的結(jié)果做對(duì)比。
DefaultAssay(srat) <- "SCT"
all.markers.SCT <- FindAllMarkers(srat, only.pos = T, min.pct = 0.5, logfc.threshold = 0.5)
dim(all.markers.SCT)
# [1] 3458 7
table(all.markers.SCT$cluster)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# 464 118 127 65 453 398 50 151 228 134 122 138 209 459 101 241
接下來(lái),我們可視化每個(gè)cluster的marker基因差異。
gene_df <- data.frame()
for (i in levels(all.markers.RNA$cluster)){
share_num <- length(intersect(x=all.markers.RNA[all.markers.RNA$cluster==i,]$gene, y = all.markers.SCT[all.markers.SCT$cluster==i,]$gene))
rna_num <- length(setdiff(x=all.markers.RNA[all.markers.RNA$cluster==i,]$gene, y = all.markers.SCT[all.markers.SCT$cluster==i,]$gene))
sct_num <- length(setdiff(x=all.markers.SCT[all.markers.SCT$cluster==i,]$gene, y = all.markers.RNA[all.markers.RNA$cluster==i,]$gene))
gene_df <- rbind(gene_df, data.frame("cluster"=i, "share_Gene"=share_num, "RNA_Gene" = rna_num, "SCT_Gene"=sct_num))
}
gene_data <- reshape2::melt(gene_df,id.vars=c("cluster"), variable.name="Type",value.name="gene_num")
head(gene_data)
# cluster Type gene_num
# 1 0 share_Gene 464
# 2 1 share_Gene 118
# 3 2 share_Gene 125
# 4 3 share_Gene 65
# 5 4 share_Gene 451
# 6 5 share_Gene 393
gene_data$cluster <- factor(gene_data$cluster,levels = order(as.numeric(unique(gene_data$cluster))))
ggplot(data=gene_data, aes(x=cluster, y=gene_num, fill=Type))+ geom_bar(stat="identity") + theme_bw()
用SCT assay查找的差異基因記錄為all.markers.SCT,用RNA assay查找的差異基因記錄為all.markers.RNA,對(duì)每個(gè)cluster的差異基因進(jìn)行分類(lèi):"share_Gene"表示markers.SCT和markers.RNA共有的基因;"RNA_Gene"表示markers.RNA獨(dú)有的基因;"SCT_Gene"表示markers.SCT獨(dú)有的基因。從下圖可以看出,cluster0-15,紅色條塊的"share_Gene"占主要部分,"RNA_Gene"和"SCT_Gene"占小部分。這個(gè)例子說(shuō)明,用SCT assay查找的差異基因和用RNA assay查找的差異基因大致是相同的。

1.3 小結(jié)
問(wèn):sctransform預(yù)處理后,如何進(jìn)行差異表達(dá)分析?
答:雖然普遍認(rèn)為,SCTransform和RNA assay計(jì)算出的差異基因大致是一致的。
但是作者的答復(fù)是:推薦使用在RNA assay上做差異表達(dá)分析,而不是SCTransform。切記一定要進(jìn)行NormalizeData和ScaleData標(biāo)準(zhǔn)化處理。
也就是說(shuō),SCTransform預(yù)處理適合降維聚類(lèi),基因可視化;做差異表達(dá)分析,lognormalize預(yù)處理更適宜一些。
參考代碼鏈接:https://pan.baidu.com/s/1wJFhPQoKSKTjqt3Qhd4a7w
提取碼:1234