一、FindAllMarkers()函數(shù)報錯
最近事情太多,好久沒有寫簡書,斷更很久,慚愧。
前幾天,將服務(wù)器的seurat包升級到4.1.0版本,在跑單細(xì)胞轉(zhuǎn)錄組流程的時候發(fā)現(xiàn)在FindAllMarkers步驟報錯,想細(xì)究下原因。
前面的流程是我使用SCTransform方法對單細(xì)胞數(shù)據(jù)進(jìn)行標(biāo)準(zhǔn)化;
代碼如下:
DefaultAssay(seurat_obj) <- "SCT"
Idents(seurat_obj) <- "cluster"
markers_all <- FindAllMarkers(object = seurat_obj, min.pct = 0.25, logfc.threshold = 0.25, only.pos=T)
報錯的截圖:

提示的信息是我需要在執(zhí)行FindMarkers()函數(shù)前,先運(yùn)行PrepSCTFindMarkers()函數(shù)。
二、PrepSCTFindMarkers()函數(shù)的用途
去年,對于SCTransform歸一化處理后的單細(xì)胞數(shù)據(jù)如何進(jìn)行差異基因表達(dá)分析,我就很糾結(jié),之前寫了一篇隨筆《sctransform預(yù)處理后,如何進(jìn)行差異表達(dá)分析》。如今,在2022年1月14日,seurat終于對SCTransform歸一化處理后如何合理進(jìn)行差異表達(dá)分析,官宣了PrepSCTFindMarkers()函數(shù)。
seurat團(tuán)隊?wèi)?yīng)該是慎之又慎,才宣布此函數(shù),很想看看它具體做了一些什么操作。
查了下PrepSCTFindMarkers()的說明
該函數(shù)的說明文檔:
https://rdrr.io/cran/Seurat/man/PrepSCTFindMarkers.html
https://cran.r-project.org/web/packages/Seurat/Seurat.pdf
函數(shù)用途:
Prepare object to run differential expression on SCT assay with multiple models
詳細(xì)說明:
Given a merged object with multiple SCT models, this function uses minimum of the median UMI (calculated using the raw UMI counts) of individual objects to reverse the individual SCT regression model using minimum of median UMI as the sequencing depth covariate. The counts slot of the SCT assay is replaced with recorrected counts and the data slot is replaced with log1p of recorrected counts.
大致翻譯一下:
對多樣本的單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)整合后生成的seurat對象,提前是已經(jīng)進(jìn)行了SCTransform歸一化處理。針對SCT模式下,進(jìn)行差異表達(dá)分析,需要對seurat對象進(jìn)行預(yù)處理,修正 SCT 模式下的counts矩陣和data矩陣。該函數(shù)計算每個樣本的UMI中位數(shù)(使用原始 UMI 計數(shù)計算),取其最小值,然后應(yīng)用于每個樣本的SCT回歸模型,使用UMI 中位數(shù)的最小值作為測序深度協(xié)變量。SCT模式下的counts矩陣被重新校正的counts值所取代,同時,data矩陣也被重新校正counts的 log1p 替換。
換句話說,用SCT模式的data矩陣做差異表達(dá)分析,不是很妥,需要預(yù)先用PrepSCTFindMarkers()函數(shù)校正,校正后的data矩陣,更適合做下游的差異基因表達(dá)分析。
函數(shù)用法:
PrepSCTFindMarkers(object, assay = "SCT", verbose = TRUE)
那為什么用原先的SCT模式的data做差異表達(dá)分析不行?
作者在曾在seurat的discussions部分中也給出了一段說明:
We had anticipated extending Seurat to actively support DE using the pearson residuals of sctransform, but have decided not to do so. In some cases, Pearson residuals may not be directly comparable across different datasets, particularly if there are batch effects that are unrelated to sequencing depth. While it is possible to correct these differences using the SCTransform-based integration workflow for the purposes of visualization/clustering/etc., we do not recommend running differential expression directly on Pearson residuals. Instead, we recommend running DE on the standard RNA assay.
我們曾期望擴(kuò)展Seurat的功能以支持使用sctransform的pearson殘差進(jìn)行DE分析,但已決定不這樣做。在某些情況下,皮爾遜殘差可能無法在不同的數(shù)據(jù)集之間直接進(jìn)行比較,尤其是當(dāng)存在與測序深度無關(guān)的批次效應(yīng)時。雖然為了可視化/聚類等目的,可以使用基于SCTransform的集成工作流來糾正這些差異,但我們不建議直接在Pearson殘差上運(yùn)行差分表達(dá)式。相反,我們建議在標(biāo)準(zhǔn)RNA分析中運(yùn)行DE分析。
我們可以這樣理解,Satija團(tuán)隊本來期望用sctransform的皮爾遜殘差進(jìn)行DE分析,但是他們發(fā)現(xiàn),Pearson殘差可能無法在不同數(shù)據(jù)集之間直接進(jìn)行比較,決定不這么做?;赟CTransform 的分析工作流可以剔除實(shí)驗(yàn)差異以實(shí)現(xiàn)可基因可視化和聚類等功能。但是他們建議在標(biāo)準(zhǔn)RNA assay中運(yùn)行DE分析。
給出的案例是:
針對SCTransform歸一化流程,在FindMarkers之前需要運(yùn)行PrepSCTFindMarkers()函數(shù)。
data("pbmc_small")
pbmc_small1 <- SCTransform(object = pbmc_small, variable.features.n = 20)
pbmc_small2 <- SCTransform(object = pbmc_small, variable.features.n = 20)
pbmc_merged <- merge(x = pbmc_small1, y = pbmc_small2)
pbmc_merged <- PrepSCTFindMarkers(object = pbmc_merged)
markers <- FindMarkers(
object = pbmc_merged,
ident.1 = "0",
ident.2 = "1",
assay = "SCT"
)
pbmc_subset <- subset(pbmc_merged, idents = c("0", "1"))
markers_subset <- FindMarkers(
object = pbmc_subset,
ident.1 = "0",
ident.2 = "1",
assay = "SCT",
recorrect_umi = FALSE
)
三、PrepSCTFindMarkers()函數(shù)解析
library(Seurat)
data("pbmc_small")
pbmc_small1 <- SCTransform(object = pbmc_small, variable.features.n = 20)
pbmc_small2 <- SCTransform(object = pbmc_small, variable.features.n = 20)
pbmc_merged <- merge(x = pbmc_small1, y = pbmc_small2)
pbmc_merged
# An object of class Seurat
# 450 features across 160 samples within 2 assays
# Active assay: SCT (220 features, 0 variable features)
# 1 other assay present: RNA
下面我們進(jìn)入到PrepSCTFindMarkers()函數(shù)中,代碼較長,我們分解一下:
step1:判斷seurat對象的SCT模型的數(shù)目,如果只有一個樣本,跳出此函數(shù),不需要進(jìn)行SCT模型的counts和data數(shù)據(jù)校正;
在我們的案例中,有2個樣本,即兩個SCR模型。
if (length(x = levels(x = object[[assay]])) == 1) {
if (verbose) {
message("Only one SCT model is stored - skipping recalculating corrected counts")
}
return(object)
}
step2:此處調(diào)用了SCTResults函數(shù),計算每個SCT模型的UMI中位數(shù)。先不細(xì)究該函數(shù)。
observed_median_umis <- lapply(
X = SCTResults(object = object[[assay]], slot = "cell.attributes"),
FUN = function(x) median(x[, "umi"])
)
observed_median_umis
# $model1
# [1] 180
#
# $model1.1
# [1] 180
step3:計算min_median_umi
model.list <- slot(object = object[[assay]], name = "SCTModel.list")
model.list
# $model1
# An sctransform model.
# Model formula: y ~ log_umi
# Parameters stored for 220 features, 80 cells
# $model1.1
# An sctransform model.
# Model formula: y ~ log_umi
# Parameters stored for 220 features, 80 cells
median_umi.status <- lapply(X = model.list,
FUN = function(x) { return(tryCatch(
expr = slot(object = x, name = 'median_umi'),
error = function(...) {return(NULL)})
)})
# median_umi.status
# $model1
# [1] 180
#
# $model1.1
# [1] 180
if (any(is.null(x = unlist(x = median_umi.status)))){
# For old SCT objects median_umi is set to median umi as calculated from obserbed UMIs
slot(object = object[[assay]], name = "SCTModel.list") <- lapply(X = model.list,
FUN = UpdateSlots)
SCTResults(object = object[[assay]], slot = "median_umi") <- observed_median_umis
}
model_median_umis <- SCTResults(object = object[[assay]], slot = "median_umi")
min_median_umi <- min(unlist(x = observed_median_umis))
step4:生成校正的corrected_counts矩陣
step5:對corrected_counts取log1p,生成校正的corrected_data矩陣
時間太趕,先大致記錄下,后面再補(bǔ)充。