Seurat-多組數(shù)據(jù)整合分析標(biāo)準(zhǔn)工作流程Integration and Label Transfer

標(biāo)準(zhǔn)工作流程

在這個(gè)例子工作流中,我們演示了我們最近在論文中引入的兩種新方法,Comprehensive Integration of Single Cell Data:

  • 將多個(gè)不同的scna -seq數(shù)據(jù)集組裝到一個(gè)參考數(shù)據(jù)集中
  • 將細(xì)胞類型標(biāo)簽從參考數(shù)據(jù)集轉(zhuǎn)移到新的查詢數(shù)據(jù)集

在本例中,我們選擇了通過四種技術(shù)(CelSeq (GSE81076) CelSeq2 (GSE85241)、Fluidigm C1 (GSE86469)和SMART-Seq2 (E-MTAB-5061)生成的人類胰島細(xì)胞數(shù)據(jù)集。為了方便起見,我們通過SeuratData包分發(fā)這個(gè)數(shù)據(jù)集。

新方法的代碼在Seurat v3中實(shí)現(xiàn)。您可以使用install.packages從CRAN下載和安裝。

除了新的方法之外,Seurat v3還包含了許多旨在改進(jìn)Seurat對象和用戶交互的改進(jìn)。為了幫助用戶熟悉這些更改,我們?yōu)槌R娙蝿?wù)編寫了一個(gè)命令備忘單( command cheat sheet)。

載入包和數(shù)據(jù):

library(Seurat)
packageVersion("Seurat")

>[1] ‘3.1.0’

library(SeuratData)
#InstallData("pbmc3k")
#InstallData("panc8")
#install.packages("C:/Users/Administrator/Desktop/panc8.SeuratData_3.0.2.tar.gz", repos = NULL, type = "source")
#install.packages("C:/Users/Administrator/Desktop/pbmcsca.SeuratData_3.0.0.tar.gz", repos = NULL, type = "source")
> library(panc8.SeuratData)
> data("panc8")
> panc8
An object of class Seurat 
34363 features across 14890 samples within 1 assay 
Active assay: RNA (34363 features)
> summary(panc8)
Length  Class   Mode 
     1 Seurat     S4 
> citation('panc8.SeuratData')

在出版物中使用程序包時(shí)引用‘panc8.SeuratData’:

  Satija Lab (2019). panc8.SeuratData: Eight Pancreas
  Datasets Across Five Technologies. R package version
  3.0.2.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {panc8.SeuratData: Eight Pancreas Datasets Across Five Technologies},
    author = {Satija Lab},
    year = {2019},
    note = {R package version 3.0.2},
  }

Warning message:
In citation("panc8.SeuratData") :
  程序包‘panc8.SeuratData’里的DESCRIPTION文件中沒有日期這一域
> library(panc8.SeuratData)
> data("panc8")
> panc8
An object of class Seurat 
34363 features across 14890 samples within 1 assay 
Active assay: RNA (34363 features)
> summary(panc8)
Length  Class   Mode 
     1 Seurat     S4 
> citation('panc8.SeuratData')

在出版物中使用程序包時(shí)引用‘panc8.SeuratData’:

  Satija Lab (2019). panc8.SeuratData: Eight Pancreas
  Datasets Across Five Technologies. R package version
  3.0.2.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {panc8.SeuratData: Eight Pancreas Datasets Across Five Technologies},
    author = {Satija Lab},
    year = {2019},
    note = {R package version 3.0.2},
  }

Warning message:
In citation("panc8.SeuratData") :
  程序包‘panc8.SeuratData’里的DESCRIPTION文件中沒有日期這一域

要構(gòu)造參考數(shù)據(jù)集,我們將在各個(gè)數(shù)據(jù)集之間標(biāo)識(shí)“錨”。首先,我們將組合的對象拆分為一個(gè)列表,每個(gè)數(shù)據(jù)集作為一個(gè)元素。

> pancreas.list <- SplitObject(panc8, split.by = "tech")
> summary(pancreas.list)
           Length Class  Mode
celseq     1      Seurat S4  
celseq2    1      Seurat S4  
smartseq2  1      Seurat S4  
fluidigmc1 1      Seurat S4  
indrop     1      Seurat S4  

在找到錨之前,我們執(zhí)行標(biāo)準(zhǔn)的預(yù)處理(log-normalization),并為每個(gè)錨分別識(shí)別變量特性。注意,Seurat v3實(shí)現(xiàn)了一種改進(jìn)的基于方差穩(wěn)定轉(zhuǎn)換(“vst”)的變量特征選擇方法。

pancreas.list <- pancreas.list[c("celseq", "celseq2", "fluidigmc1", "smartseq2")]
for (i in 1:length(pancreas.list)) {
  pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
  pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst",  nfeatures = 2000, verbose = FALSE)
}

整合3個(gè)胰島細(xì)胞數(shù)據(jù)集

接下來,我們使用findintegrationanchor函數(shù)來標(biāo)識(shí)錨,該函數(shù)接受Seurat對象的列表(list)作為輸入。在這里,我們將三個(gè)對象集成到一個(gè)參考數(shù)據(jù)集中(稍后我們將使用第四個(gè)對象)。

> reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2")]
> pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
Computing 2000 integration features
Scaling features for provided objects
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 02s
Finding all pairwise anchors
  |                                                  | 0 % ~calculating  Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 3514 anchors
Filtering anchors
    Retained 2761 anchors
Extracting within-dataset neighbors
  |+++++++++++++++++                                 | 33% ~25s          Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 3500 anchors
Filtering anchors
    Retained 2728 anchors
Extracting within-dataset neighbors
  |++++++++++++++++++++++++++++++++++                | 67% ~12s          Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 6174 anchors
Filtering anchors
    Retained 4561 anchors
Extracting within-dataset neighbors
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 49s

我們在這里使用所有默認(rèn)參數(shù)來標(biāo)識(shí)錨,包括數(shù)據(jù)集的“維數(shù)”(30;您可以嘗試在較大范圍內(nèi)更改此參數(shù),例如在10到50之間)。

然后我們將這些錨傳遞給IntegrateData函數(shù),該函數(shù)返回一個(gè)Seurat對象。
返回的對象將包含一個(gè)新的Assay,其中包含一個(gè)針對所有細(xì)胞的完整(或“批量校正”)表達(dá)矩陣,使它們能夠被聯(lián)合分析。

> pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)
Merging dataset 1 into 2
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
Merging dataset 3 into 2 1
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
> head(pancreas.integrated@meta.data)
        orig.ident nCount_RNA nFeature_RNA   tech replicate
D101_5        D101   4615.810         1986 celseq    celseq
D101_7        D101  29001.563         4209 celseq    celseq
D101_10       D101   6707.857         2408 celseq    celseq
D101_13       D101   8797.224         2964 celseq    celseq
D101_14       D101   5032.558         2264 celseq    celseq
D101_17       D101  13474.866         3982 celseq    celseq
        assigned_cluster celltype dataset
D101_5              <NA>    gamma  celseq
D101_7              <NA>   acinar  celseq
D101_10             <NA>    alpha  celseq
D101_13             <NA>    delta  celseq
D101_14             <NA>     beta  celseq
D101_17             <NA>   ductal  celseq

運(yùn)行IntegrateData后,Seurat對象將包含一個(gè)使用集成表達(dá)矩陣的Assay 。注意,原始值(未更正的值)仍然存儲(chǔ)在“RNA”測試中的對象中,因此您可以來回切換。
然后我們可以使用這個(gè)新的矩陣進(jìn)行下游分析和可視化。在這里,我們縮放集成的數(shù)據(jù),運(yùn)行PCA,并使用UMAP可視化結(jié)果。集成的數(shù)據(jù)集按細(xì)胞類型聚類,而不是按技術(shù)分群。

library(ggplot2)
library(cowplot)
# switch to integrated assay. The variable features of this assay are automatically
# set during IntegrateData
DefaultAssay(pancreas.integrated) <- "integrated"

# Run the standard workflow for visualization and clustering
?ScaleData
pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)

pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
20:55:25 Read 5683 rows and found 30 numeric columns
20:55:25 Using Annoy for neighbor search, n_neighbors = 30
20:55:26 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:55:28 Writing NN index file to temp file C:\Users\ADMINI~1\AppData\Local\Temp\Rtmpo5X81p\file2870430b3367
20:55:28 Searching Annoy index using 1 thread, search_k = 3000
20:55:30 Annoy recall = 100%
20:55:31 Commencing smooth kNN distance calibration using 1 thread
20:55:31 Found 2 connected components, falling back to 'spca' initialization with init_sdev = 1
20:55:31 Initializing from PCA
20:55:31 PCA: 2 components explained 44.22% variance
20:55:32 Commencing optimization for 500 epochs, with 252460 positive edges
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:55:59 Optimization finished
p1 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
p2 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, 
              repel = TRUE) + NoLegend()
plot_grid(p1, p2)
image.png

使用參考數(shù)據(jù)集對細(xì)胞類型進(jìn)行分類

Seurat v3還支持將參考數(shù)據(jù)(或元數(shù)據(jù))投影到查詢對象上。雖然許多方法是一致的(這兩個(gè)過程都是從識(shí)別錨開始的),但數(shù)據(jù)映射(data transfer)和整合之間有兩個(gè)重要的區(qū)別:

在數(shù)據(jù)映射中,Seurat不糾正或修改查詢表達(dá)式數(shù)據(jù)。
在數(shù)據(jù)映射中,Seurat有一個(gè)選項(xiàng)(默認(rèn)設(shè)置)來將參考的PCA結(jié)構(gòu)投射到查詢上,而不是學(xué)習(xí)與CCA的聯(lián)合結(jié)構(gòu)。我們通常建議在scna -seq數(shù)據(jù)集之間投影數(shù)據(jù)時(shí)使用此選項(xiàng)。

找到錨之后,我們使用TransferData函數(shù)根據(jù)參考數(shù)據(jù)(參考數(shù)據(jù)的細(xì)胞類型標(biāo)簽向量)對查詢數(shù)據(jù)集的細(xì)胞進(jìn)行分類。TransferData返回一個(gè)帶有預(yù)測id和預(yù)測分?jǐn)?shù)的矩陣,我們可以將其添加到查詢元數(shù)據(jù)中(query metadata)

> pancreas.query <- pancreas.list[["fluidigmc1"]]
> pancreas.anchors <- FindTransferAnchors(reference = pancreas.integrated, query = pancreas.query, 
+                                         dims = 1:30)
Performing PCA on the provided reference using 2000 features as input.
Projecting PCA
Finding neighborhoods
Finding anchors
    Found 953 anchors
Filtering anchors
    Retained 885 anchors
Extracting within-dataset neighbors
> predictions <- TransferData(anchorset = pancreas.anchors, refdata = pancreas.integrated$celltype, 
+                             dims = 1:30)
Finding integration vectors
Finding integration vector weights
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Predicting cell labels
> pancreas.query <- AddMetaData(pancreas.query, metadata = predictions)

因?yàn)槲覀冇衼碜酝暾煞治龅脑紭?biāo)簽注釋,所以我們可以評估預(yù)測的細(xì)胞類型注釋與完整引用的匹配程度。在這個(gè)例子中,我們發(fā)現(xiàn)在細(xì)胞類型分類上有很高的一致性,超過97%的細(xì)胞被正確標(biāo)記。

pancreas.query$prediction.match <- pancreas.query$predicted.id == pancreas.query$celltype
table(pancreas.query$prediction.match)

FALSE  TRUE 
   18   620 

為了進(jìn)一步驗(yàn)證這一點(diǎn),我們可以檢查特定胰島細(xì)胞群的一些典型細(xì)胞類型標(biāo)記(cell type markers)。請注意,即使這些細(xì)胞類型中的一些只由一兩個(gè)細(xì)胞(例如epsilon細(xì)胞),我們?nèi)匀荒軌蛘_地對它們進(jìn)行分類。

> table(pancreas.query$predicted.id)

            acinar activated_stellate              alpha               beta              delta 
                21                 17                248                258                 22 
            ductal        endothelial            epsilon              gamma         macrophage 
                33                 13                  1                 17                  1 
              mast            schwann 
                 2                  5 
> VlnPlot(pancreas.query, c("REG1A", "PPY", "SST", "GHRL", "VWF", "SOX10"), group.by = "predicted.id")

image.png

https://satijalab.org/seurat/v3.2/integration.html
http://www.itdecent.cn/p/d43f16bdfed9

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容