Seurat包學(xué)習(xí)筆記(二):Integration and Label Transfer

Seurat3引入了用于多個(gè)單細(xì)胞測(cè)序數(shù)據(jù)集進(jìn)行整合分析的新方法。這些方法可以對(duì)來(lái)自不同的個(gè)體、實(shí)驗(yàn)條件、測(cè)序技術(shù)甚至物種中收集來(lái)的數(shù)據(jù)進(jìn)行整合,旨在識(shí)別出不同數(shù)據(jù)集之間的共享細(xì)胞狀態(tài)(shared cell states)。
這些方法首先識(shí)別出不同數(shù)據(jù)集對(duì)之間的“錨(anchors)”,這些anchors代表了個(gè)體細(xì)胞之間成對(duì)的對(duì)應(yīng)關(guān)系(每個(gè)數(shù)據(jù)集中有一個(gè)),并假設(shè)它們?cè)醋韵嗤纳餇顟B(tài)。然后,再利用這些識(shí)別出的anchors用于協(xié)調(diào)不同的數(shù)據(jù)集,或者將信息從一個(gè)數(shù)據(jù)集傳輸?shù)搅硪粋€(gè)數(shù)據(jù)集。

image

一、標(biāo)準(zhǔn)工作流程進(jìn)行整合分析

在本例教程中,我們選擇了通過(guò)四種不同測(cè)序技術(shù)(CelSeq (GSE81076)、 CelSeq2 (GSE85241)、Fluidigm C1 (GSE86469)和SMART-Seq2 (E-MTAB-5061)生成的人類胰島細(xì)胞數(shù)據(jù)集,我們通過(guò)SeuratData包來(lái)加載這個(gè)數(shù)據(jù)集。

安裝并加載所需的R包

# 安裝并加載SeuratData包
devtools::install_github('satijalab/seurat-data')
library(SeuratData)
library(Seurat)
# 查看SeuratData包搜集的數(shù)據(jù)集
AvailableData()
                     Dataset Version                                                        Summary species            system ncells                                                            tech         notes Installed InstalledVersion
cbmc.SeuratData         cbmc   3.0.0                   scRNAseq and 13-antibody sequencing of CBMCs   human CBMC (cord blood)   8617                                                        CITE-seq          <NA>      TRUE            3.0.0
hcabm40k.SeuratData hcabm40k   3.0.0 40,000 Cells From the Human Cell Atlas ICA Bone Marrow Dataset   human       bone marrow  40000                                                          10x v2          <NA>     FALSE            3.0.0
ifnb.SeuratData         ifnb   3.0.0                              IFNB-Stimulated and Control PBMCs   human              PBMC  13999                                                          10x v1          <NA>      TRUE            3.0.0
panc8.SeuratData       panc8   3.0.0               Eight Pancreas Datasets Across Five Technologies   human Pancreatic Islets  14892                SMARTSeq2, Fluidigm C1, CelSeq, CelSeq2, inDrops          <NA>      TRUE            3.0.0
pbmc3k.SeuratData     pbmc3k   3.0.0                                     3k PBMCs from 10X Genomics   human              PBMC   2700                                                          10x v1          <NA>      TRUE            3.0.0
pbmcsca.SeuratData   pbmcsca   3.0.0           Broad Institute PBMC Systematic Comparative Analysis   human              PBMC  31021 10x v2, 10x v3, SMARTSeq2, Seq-Well, inDrops, Drop-seq, CelSeq2 HCA benchmark     FALSE            3.0.0
# 下載安裝SeuratData包收集的特定數(shù)據(jù)集
InstallData("panc8")
# 加載數(shù)據(jù)集
library(panc8.SeuratData)
data("panc8")
panc8
An object of class Seurat 
34363 features across 14890 samples within 1 assay 
Active assay: RNA (34363 features)

分割對(duì)象,構(gòu)建不同的數(shù)據(jù)集

head(panc8@meta.data)
        orig.ident nCount_RNA nFeature_RNA   tech replicate assigned_cluster
D101_5        D101   4615.810         1986 celseq    celseq             <NA>
D101_7        D101  29001.563         4209 celseq    celseq             <NA>
D101_10       D101   6707.857         2408 celseq    celseq             <NA>
D101_13       D101   8797.224         2964 celseq    celseq             <NA>
D101_14       D101   5032.558         2264 celseq    celseq             <NA>
D101_17       D101  13474.866         3982 celseq    celseq             <NA>
        celltype dataset
D101_5     gamma  celseq
D101_7    acinar  celseq
D101_10    alpha  celseq
D101_13    delta  celseq
D101_14     beta  celseq
D101_17   ductal  celseq
# 根據(jù)meta信息中不同的測(cè)序技術(shù)(tech)對(duì)Seurat對(duì)象進(jìn)行分割,構(gòu)建不同的數(shù)據(jù)集
pancreas.list <- SplitObject(panc8, split.by = "tech")
# 選擇出四種不同測(cè)序技術(shù)產(chǎn)生的數(shù)據(jù)
pancreas.list <- pancreas.list[c("celseq", "celseq2", "fluidigmc1", "smartseq2")]
pancreas.list
$celseq
An object of class Seurat 
34363 features across 1004 samples within 1 assay 
Active assay: RNA (34363 features, 0 variable features)

$celseq2
An object of class Seurat 
34363 features across 2285 samples within 1 assay 
Active assay: RNA (34363 features, 0 variable features)

$fluidigmc1
An object of class Seurat 
34363 features across 638 samples within 1 assay 
Active assay: RNA (34363 features, 0 variable features)

$smartseq2
An object of class Seurat 
34363 features across 2394 samples within 1 assay 
Active assay: RNA (34363 features, 0 variable features)

分別對(duì)每個(gè)數(shù)據(jù)集進(jìn)行標(biāo)準(zhǔn)的預(yù)處理

for (i in 1:length(pancreas.list)) {
    pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
    pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000, verbose = FALSE)
}

將不同的數(shù)據(jù)集進(jìn)行整合

首先使用FindIntegrationAnchors函數(shù)來(lái)識(shí)別anchors,該函數(shù)接受Seurat對(duì)象的列表(list)作為輸入,在這里我們將三個(gè)對(duì)象構(gòu)建成一個(gè)參考數(shù)據(jù)集。使用默認(rèn)參數(shù)來(lái)識(shí)別錨,如數(shù)據(jù)集的“維數(shù)”(30)

reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2")]
pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
Computing 2000 integration features
Scaling features for provided objects
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 02s
Finding all pairwise anchors
  |                                                  | 0 % ~calculating  Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 3514 anchors
Filtering anchors
    Retained 2761 anchors
Extracting within-dataset neighbors
  |+++++++++++++++++                                 | 33% ~25s          Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 3500 anchors
Filtering anchors
    Retained 2728 anchors
Extracting within-dataset neighbors
  |++++++++++++++++++++++++++++++++++                | 67% ~12s          Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 6174 anchors
Filtering anchors
    Retained 4561 anchors
Extracting within-dataset neighbors
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 49s

pancreas.anchors
An AnchorSet object containing 20100 anchors between 3 Seurat objects 
 This can be used as input to IntegrateData or TransferData.

然后將這些識(shí)別好的anchors傳遞給IntegrateData函數(shù),整合后的數(shù)據(jù)返回一個(gè)Seurat對(duì)象,該對(duì)象中將包含一個(gè)新的Assay(integrated),里面存儲(chǔ)了整合后表達(dá)矩陣,原始的表達(dá)矩陣存儲(chǔ)在RNA這個(gè)Assay中。

pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)
Merging dataset 1 into 2
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
Merging dataset 3 into 2 1
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data

pancreas.integrated
An object of class Seurat 
36363 features across 5683 samples within 2 assays 
Active assay: integrated (2000 features, 2000 variable features)
 1 other assay present: RNA

對(duì)整合后的數(shù)據(jù)集進(jìn)行常規(guī)的降維聚類可視化

library(ggplot2)
library(cowplot)
library(patchwork)
# switch to integrated assay. The variable features of this assay are automatically
# set during IntegrateData
DefaultAssay(pancreas.integrated) <- "integrated"

# Run the standard workflow for visualization and clustering
# 數(shù)據(jù)標(biāo)準(zhǔn)化
pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
# PCA降維
pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
DimPlot(pancreas.integrated, reduction = "pca", group.by = "tech")
image
# UMAP降維可視化
pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
19:12:41 UMAP embedding parameters a = 0.9922 b = 1.112
19:12:41 Read 5683 rows and found 30 numeric columns
19:12:41 Using Annoy for neighbor search, n_neighbors = 30
19:12:41 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
19:12:43 Writing NN index file to temp file /tmp/RtmpwSZSgF/file2786475d27897
19:12:43 Searching Annoy index using 1 thread, search_k = 3000
19:12:45 Annoy recall = 100%
19:12:46 Commencing smooth kNN distance calibration using 1 thread
19:12:49 Found 2 connected components, falling back to 'spca' initialization with init_sdev = 1
19:12:49 Initializing from PCA
19:12:49 PCA: 2 components explained 44.22% variance
19:12:49 Commencing optimization for 500 epochs, with 252460 positive edges
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
19:13:07 Optimization finished

# 使用group.by函數(shù)根據(jù)不同的條件進(jìn)行分群
p1 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
p2 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) + NoLegend()
p1 + p2
image
p3 <- DimPlot(pancreas.integrated, reduction = "umap", split.by = "tech")
p3
image

使用整合后的參考數(shù)據(jù)集對(duì)細(xì)胞類型進(jìn)行分類

Seurat3還支持將參考數(shù)據(jù)集(或元數(shù)據(jù))投影到查詢對(duì)象上。雖然許多方法是一致的(這兩個(gè)過(guò)程都是從識(shí)別錨開始的),但數(shù)據(jù)映射(data transfer)和數(shù)據(jù)整合(data integration)之間有兩個(gè)重要的區(qū)別:
1)In data transfer, Seurat does not correct or modify the query expression data.
2)In data transfer, Seurat has an option (set by default) to project the PCA structure of a reference onto the query, instead of learning a joint structure with CCA. We generally suggest using this option when projecting data between scRNA-seq datasets.

識(shí)別到anchors之后,我們使用TransferData函數(shù)根據(jù)參考數(shù)據(jù)集中細(xì)胞類型標(biāo)簽向量對(duì)查詢數(shù)據(jù)集的細(xì)胞進(jìn)行分類。TransferData函數(shù)返回一個(gè)帶有預(yù)測(cè)id和預(yù)測(cè)分?jǐn)?shù)的矩陣,我們可以將其添加到query metadata中。

# 構(gòu)建query數(shù)據(jù)集
pancreas.query <- pancreas.list[["fluidigmc1"]]
pancreas.query
An object of class Seurat 
34363 features across 638 samples within 1 assay 
Active assay: RNA (34363 features, 2000 variable features)

# 識(shí)別參考數(shù)據(jù)集的anchors
pancreas.anchors <- FindTransferAnchors(reference = pancreas.integrated, query = pancreas.query, dims = 1:30)
pancreas.anchors
An AnchorSet object containing 20100 anchors between 3 Seurat objects 
 This can be used as input to IntegrateData or TransferData.

# 將查詢數(shù)據(jù)集映射到參考數(shù)據(jù)集上
predictions <- TransferData(anchorset = pancreas.anchors, refdata = pancreas.integrated$celltype, dims = 1:30)
# 添加預(yù)測(cè)出的信息
pancreas.query <- AddMetaData(pancreas.query, metadata = predictions)

因?yàn)槲覀兙哂衼?lái)自整合后數(shù)據(jù)集中含有的原始注釋標(biāo)簽,所以我們可以評(píng)估預(yù)測(cè)的細(xì)胞類型注釋與完整參考的匹配程度。在此示例中,我們發(fā)現(xiàn)在細(xì)胞類型分類中具有很高的一致性,有超過(guò)97%的細(xì)胞被正確的標(biāo)記出。

pancreas.query$prediction.match <- pancreas.query$predicted.id == pancreas.query$celltype
table(pancreas.query$prediction.match)
## FALSE  TRUE 
##    18   620

為了進(jìn)一步驗(yàn)證這一點(diǎn),我們可以查看一些特定胰島細(xì)胞群中的典型細(xì)胞類型標(biāo)記基因(cell type markers)。

table(pancreas.query$predicted.id)
## 
##             acinar activated_stellate              alpha               beta 
##                 21                 17                248                258 
##              delta             ductal        endothelial            epsilon 
##                 22                 33                 13                  1 
##              gamma         macrophage               mast            schwann 
##                 17                  1                  2                  5

VlnPlot(pancreas.query, c("REG1A", "PPY", "SST", "GHRL", "VWF", "SOX10"), group.by = "predicted.id")
image

二、使用SCTransform方法進(jìn)行整合分析

Here, instead, we will harmonize the Pearson residuals that are output from SCTransform. As demonstrated below, the workflow consists of the following steps:

  • Create a list of Seurat objects to integrate
  • Perform SCTransform normalization separately for each dataset
  • Run the PrepSCTIntegration function on the object list
  • Integrate datasets, and proceed with joint analysis
    首先,構(gòu)建Seurat對(duì)象列表,并分別對(duì)每個(gè)對(duì)象運(yùn)行SCTransform進(jìn)行數(shù)據(jù)標(biāo)準(zhǔn)化:
library(Seurat)
library(ggplot2)
library(patchwork)
options(future.globals.maxSize = 4000 * 1024^2)
data("panc8")
pancreas.list <- SplitObject(panc8, split.by = "tech")
pancreas.list <- pancreas.list[c("celseq", "celseq2", "fluidigmc1", "smartseq2")]
for (i in 1:length(pancreas.list)) {
    pancreas.list[[i]] <- SCTransform(pancreas.list[[i]], verbose = FALSE)
}

接下來(lái),選擇用于數(shù)據(jù)整合的一些features,并運(yùn)行PrepSCTIntegration

pancreas.features <- SelectIntegrationFeatures(object.list = pancreas.list, nfeatures = 3000)
pancreas.list <- PrepSCTIntegration(object.list = pancreas.list, anchor.features = pancreas.features, verbose = FALSE)

然后使用FindIntegrationAnchors識(shí)別anchors,并運(yùn)行IntegrateData進(jìn)行數(shù)據(jù)集的整合,確保設(shè)置了normalization.method = 'SCT'。

pancreas.anchors <- FindIntegrationAnchors(object.list = pancreas.list, normalization.method = "SCT", anchor.features = pancreas.features, verbose = FALSE)
pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, normalization.method = "SCT", verbose = FALSE)

對(duì)整合后的數(shù)據(jù)進(jìn)行下游的降維可視化

pancreas.integrated <- RunPCA(pancreas.integrated, verbose = FALSE)
pancreas.integrated <- RunUMAP(pancreas.integrated, dims = 1:30)
plots <- DimPlot(pancreas.integrated, group.by = c("tech", "celltype"))
plots & theme(legend.position = "top") & guides(color = guide_legend(nrow = 3, byrow = TRUE, override.aes = list(size = 3)))
image

三、基于Reference-based的方法進(jìn)行整合分析

As an alternative, we introduce here the possibility of specifying one or more of the datasets as the ‘reference’ for integrated analysis, with the remainder designated as ‘query’ datasets. In this workflow, we do not identify anchors between pairs of query datasets, reducing the number of comparisons. For example, when integrating 10 datasets with one specified as a reference, we perform only 9 comparisons. Reference-based integration can be applied to either log-normalized or SCTransform-normalized datasets.

library(Seurat)
library(SeuratData)
library(ggplot2)
library(patchwork)
InstallData("pbmcsca")
data("pbmcsca")
# 分割數(shù)據(jù)集構(gòu)建Seurat對(duì)象列表
pbmc.list <- SplitObject(pbmcsca, split.by = "Method")
# 分別對(duì)每個(gè)對(duì)象進(jìn)行SCTransform標(biāo)準(zhǔn)化處理
for (i in names(pbmc.list)) {
    pbmc.list[[i]] <- SCTransform(pbmc.list[[i]], verbose = FALSE)
}

# 選擇用于數(shù)據(jù)集整合的features
pbmc.features <- SelectIntegrationFeatures(object.list = pbmc.list, nfeatures = 3000)
# 執(zhí)行PrepSCTIntegration處理
pbmc.list <- PrepSCTIntegration(object.list = pbmc.list, anchor.features = pbmc.features)

# 選擇參考數(shù)據(jù)集
reference_dataset <- which(names(pbmc.list) == "10x Chromium (v3)")

# 識(shí)別整合的anchors
pbmc.anchors <- FindIntegrationAnchors(object.list = pbmc.list, normalization.method = "SCT", anchor.features = pbmc.features, reference = reference_dataset)
# 進(jìn)行數(shù)據(jù)整合
pbmc.integrated <- IntegrateData(anchorset = pbmc.anchors, normalization.method = "SCT")

# 數(shù)據(jù)降維可視化
pbmc.integrated <- RunPCA(object = pbmc.integrated, verbose = FALSE)
pbmc.integrated <- RunUMAP(object = pbmc.integrated, dims = 1:30)
plots <- DimPlot(pbmc.integrated, group.by = c("Method", "CellType"))
plots & theme(legend.position = "top") & guides(color = guide_legend(nrow = 4, byrow = TRUE, override.aes = list(size = 2.5)))
image
DefaultAssay(pbmc.integrated) <- "RNA"
# Normalize RNA data for visualization purposes
pbmc.integrated <- NormalizeData(pbmc.integrated, verbose = FALSE)
FeaturePlot(pbmc.integrated, c("CCR7", "S100A4", "GZMB", "GZMK", "GZMH", "TCL1A"))
image
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] splines   parallel  stats4    grid      stats     graphics  grDevices
 [8] utils     datasets  methods   base     

other attached packages:
 [1] loomR_0.2.1.9000            hdf5r_1.3.1                
 [3] R6_2.4.0                    mclust_5.4.5               
 [5] garnett_0.1.16              org.Hs.eg.db_3.8.2         
 [7] AnnotationDbi_1.46.0        pagedown_0.9.1             
 [9] devtools_2.1.0              usethis_1.5.1              
[11] celaref_1.2.0               scran_1.12.1               
[13] scRNAseq_1.10.0             FLOWMAPR_1.2.0             
[15] Seurat_3.1.4.9902           sctransform_0.2.0          
[17] patchwork_0.0.1             cowplot_1.0.0              
[19] DT_0.12                     RColorBrewer_1.1-2         
[21] shinydashboard_0.7.1        ggsci_2.9                  
[23] shiny_1.3.2                 stxBrain.SeuratData_0.1.1  
[25] pbmcsca.SeuratData_3.0.0    panc8.SeuratData_3.0.2     
[27] SeuratData_0.2.1            doParallel_1.0.14          
[29] iterators_1.0.12            binless_0.15.1             
[31] RcppEigen_0.3.3.5.0         foreach_1.4.7              
[33] scales_1.0.0                dplyr_0.8.3                
[35] pagoda2_0.1.0               harmony_1.0                
[37] Rcpp_1.0.2                  conos_1.1.2        

參考來(lái)源:https://satijalab.org/seurat/v3.1/integration.html

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容