Seurat包學習筆記(一):Guided Clustering Tutorial

Seurat is an R package designed for QC, analysis, and exploration of single-cell RNA-seq data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single-cell transcriptomic measurements, and to integrate diverse types of single-cell data.

image.png

Seurat3新增功能特色:

  • Improved and expanded methods for single-cell integration. Seurat v3 implements new methods to identify ‘anchors’ across diverse single-cell data types, in order to construct harmonized references, or to transfer information across experiments.

  • Improved methods for normalization. Seurat v3 includes support for sctransform, a new modeling approach for the normalization of single-cell data, described in a second preprint. Compared to standard log-normalization, sctransform effectively removes technically-driven variation while preserving biological heterogeneity.

  • An efficiently restructured Seurat object, with an emphasis on multi-modal data. We have carefully re-designed the structure of the Seurat object, with clearer documentation, and a flexible framework to easily switch between RNA, protein, cell hashing, batch-corrected / integrated, or imputed data.

安裝并加載所需的R包

目前最新版的Seurat包都是基于3.0以上的,可以直接通過install.packages命令進行安裝。

# 設置R包安裝鏡像
options(repos="https://mirrors.tuna.tsinghua.edu.cn/CRAN/")
install.packages('Seurat')
library(Seurat)
library(dplyr)
library(patchwork)
# 查看Seurat包的版本信息
packageVersion("Seurat")

如果想安裝早期2.0版本的Seurat包可以通過devtools包進行安裝

# 先安裝devtools包
install.packages('devtools')
# 使用devtools安裝指定版本的Seurat包
devtools::install_version(package = 'Seurat', version = package_version('2.3.4'))
library(Seurat)

構建Seurat對象

本教程使用的是來自10X Genomics平臺測序的外周血單核細胞(PBMC)數據集,這個數據集是用Illumina NextSeq 500平臺進行測序的,里面包含了2,700個細胞的RNA-seq數據。
這個原始數據是用CellRanger軟件進行基因表達定量處理的,最終生成了一個UMI的count矩陣。矩陣里的列是不同barcode指示的細胞,行是測序到的不同基因。接下來,我們將使用這個count矩陣創(chuàng)建Seurat對象。

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "/home/dongwei/scRNA-seq/data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
# 初步過濾:每個細胞中至少檢測到200個基因,每個基因至少在3個細胞中表達
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
## An object of class Seurat 
## 13714 features across 2700 samples within 1 assay 
## Active assay: RNA (13714 features)

查看原始count矩陣的信息

# Lets examine a few genes in the first thirty cells
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
## 3 x 30 sparse Matrix of class "dgCMatrix"
##                                                                    
## CD3D  4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2  3 . . . . . 3 4 1 5
## TCL1A . .  . . . . . . 1 . . . . . . . . . . .  . 1 . . . . . . . .
## MS4A1 . 6  . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .

表達矩陣中的"."表示某一基因在某一細胞中沒有檢測到表達,因為scRNA-seq的表達矩陣中會存在很多表達量為0的數據,Seurat會使用稀疏矩陣進行數據的存儲,可以有效減少存儲的內存。

標準的數據預處理流程

Seurat可以對原始count矩陣進行質量控制,選擇特定的質控條件進行細胞和基因的過濾,質控的一般指標包括:

  • 每個細胞中能檢測到的基因數目:
    1)低質量的細胞或空的droplets中能檢測到的基因比較少
    2)細胞doublets 或者 multiplets 中會檢測到比較高數目的基因count數
  • 每個細胞內能檢測到的分子數
    1)細胞內檢測到的線粒體基因的比例
    2)低質量/死細胞中通常會有比較高的線粒體污染

在Seurat中可以使用PercentageFeatureSet函數計算每個細胞中線粒體的含量:在人類參考基因中線粒體基因是以“MT-”開頭的,而在小鼠中是以“mt-”開頭的。

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
# Show QC metrics for the first 5 cells
head(pbmc@meta.data, 5)
orig.ident    nCount_RNA  nFeature_RNA    percent.mt
AAACATACAACCAC    pbmc3k  2419    779 3.0177759
AAACATTGAGCTAC    pbmc3k  4903    1352    3.7935958
AAACATTGATCAGC    pbmc3k  3147    1129    0.8897363
AAACCGTGCTTCCG    pbmc3k  2639    960 1.7430845
AAACCGTGTATGCG    pbmc3k  980 521 1.2244898

可視化QC指標,并用它們來過濾細胞:
1)將unique基因count數超過2500,或者小于200的細胞過濾掉
2)把線粒體含量超過5%以上的細胞過濾掉

# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
image.png

我們還可以使用FeatureScatter函數來對不同特征-特征之間的關系進行可視化

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
image.png

根據QC指標進行細胞和基因的過濾

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

數據的歸一化

默認情況下,Seurat使用global-scaling的歸一化方法,稱為“LogNormalize”,這種方法是利用總的表達量對每個細胞里的基因表達值進行歸一化,乘以一個scale factor(默認值是10000),再用log轉換一下。歸一化后的數據存放在pbmc[["RNA"]]@data里。

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

鑒定高可變基因(特征選擇)

Seurat使用FindVariableFeatures函數鑒定高可變基因,這些基因在PBMC不同細胞之間的表達量差異很大(在一些細胞中高表達,在另一些細胞中低表達)。默認情況下,會返回2,000個高可變基因用于下游的分析,如PCA等。

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
image.png

數據的標準化

Seurat使用ScaleData函數對歸一化后的count矩陣進行一個線性的變換(“scaling”),將數據進行標準化:
1)shifting每個基因的表達,使細胞間的平均表達為0
2)scaling每個基因的表達,使細胞間的差異為1
ScaleData默認對之前鑒定到的2000個高可變基因進行標準化,也可以通過vars.to.regress參數指定其他的變量對數據進行標準化,表達矩陣進行scaling后,其結果存儲在pbmc[["RNA"]]@scale.data中。

pbmc <- ScaleData(pbmc)
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt")

進行PCA線性降維

Seurat使用RunPCA函數對標準化后的表達矩陣進行PCA降維處理,默認情況下,只對前面選出的2000個高可變基因進行線性降維,也可以通過feature參數指定想要降維的數據集。

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7

Seurat可以使用VizDimReduction, DimPlot, 和DimHeatmap函數對PCA的結果進行可視化

VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
image.png
DimPlot(pbmc, reduction = "pca")
image.png
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
image.png
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
image.png

選擇PCA降維的維數用于后續(xù)的分析

Seurat可以使用兩種方法確定PCA降維的維數用于后續(xù)的聚類分析:

  • 使用JackStrawPlot函數
    使用JackStraw函數計算每個PC的P值的分布,顯著的PC會有較低的p-value
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
# 使用JackStrawPlot函數進行可視化
JackStrawPlot(pbmc, dims = 1:15)
image.png
  • 使用ElbowPlot函數
    使用ElbowPlot函數查看在哪一個PC處出現(xiàn)平滑的掛點:
ElbowPlot(pbmc)
image.png

細胞的聚類分群

Seurat使用圖聚類的方法對降維后的表達數據進行聚類分群。
Briefly, these methods embed cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’.

pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638 
## Number of edges: 96033
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8720
## Number of communities: 9
## Elapsed time: 0 seconds
# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)
## AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG AAACCGTGTATGCG 
##              1              3              1              2              6 
## Levels: 0 1 2 3 4 5 6 7 8

非線性降維可視化(UMAP/tSNE)

Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore these datasets. The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space.

# If you haven't installed UMAP, you can do so via reticulate::py_install(packages = 'umap-learn')
# UMAP降維可視化
pbmc <- RunUMAP(pbmc, dims = 1:10)
# note that you can set `label = TRUE` or use the LabelClusters function to help label individual clusters
DimPlot(pbmc, reduction = "umap")
image.png
#tSNE降維可視化
pbmc <- RunTSNE(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "tsne", label = TRUE)
image.png

鑒定不同類群之間的差異表達基因

Seurat使用FindMarkersFindAllMarkers函數進行差異表達基因的篩選,可以通過test.use參數指定不同的差異表達基因篩選的方法。

# find all markers of cluster 1
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)
       p_val    avg_logFC   pct.1   pct.2   p_val_adj
IL32    0   0.8373872   0.948   0.464   0
LTB    0   0.8921170   0.981   0.642   0
CD3D    0   0.6436286   0.919   0.431   0
IL7R    0   0.8147082   0.747   0.325   0
LDHB    0   0.6253110   0.950   0.613   0

# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
        p_val    avg_logFC   pct.1   pct.2   p_val_adj
FCGR3A    0   2.963144    0.975   0.037   0
IFITM3    0   2.698187    0.975   0.046   0
CFD    0   2.362381    0.938   0.037   0
CD68    0   2.087366    0.926   0.036   0
RP11-290F20.3    0   1.886288    0.840   0.016   0

# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
p_val    avg_logFC   pct.1   pct.2   p_val_adj   cluster gene
0    0.7300635   0.901   0.594   0   0   LDHB
0    0.9219135   0.436   0.110   0   0   CCR7
0    0.8921170   0.981   0.642   0   1   LTB
0    0.8586034   0.422   0.110   0   1   AQP3
0    3.8608733   0.996   0.215   0   2   S100A9
0    3.7966403   0.975   0.121   0   2   S100A8
0    2.9875833   0.936   0.041   0   3   CD79A
0    2.4894932   0.622   0.022   0   3   TCL1A
0    2.1220555   0.985   0.240   0   4   CCL5
0    2.0461687   0.587   0.059   0   4   GZMK
0    2.2954931   0.975   0.134   0   5   FCGR3A
0    2.1388125   1.000   0.315   0   5   LST1
0    3.3462278   0.961   0.068   0   6   GZMB
0    3.6898996   0.961   0.131   0   6   GNLY
0    2.6832771   0.812   0.011   0   7   FCER1A
0    1.9924275   1.000   0.513   0   7   HLA-DPB1
0    5.0207262   1.000   0.010   0   8   PF4
0    5.9443347   1.000   0.024   0   8   PPBP

Marker基因的可視化

Seurat可以使用VlnPlot,F(xiàn)eaturePlot,RidgePlot,DotPlot和DoHeatmap等函數對marker基因的表達進行可視化

VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
image.png
# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
image.png
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
image.png
RidgePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
image.png
DotPlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
image.png
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
image.png

對聚類后的不同類群進行注釋

我們可以基于已有的生物學知識,根據一些特異的marker基因不細胞類群進行注釋

Cluster ID    Markers Cell Type
0    IL7R, CCR7  Naive CD4+ T
1    IL7R, S100A4    Memory CD4+
2    CD14, LYZ   CD14+ Mono
3    MS4A1   B
4    CD8A    CD8+ T
5    FCGR3A, MS4A7   FCGR3A+ Mono
6    GNLY, NKG7  NK
7    FCER1A, CST3    DC
8    PPBP    Platelet

以上是根據PBMC中不同細胞類群的一些marker基因進行細胞分群注釋的結果

# 根據marker基因進行分群注釋
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
# 細胞分群的重命名
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
image.png
# 保存分析的結果
saveRDS(pbmc, file = "../output/pbmc3k_final.rds")
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: defaultBLAS/LAPACK: /usr/lib64/R/lib/libRblas.solocale: 
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  
[9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
attached base packages: 
[1] splines   parallel  stats4    grid      stats     graphics  grDevices 
[8] utils     datasets  methods   base     
other attached packages: 
[1] loomR_0.2.1.9000            hdf5r_1.3.1                 
[3] R6_2.4.0                    mclust_5.4.5                
[5] garnett_0.1.16              org.Hs.eg.db_3.8.2          
[7] AnnotationDbi_1.46.0        pagedown_0.9.1              
[9] devtools_2.1.0              usethis_1.5.1              
[11] celaref_1.2.0               scran_1.12.1               
[13] scRNAseq_1.10.0             FLOWMAPR_1.2.0             
[15] Seurat_3.1.4.9902           sctransform_0.2.0          
[17] patchwork_0.0.1             cowplot_1.0.0              
[19] DT_0.12                     RColorBrewer_1.1-2         
[21] shinydashboard_0.7.1        ggsci_2.9                  
[23] shiny_1.3.2                 stxBrain.SeuratData_0.1.1  
[25] pbmcsca.SeuratData_3.0.0    panc8.SeuratData_3.0.2     
[27] SeuratData_0.2.1            doParallel_1.0.14          
[29] iterators_1.0.12            binless_0.15.1             
[31] RcppEigen_0.3.3.5.0         foreach_1.4.7              
[33] scales_1.0.0                dplyr_0.8.3                
[35] pagoda2_0.1.0               harmony_1.0                
[37] Rcpp_1.0.2                  conos_1.1.2        
Seurat Object對象解讀.png

思維導圖鏈接?:https://mubu.com/doc/6kVetytOc27

參考來源:https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html

最后編輯于
?著作權歸作者所有,轉載或內容合作請聯(lián)系作者
【社區(qū)內容提示】社區(qū)部分內容疑似由AI輔助生成,瀏覽時請結合常識與多方信息審慎甄別。
平臺聲明:文章內容(如有圖片或視頻亦包括在內)由作者上傳并發(fā)布,文章內容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務。

相關閱讀更多精彩內容

友情鏈接更多精彩內容