Seuratis an R package designed forQC, analysis, and exploration of single-cell RNA-seq data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single-cell transcriptomic measurements, and to integrate diverse types of single-cell data.

Seurat3新增功能特色:
Improved and expanded methods for single-cell integration. Seurat v3 implements new methods to identify ‘anchors’ across diverse single-cell data types, in order to construct harmonized references, or to transfer information across experiments.
Improved methods for normalization. Seurat v3 includes support for sctransform, a new modeling approach for the normalization of single-cell data, described in a second preprint. Compared to standard log-normalization, sctransform effectively removes technically-driven variation while preserving biological heterogeneity.
An efficiently restructured Seurat object, with an emphasis on multi-modal data. We have carefully re-designed the structure of the Seurat object, with clearer documentation, and a flexible framework to easily switch between RNA, protein, cell hashing, batch-corrected / integrated, or imputed data.
安裝并加載所需的R包
目前最新版的Seurat包都是基于3.0以上的,可以直接通過install.packages命令進行安裝。
# 設置R包安裝鏡像
options(repos="https://mirrors.tuna.tsinghua.edu.cn/CRAN/")
install.packages('Seurat')
library(Seurat)
library(dplyr)
library(patchwork)
# 查看Seurat包的版本信息
packageVersion("Seurat")
如果想安裝早期2.0版本的Seurat包可以通過devtools包進行安裝
# 先安裝devtools包
install.packages('devtools')
# 使用devtools安裝指定版本的Seurat包
devtools::install_version(package = 'Seurat', version = package_version('2.3.4'))
library(Seurat)
構建Seurat對象
本教程使用的是來自10X Genomics平臺測序的外周血單核細胞(PBMC)數據集,這個數據集是用Illumina NextSeq 500平臺進行測序的,里面包含了2,700個細胞的RNA-seq數據。
這個原始數據是用CellRanger軟件進行基因表達定量處理的,最終生成了一個UMI的count矩陣。矩陣里的列是不同barcode指示的細胞,行是測序到的不同基因。接下來,我們將使用這個count矩陣創(chuàng)建Seurat對象。
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "/home/dongwei/scRNA-seq/data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
# 初步過濾:每個細胞中至少檢測到200個基因,每個基因至少在3個細胞中表達
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features)
查看原始count矩陣的信息
# Lets examine a few genes in the first thirty cells
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
## 3 x 30 sparse Matrix of class "dgCMatrix"
##
## CD3D 4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2 3 . . . . . 3 4 1 5
## TCL1A . . . . . . . . 1 . . . . . . . . . . . . 1 . . . . . . . .
## MS4A1 . 6 . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .
表達矩陣中的"."表示某一基因在某一細胞中沒有檢測到表達,因為scRNA-seq的表達矩陣中會存在很多表達量為0的數據,Seurat會使用稀疏矩陣進行數據的存儲,可以有效減少存儲的內存。
標準的數據預處理流程
Seurat可以對原始count矩陣進行質量控制,選擇特定的質控條件進行細胞和基因的過濾,質控的一般指標包括:
- 每個細胞中能檢測到的基因數目:
1)低質量的細胞或空的droplets中能檢測到的基因比較少
2)細胞doublets 或者 multiplets 中會檢測到比較高數目的基因count數 - 每個細胞內能檢測到的分子數
1)細胞內檢測到的線粒體基因的比例
2)低質量/死細胞中通常會有比較高的線粒體污染
在Seurat中可以使用PercentageFeatureSet函數計算每個細胞中線粒體的含量:在人類參考基因中線粒體基因是以“MT-”開頭的,而在小鼠中是以“mt-”開頭的。
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
# Show QC metrics for the first 5 cells
head(pbmc@meta.data, 5)
orig.ident nCount_RNA nFeature_RNA percent.mt
AAACATACAACCAC pbmc3k 2419 779 3.0177759
AAACATTGAGCTAC pbmc3k 4903 1352 3.7935958
AAACATTGATCAGC pbmc3k 3147 1129 0.8897363
AAACCGTGCTTCCG pbmc3k 2639 960 1.7430845
AAACCGTGTATGCG pbmc3k 980 521 1.2244898
可視化QC指標,并用它們來過濾細胞:
1)將unique基因count數超過2500,或者小于200的細胞過濾掉
2)把線粒體含量超過5%以上的細胞過濾掉
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

我們還可以使用FeatureScatter函數來對不同特征-特征之間的關系進行可視化
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

根據QC指標進行細胞和基因的過濾
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
數據的歸一化
默認情況下,Seurat使用global-scaling的歸一化方法,稱為“LogNormalize”,這種方法是利用總的表達量對每個細胞里的基因表達值進行歸一化,乘以一個scale factor(默認值是10000),再用log轉換一下。歸一化后的數據存放在pbmc[["RNA"]]@data里。
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
鑒定高可變基因(特征選擇)
Seurat使用FindVariableFeatures函數鑒定高可變基因,這些基因在PBMC不同細胞之間的表達量差異很大(在一些細胞中高表達,在另一些細胞中低表達)。默認情況下,會返回2,000個高可變基因用于下游的分析,如PCA等。
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2

數據的標準化
Seurat使用ScaleData函數對歸一化后的count矩陣進行一個線性的變換(“scaling”),將數據進行標準化:
1)shifting每個基因的表達,使細胞間的平均表達為0
2)scaling每個基因的表達,使細胞間的差異為1
ScaleData默認對之前鑒定到的2000個高可變基因進行標準化,也可以通過vars.to.regress參數指定其他的變量對數據進行標準化,表達矩陣進行scaling后,其結果存儲在pbmc[["RNA"]]@scale.data中。
pbmc <- ScaleData(pbmc)
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt")
進行PCA線性降維
Seurat使用RunPCA函數對標準化后的表達矩陣進行PCA降維處理,默認情況下,只對前面選出的2000個高可變基因進行線性降維,也可以通過feature參數指定想要降維的數據集。
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL
## Negative: MALAT1, LTB, IL32, IL7R, CD2
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
## Negative: NKG7, PRF1, CST7, GZMB, GZMA
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1
## Negative: PPBP, PF4, SDPR, SPARC, GNG11
## PC_ 4
## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1
## Negative: VIM, IL7R, S100A6, IL32, S100A8
## PC_ 5
## Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY
## Negative: LTB, IL7R, CKB, VIM, MS4A7
Seurat可以使用VizDimReduction, DimPlot, 和DimHeatmap函數對PCA的結果進行可視化
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

DimPlot(pbmc, reduction = "pca")

DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)

DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)

選擇PCA降維的維數用于后續(xù)的分析
Seurat可以使用兩種方法確定PCA降維的維數用于后續(xù)的聚類分析:
- 使用JackStrawPlot函數
使用JackStraw函數計算每個PC的P值的分布,顯著的PC會有較低的p-value
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
# 使用JackStrawPlot函數進行可視化
JackStrawPlot(pbmc, dims = 1:15)

- 使用ElbowPlot函數
使用ElbowPlot函數查看在哪一個PC處出現(xiàn)平滑的掛點:
ElbowPlot(pbmc)

細胞的聚類分群
Seurat使用圖聚類的方法對降維后的表達數據進行聚類分群。
Briefly, these methods embed cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’.
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 96033
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8720
## Number of communities: 9
## Elapsed time: 0 seconds
# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)
## AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG AAACCGTGTATGCG
## 1 3 1 2 6
## Levels: 0 1 2 3 4 5 6 7 8
非線性降維可視化(UMAP/tSNE)
Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore these datasets. The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space.
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages = 'umap-learn')
# UMAP降維可視化
pbmc <- RunUMAP(pbmc, dims = 1:10)
# note that you can set `label = TRUE` or use the LabelClusters function to help label individual clusters
DimPlot(pbmc, reduction = "umap")

#tSNE降維可視化
pbmc <- RunTSNE(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "tsne", label = TRUE)

鑒定不同類群之間的差異表達基因
Seurat使用FindMarkers和FindAllMarkers函數進行差異表達基因的篩選,可以通過test.use參數指定不同的差異表達基因篩選的方法。
# find all markers of cluster 1
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)
p_val avg_logFC pct.1 pct.2 p_val_adj
IL32 0 0.8373872 0.948 0.464 0
LTB 0 0.8921170 0.981 0.642 0
CD3D 0 0.6436286 0.919 0.431 0
IL7R 0 0.8147082 0.747 0.325 0
LDHB 0 0.6253110 0.950 0.613 0
# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
p_val avg_logFC pct.1 pct.2 p_val_adj
FCGR3A 0 2.963144 0.975 0.037 0
IFITM3 0 2.698187 0.975 0.046 0
CFD 0 2.362381 0.938 0.037 0
CD68 0 2.087366 0.926 0.036 0
RP11-290F20.3 0 1.886288 0.840 0.016 0
# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene
0 0.7300635 0.901 0.594 0 0 LDHB
0 0.9219135 0.436 0.110 0 0 CCR7
0 0.8921170 0.981 0.642 0 1 LTB
0 0.8586034 0.422 0.110 0 1 AQP3
0 3.8608733 0.996 0.215 0 2 S100A9
0 3.7966403 0.975 0.121 0 2 S100A8
0 2.9875833 0.936 0.041 0 3 CD79A
0 2.4894932 0.622 0.022 0 3 TCL1A
0 2.1220555 0.985 0.240 0 4 CCL5
0 2.0461687 0.587 0.059 0 4 GZMK
0 2.2954931 0.975 0.134 0 5 FCGR3A
0 2.1388125 1.000 0.315 0 5 LST1
0 3.3462278 0.961 0.068 0 6 GZMB
0 3.6898996 0.961 0.131 0 6 GNLY
0 2.6832771 0.812 0.011 0 7 FCER1A
0 1.9924275 1.000 0.513 0 7 HLA-DPB1
0 5.0207262 1.000 0.010 0 8 PF4
0 5.9443347 1.000 0.024 0 8 PPBP
Marker基因的可視化
Seurat可以使用VlnPlot,F(xiàn)eaturePlot,RidgePlot,DotPlot和DoHeatmap等函數對marker基因的表達進行可視化
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))

# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)

FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))

RidgePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))

DotPlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))

top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()

對聚類后的不同類群進行注釋
我們可以基于已有的生物學知識,根據一些特異的marker基因不細胞類群進行注釋
Cluster ID Markers Cell Type
0 IL7R, CCR7 Naive CD4+ T
1 IL7R, S100A4 Memory CD4+
2 CD14, LYZ CD14+ Mono
3 MS4A1 B
4 CD8A CD8+ T
5 FCGR3A, MS4A7 FCGR3A+ Mono
6 GNLY, NKG7 NK
7 FCER1A, CST3 DC
8 PPBP Platelet
以上是根據PBMC中不同細胞類群的一些marker基因進行細胞分群注釋的結果
# 根據marker基因進行分群注釋
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
# 細胞分群的重命名
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

# 保存分析的結果
saveRDS(pbmc, file = "../output/pbmc3k_final.rds")
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: defaultBLAS/LAPACK: /usr/lib64/R/lib/libRblas.solocale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] splines parallel stats4 grid stats graphics grDevices
[8] utils datasets methods base
other attached packages:
[1] loomR_0.2.1.9000 hdf5r_1.3.1
[3] R6_2.4.0 mclust_5.4.5
[5] garnett_0.1.16 org.Hs.eg.db_3.8.2
[7] AnnotationDbi_1.46.0 pagedown_0.9.1
[9] devtools_2.1.0 usethis_1.5.1
[11] celaref_1.2.0 scran_1.12.1
[13] scRNAseq_1.10.0 FLOWMAPR_1.2.0
[15] Seurat_3.1.4.9902 sctransform_0.2.0
[17] patchwork_0.0.1 cowplot_1.0.0
[19] DT_0.12 RColorBrewer_1.1-2
[21] shinydashboard_0.7.1 ggsci_2.9
[23] shiny_1.3.2 stxBrain.SeuratData_0.1.1
[25] pbmcsca.SeuratData_3.0.0 panc8.SeuratData_3.0.2
[27] SeuratData_0.2.1 doParallel_1.0.14
[29] iterators_1.0.12 binless_0.15.1
[31] RcppEigen_0.3.3.5.0 foreach_1.4.7
[33] scales_1.0.0 dplyr_0.8.3
[35] pagoda2_0.1.0 harmony_1.0
[37] Rcpp_1.0.2 conos_1.1.2

思維導圖鏈接?:https://mubu.com/doc/6kVetytOc27