單細胞轉錄組之Seurat包全流程-數(shù)據(jù)過濾、降維分群及可視化

1.Seurat包的安裝及數(shù)據(jù)的準備

1.1 加載需要的包和數(shù)據(jù)

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("Seurat")

library(Seurat)
# devtools::install_github('satijalab/seurat-data')
library(SeuratData)
library(ggplot2)
library(patchwork)
library(dplyr)
##如果有未安裝的包,運行如下命令安裝:install.packages("包的名字")

1.2 創(chuàng)建Seurat對象

對于之前從CellRanger得到的比對結果,讀取sample/outs/filtered_feature_bc_matrix文件夾下的三個文件:barcodes.tsv(1列,為barcode名);genes.tsv(2列,第1列為ENS編號,第2列為基因名);matrix.mtx(3列,第1列為基因編號,第2列為細胞編號,第3列為對應的reads數(shù))

pbmc.data <- Read10X(data.dir = "sample/outs/filtered_feature_bc_matrix")
##使用未經標準化的數(shù)據(jù)創(chuàng)建Seurat對象
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
##project:項目名;過濾條件:min.cells = 3 一個基因最少在3個細胞中檢測到,min.features = 200一個細胞中最少檢測到200個基因

1.3 認識Seurat對象的結構

Seurat對象中儲存了關于這個單細胞項目的幾乎所有信息,包括每個細胞的barcodes,原始表達矩陣以及運行過哪些分析等,后續(xù)對單細胞的分群注釋等信息都是保存在Seurat對象中的。在R中可以使用str函數(shù)查看數(shù)據(jù)結構。

str(pbmc)

Formal class 'Seurat' [package "SeuratObject"] with 13 slots
  ..@ assays      :List of 1
  .. ..$ RNA:Formal class 'Assay' [package "SeuratObject"] with 8 slots
  .. .. .. ..@ counts       :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. .. .. .. ..@ i       : int [1:2282976] 29 73 80 148 163 184 186 227 229 230 ...
  .. .. .. .. .. ..@ p       : int [1:2701] 0 779 2131 3260 4220 4741 5522 6304 7094 7626 ...
  .. .. .. .. .. ..@ Dim     : int [1:2] 13714 2700
  .. .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. .. ..$ : chr [1:13714] "AL627309.1" "AP006222.2" "RP11-206L10.2" "RP11-206L10.9" ...
  .. .. .. .. .. .. ..$ : chr [1:2700] "AAACATACAACCAC-1" "AAACATTGAGCTAC-1" "AAACATTGATCAGC-1" "AAACCGTGCTTCCG-1" ...
  .. .. .. .. .. ..@ x       : num [1:2282976] 1 1 2 1 1 1 1 41 1 1 ...
  .. .. .. .. .. ..@ factors : list()
  .. .. .. ..@ data         :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. .. .. .. ..@ i       : int [1:2282976] 29 73 80 148 163 184 186 227 229 230 ...
  .. .. .. .. .. ..@ p       : int [1:2701] 0 779 2131 3260 4220 4741 5522 6304 7094 7626 ...
  .. .. .. .. .. ..@ Dim     : int [1:2] 13714 2700
  .. .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. .. ..$ : chr [1:13714] "AL627309.1" "AP006222.2" "RP11-206L10.2" "RP11-206L10.9" ...
  .. .. .. .. .. .. ..$ : chr [1:2700] "AAACATACAACCAC-1" "AAACATTGAGCTAC-1" "AAACATTGATCAGC-1" "AAACCGTGCTTCCG-1" ...
  .. .. .. .. .. ..@ x       : num [1:2282976] 1 1 2 1 1 1 1 41 1 1 ...
  .. .. .. .. .. ..@ factors : list()
  .. .. .. ..@ scale.data   : num[0 , 0 ] 
  .. .. .. ..@ key          : chr "rna_"
  .. .. .. ..@ assay.orig   : NULL
  .. .. .. ..@ var.features : logi(0) 
  .. .. .. ..@ meta.features:'data.frame':  13714 obs. of  0 variables
  .. .. .. ..@ misc         : list()
  ..@ meta.data   :'data.frame':    2700 obs. of  4 variables:
  .. ..$ orig.ident  : Factor w/ 1 level "pbmc3k": 1 1 1 1 1 1 1 1 1 1 ...
  .. ..$ nCount_RNA  : num [1:2700] 2419 4903 3147 2639 980 ...
  .. ..$ nFeature_RNA: int [1:2700] 779 1352 1129 960 521 781 782 790 532 550 ...
  .. ..$ percent.mt  : num [1:2700] 3.02 3.79 0.89 1.74 1.22 ...
  ..@ active.assay: chr "RNA"
  ..@ active.ident: Factor w/ 1 level "pbmc3k": 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "names")= chr [1:2700] "AAACATACAACCAC-1" "AAACATTGAGCTAC-1" "AAACATTGATCAGC-1" "AAACCGTGCTTCCG-1" ...
  ..@ graphs      : list()
  ..@ neighbors   : list()
  ..@ reductions  : list()
  ..@ images      : list()
  ..@ project.name: chr "pbmc3k"
  ..@ misc        : list()
  ..@ version     :Classes 'package_version', 'numeric_version'  hidden list of 1
  .. ..$ : int [1:3] 4 1 0
  ..@ commands    : list()
  ..@ tools       : list()

以PBMC為例,訪問Seurat對象中的數(shù)據(jù)

輸入pbmc@會自動彈出Seurat對象的二級結構


訪問Seurat對象

我們訪問最多的就是assays和meta.data

pbmc@assays

$RNA
Assay data with 13714 features for 2700 cells
First 10 features:
 AL627309.1, AP006222.2, RP11-206L10.2, RP11-206L10.9, LINC00115,
NOC2L, KLHL17, PLEKHN1, RP11-54O7.17, HES4 
##pbmc@assays$RNA中包含著2700個細胞的13714個基因的信息,包括原始counts(pbmc@assays$RNA@counts)

head(pbmc@meta.data)
                 orig.ident nCount_RNA nFeature_RNA percent.mt
AAACATACAACCAC-1     pbmc3k       2419          779  3.0177759
AAACATTGAGCTAC-1     pbmc3k       4903         1352  3.7935958
AAACATTGATCAGC-1     pbmc3k       3147         1129  0.8897363
AAACCGTGCTTCCG-1     pbmc3k       2639          960  1.7430845
AAACCGTGTATGCG-1     pbmc3k        980          521  1.2244898
AAACGCACTGGTAC-1     pbmc3k       2163          781  1.6643551

#pbmc[[]] 可進入meta.data 相當于pbmc@meta.data,,對meta.data下級數(shù)據(jù)調用方式跟數(shù)據(jù)框取列類似,pbmc@meta.data$percent.mt,另外pbmc[["percent.mt"]]也可以表示選擇percent.mt這一列

2.數(shù)據(jù)的過濾

2.1 確定數(shù)據(jù)過濾條件

#計算線粒體基因占總基因數(shù)目的百分比,pattern后可以是正則表達式。
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

#小提琴圖展示meta.data中的數(shù)據(jù),ncol 表示顯示多個圖時的列數(shù)
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
nFeature_RNA, nCount_RNA, percent.mt含量

可以看到基本上線粒體基因都在5%以下,nFeature基本在200-2000之間,我們可以根據(jù)圖示結果篩選去除線粒體基因和選擇出基因的表達量區(qū)間。

2.2 檢查數(shù)據(jù)的可用性

FeatureScatter 可視化meta.data中數(shù)據(jù)的關系,其中細胞的顏色由其identity class決定,圖上方數(shù)字表示其皮爾遜相關系數(shù)

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt") 
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") 
plot1 + plot2
元素相關性

可以看到nCount_RNA和線粒體基因間無相關,表明測序得到的Count基本都是細胞的功能基因,nCount_RNA和nFeature_RNA強相關,符合邏輯。

2.3 數(shù)據(jù)過濾

subset函數(shù)用于取子集,subset(x, subset, select, drop = FALSE, ...),x表示操作對象,subset表示所取子集的邏輯值

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

3.數(shù)據(jù)的標準化

Seurat包中自帶了標準化函數(shù),NormalizeData。

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 1e4) 

#normalization.method有LogNormalize,CLR,RC三種。
#LogNormalize:每個細胞的基因數(shù)數(shù)除以該細胞的總基因數(shù),再乘以scale.factor。然后使用log(x+1)進行自然對數(shù)轉換。
#CLR:應用居中的對數(shù)比率變換
#RC: 相對計數(shù)。每個細胞的功能計數(shù)除以該細胞的總計數(shù),再乘以scale.factor。沒有應用log轉換。對于每百萬計數(shù)(CPM)設置規(guī)模。系數(shù)= 1 e6

4.識別差異基因

#選取2000個差異基因
pbmc <- FindVariableFeatures(pbmc, selection.method = 'vst', nfeatures = 2000)

#selection.method:vst,mean.var.plot (mvp),dispersion (disp)。
#nfeatures:選擇差異基因的數(shù)量;僅在selection.method設置為“dispersion”或“vst”時使用。

# 選擇前10的差異基因
top10 <- head(VariableFeatures(pbmc), 10)
# 帶標簽與否的展示差異基因
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
高變基因

根據(jù)項目背景需要,決定是否用ScaleData 函數(shù)去除細胞周期或線粒體基因的影響,若差異基因中出現(xiàn)線粒體基因或者細胞周期相關基因且與項目背景無關,可以選擇使用此函數(shù)。

#消除線粒體基因的影響
pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt")

5.降維聚類分群

5.1根據(jù)差異基因進行主成分分析

pbmc <- RunPCA(pbmc, features = VariableFeatures( pbmc))  

5.2計算給定數(shù)據(jù)的近鄰

FindNeighbors函數(shù)計算給定數(shù)據(jù)集的k.param近鄰。也可以選擇(compute.SNN),通過計算每個cell與其k.param近鄰之間的鄰域重疊(Jaccard index)來構造一個共享近鄰圖。參考鏈接:Seurat識別細胞類群的原理(FindNeighbors和FindClusters),FindNeighbors {Seurat}

#dim輸入降維的維度,resolution分辨率,判斷近鄰距離大小,值越低,聚類越少。

pbmc <- FindNeighbors(pbmc, dims = 1:10)

?FindNeighbors
Description:
    Constructs a Shared Nearest Neighbor (SNN) Graph for a given
     dataset. We first determine the k-nearest neighbors of each cell.
     We use this knn graph to construct the SNN graph by calculating
     the neighborhood overlap (Jaccard index) between every cell and
     its k.param nearest neighbors.
這個地方說明,這個函數(shù)首先是計算每個細胞的KNN,也就是計算每個細胞之間的相互距離,依據(jù)細胞之間距離的graph來構建snn graph(依據(jù)細胞之間“鄰居”的overlop)
這里有三個問題:1、knn是什么,2、Jaccard index又是什么  3、鄰居的判定
我們來看看參數(shù)(主要參數(shù)):
distance.matrix: Boolean value of whether the provided matrix is a
          distance matrix; note, for objects of class ‘dist’, this
          parameter will be set automatically
      這個參數(shù)我們通常不會設置,但是默認是TRUE。
k.param: Defines k for the k-nearest neighbor algorithm
      這個參數(shù)就是用來定義最相近的幾個細胞作為鄰居,默認是20
compute.SNN: also compute the shared nearest neighbor graph
     計算共享鄰居的數(shù)量,一般不設置
prune.SNN: Sets the cutoff for acceptable Jaccard index when computing
          the neighborhood overlap for the SNN construction. Any edges
          with values less than or equal to this will be set to 0 and
          removed from the SNN graph. Essentially sets the strigency of
          pruning (0 - no pruning, 1 - prune everything).
     在計算SNN構造的鄰域重疊時,為可接受的Jaccard index設置截止值。 值小于或等于此值的任何邊將被設置為0并從SNN圖中刪除。 本質上設置修剪的嚴格程度(0-不修剪,1-修剪所有內容)。
nn.method: Method for nearest neighbor finding. Options include: rann,
          annoy
      這個參數(shù)提供了如何判斷鄰居的方法,提供的可選是rann和annoy,**這個我們在后面討論**。
      annoy.metric: Distance metric for annoy. Options include: euclidean,
          cosine, manhattan, and hamming
      (annoy距離的方式)
force.recalc: Force recalculation of SNN
      SNN強制重新計算,一般不設置

5.3依據(jù)FindNeighbors函數(shù)的結果,計算分群聚類

pbmc <- FindClusters(pbmc, resolution = 0.5)

?FindClusters
Description:

     Identify clusters of cells by a shared nearest neighbor (SNN)
     modularity optimization based clustering algorithm. First
     calculate k-nearest neighbors and construct the SNN graph. Then
     optimize the modularity function to determine clusters. For a full
     description of the algorithms, see Waltman and van Eck (2013) _The
     European Physical Journal B_. Thanks to Nigel Delaney
     (evolvedmicrobe@github) for the rewrite of the Java modularity
     optimizer code in Rcpp!
這個說明,依據(jù)SNN來識別類群,當然算法很復雜,我們可以參考給的網(wǎng)址。
我們來看看主要參數(shù)
resolution: Value of the resolution parameter, use a value above
          (below) 1.0 if you want to obtain a larger (smaller) number
          of communities.
這個參數(shù)可以理解為清晰度,值越低,可以容納更少的共享鄰居數(shù)量,聚類數(shù)也會變少。
modularity.fxn: 計算模塊系數(shù)函數(shù),1為標準函數(shù);2為備選函數(shù),這里沒有具體說明是什么函數(shù),我認為1是上面提到的Kronecker delta函數(shù)。
method: Method for running leiden (defaults to matrix which is fast
          for small datasets). Enable method = "igraph" to avoid
          casting large data to a dense matrix.
      這個參數(shù)表示leiden算法的計算方式,(我對算法是小白~,求大神告知)
algorithm: 模塊系數(shù)優(yōu)化算法,1使用原始Louvain算法;2使用Louvain algorithm with multilevel refinement;3使用SLM算法;4使用Leiden算法(注:4需要額外安裝插件)
n.start: 隨機開始的數(shù)量,默認是10
random.seed: 隨機數(shù)種子,默認是0

5.4查看前5個細胞的聚類ID

head(Idents(pbmc), 5)

AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 
               0                3                2 
AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1 
               1                6 
Levels: 0 1 2 3 4 5 6 7 8

5.5計算每個聚類包含的細胞數(shù)

table(pbmc$seurat_clusters) 

  0   1   2   3   4   5   6   7   8 
709 480 429 342 316 162 154  32  14 

5.6降維

pbmc <- RunUMAP(pbmc, dims = 1:10)

#note that you can set `label = TRUE` or use the LabelClusters function to help label individual clusters

DimPlot(pbmc, reduction = 'umap',label = TRUE)

降維過后會得到如下所示的分群情況:


降維

6.單細胞亞群注釋

此時需要根據(jù)生物學相關知識,可視化各單細胞亞群的標記基因,并根據(jù)其marker確定細胞分群??筛鶕?jù)參考文獻或者這個數(shù)據(jù)庫。

#FindMarkers查找亞群的標記基因
# 發(fā)現(xiàn)聚類1的所有biomarkers
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)

# 查找將聚類1與聚類2和3區(qū)分的所有標記基因
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(2, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)

# 與所有其他亞群相比,找到每個亞群的標記,僅報告陽性細胞
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
top10markers<-pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)

cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)

DoHeatmap(pbmc , features = top10markers$gene, size = 3)
亞群marker基因

根據(jù)熱圖結果,在CellMarker網(wǎng)站可查詢,熱圖中分不清楚的,還可以單獨對該亞群運行FindMarkers函數(shù)后作圖查詢。結果如下

Cluster ID Markers Cell Type
0 IL7R, CCR7 Naive CD4+ T
1 IL7R, S100A4 Memory CD4+
2 CD14, LYZ CD14+ Mono
3 MS4A1 B
4 CD8A CD8+ T
5 FCGR3A, MS4A7 FCGR3A+ Mono
6 GNLY, NKG7 NK
7 FCER1A, CST3 DC
8 PPBP Platelet
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", 
"LYZ", "PPBP", "CD8A"))
標記基因映射圖

6.1更改亞群注釋

#新id與之前id(0-8)一一對應
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T","B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")

names(new.cluster.ids) <- levels(pbmc)
#更改seuratd對象中的Idents
pbmc <- RenameIdents(pbmc, new.cluster.ids)

DimPlot(pbmc, reduction = 'umap', 
        label = TRUE, pt.size = 0.5) + NoLegend()

6.2差異基因在不同亞群中可視化

##以下features可換成任意基因
##小提琴圖
VlnPlot(pbmc, features = c("MS4A1", "CD79A")) 

##坐標映射圖
FeaturePlot(pbmc, features = c("MS4A1", "CD79A"))

##峰巒圖
RidgePlot(pbmc, features = c("MS4A1", "CD79A"), ncol = 1)


features= c('IL7R', 'CCR7','CD14', 'LYZ',  'IL7R', 'S100A4',"MS4A1", "CD8A",'FOXP3',
            'FCGR3A', 'MS4A7', 'GNLY', 'NKG7',
            'FCER1A', 'CST3','PPBP')

##氣泡圖
DotPlot(pbmc, features = unique(features)) + RotatedAxis()

#downsample 在每個細胞亞群中均抽樣100個細胞做熱圖
DoHeatmap(subset(pbmc, downsample = 100), 
          features = features, size = 3)
小提琴圖
坐標映射圖
峰巒圖
氣泡圖
熱圖

參考來源
Seurat 包圖文詳解 | 單細胞轉錄組(scRNA-seq)分析02_白墨石的博客-CSDN博客_seurat單細胞測序
Seurat識別細胞類群的原理(FindNeighbors和FindClusters) - 簡書 (jianshu.com)

致謝
I thank Dr.Jianming Zeng(University of Macau), and all the members of his bioinformatics team, biotrainee, for generously sharing their experience and codes.

THE END

?著作權歸作者所有,轉載或內容合作請聯(lián)系作者
【社區(qū)內容提示】社區(qū)部分內容疑似由AI輔助生成,瀏覽時請結合常識與多方信息審慎甄別。
平臺聲明:文章內容(如有圖片或視頻亦包括在內)由作者上傳并發(fā)布,文章內容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務。

相關閱讀更多精彩內容

友情鏈接更多精彩內容