Seurat-Tutorial part3學(xué)習(xí)

Part2 中我們對進(jìn)行歸一化的數(shù)據(jù)進(jìn)行了降維處理,接下來就是對數(shù)據(jù)進(jìn)行以下的操作了

確定數(shù)據(jù)集的維數(shù)

我們要確定有多少個Principal Components(PCs),鑒定significant PCs方法是----那些富集了低 p-value的Features

這里使用了2個函數(shù)來幫助我們挑選合適的PCs數(shù)
  • 1.JackStrawPlot()

We randomly permute a subset of the data (1% by default) and rerun PCA, constructing a ‘null distribution’ of feature scores, and repeat this procedure. We identify ‘significant’ PCs as those who have a strong enrichment of low p-value features.
我們隨機(jī)置換數(shù)據(jù)的一部分(默認(rèn)為1%),然后重新運(yùn)行PCA,以構(gòu)建feature Score的“零分布”,然后重復(fù)此過程。 我們將“重要”的PC識別為具有豐富的低p-value特征基因的PC。特征重要性是給數(shù)據(jù)中每個特征打一個分?jǐn)?shù),分?jǐn)?shù)越高說明特征越重要。通過使用模型的要素重要性屬性,可以獲取數(shù)據(jù)集每個要素的要素重要性。

# NOTE: This process can take a long time for big datasets, comment out for expediency. 
# approximate techniques such as those implemented in ElbowPlot() can be used to reduce computation time
#推測這個方法計算耗時,相對來說要更準(zhǔn)確些,但是可以使用ElbowPlot做近似計算
pbmc <- JackStraw(pbmc, num.replicate = 100)
  #|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=05m 39s
 # |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)#

https://satijalab.org/seurat/reference/jackstraw

JackStraw(
  object,#Seurat object
  reduction = "pca",
  assay = NULL,
  dims = 20,#Number of PCs to compute significance for
  num.replicate = 100,#Number of replicate samplings to perform
  prop.freq = 0.01,
  verbose = TRUE,
  maxit = 1000
)
對上面一步得到的pbmc可視化和結(jié)果分析
JackStrawPlot(pbmc, dims = 1:15)

JackStrawPlot-dim選擇15

JackStrawPlot()函數(shù)提供了一個可視化工具,用于比較挑選了的PC和均勻分布的PC之間的差異。dim 15的時候,發(fā)現(xiàn)是PC在前10~12之后,重要性顯著下降。 對于dim 20的情況下,增加到15之后,基本上是低于這個均勻分別虛線值的(原因不明白)

虛線:PC的p-values是均勻分布的,沒有進(jìn)行挑選

JackStrawPlot(pbmc, dims = 1:20)#這里的dims參數(shù)必須是小于JackStraw函數(shù)計算的PCs數(shù),默認(rèn)是20,最多是20,可以自行設(shè)置
JackStrawPlot-dim選擇20
  • 2.ElbowPlot
    使用這個函數(shù)可以替代上面一個函數(shù)的功能,得到一個大概的主成分?jǐn)?shù)量值
ElbowPlot(pbmc)#pbmc數(shù)據(jù)是來自之前RunPCA函數(shù)運(yùn)行后的數(shù)據(jù)
#a ranking of principle components based on the percentage of variance explained by each one
#we can observe an ‘elbow’ around PC9-10, suggesting that the majority of true signal is captured in the first 10 PCs

ElbowPlot

鑒定數(shù)據(jù)集的真實(shí)維度是很不確定的,Seurat的作者是建議考慮3種方法。第一個受到更多監(jiān)督,探索PC以確定異質(zhì)性的相關(guān)來源,例如可以與GSEA結(jié)合使用。 第二種實(shí)現(xiàn)基于隨機(jī)空模型的統(tǒng)計測試,但是對于大型數(shù)據(jù)集而言非常耗時,并且可能無法返回清晰的PC截止值。 第三種是常用的啟發(fā)式方法,可以立即進(jìn)行計算。

這里作者是選擇了10個 ,但是建議考慮下面幾個因素

  1. 樹突狀細(xì)胞和NK愛好者或許能認(rèn)知到一點(diǎn):與這兩個維度(PCs12和13)的主成分密切相關(guān)的基因 或許能定義罕見的免疫亞群(如MZB1這個基因就是漿細(xì)胞樣DC的Marker),但是這些都特別少于,在沒有先驗(yàn)知識的情況下,很難從背景噪音中區(qū)分
  2. 建議使用不同數(shù)量的PC(10~50)重復(fù)進(jìn)行下游分析,結(jié)果通常沒有太大的區(qū)別
  3. 建議使用更多的dim進(jìn)行分析。對本例子,如果只選擇10PCs進(jìn)行下游分析,則會對結(jié)果更不利

Cluster the cells

  • we first construct a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using theFindNeighbors() function, and takes as input the previously defined dimensionality of the dataset (first 10 PCs).
pbmc <- FindNeighbors(pbmc, dims = 1:10) #這個數(shù)值是之前所選擇的PCs數(shù)量
##Computing nearest neighbor graph
##Computing SNN
  • To cluster the cells, we next apply modularity optimization techniques such as the Louvain algorithm (default) or SLM http://dx.doi.org/10.1088/1742-5468/2008/10/P10008
    to iteratively group cells together, with the goal of optimizing the standard modularity function
pbmc <- FindClusters(pbmc, resolution = 0.5)#resolution值越大,會得到更多的Clusters;對于數(shù)目是3K個細(xì)胞的數(shù)據(jù)集來說設(shè)置為0.4~1.2之間,均能返回良好結(jié)果
##Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

##Number of nodes: 2638
##Number of edges: 95965

##Running Louvain algorithm...
##0%   10   20   30   40   50   60   70   80   90   100%
##[----|----|----|----|----|----|----|----|----|----|
#**************************************************|
##Maximum modularity in 10 random starts: 0.8723
##Number of communities: 9
##Elapsed time: 0 seconds

The clusters can be found using the Idents() function.

head(Idents(pbmc), 5)
AAACATACAACCAC-1 
               2 
AAACATTGAGCTAC-1 
               3 
AAACATTGATCAGC-1 
               2 
AAACCGTGCTTCCG-1 
               1 
AAACCGTGTATGCG-1 
               6 
Levels: 0 1 2 3 4 5 6 7 8

非線性降維 (UMAP/tSNE)

The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. As input to the UMAP and tSNE, we suggest using the same PCs as input to the clustering analysis.
把上一步基于圖的聚類中的細(xì)胞共位于這些降維圖中。作為UMAP 和 tSNE的輸入數(shù)據(jù),是建議使用 Cluster the cells 那一步得到的PCs來分析聚類;圖上每個點(diǎn)代表一個細(xì)胞。

pbmc <- RunUMAP(pbmc, dims = 1:10, label = "T")
21:34:50 UMAP embedding parameters a = 0.9922 b = 1.112
21:34:50 Read 2638 rows and found 10 numeric columns
21:34:50 Using Annoy for neighbor search, n_neighbors = 30
21:34:50 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
21:34:50 Writing NN index file to temp file C:\Users\ADMINI~1\AppData\Local\Temp\RtmpKYasTJ\file260382e791d
21:34:50 Searching Annoy index using 1 thread, search_k = 3000
21:34:52 Annoy recall = 100%
21:34:52 Commencing smooth kNN distance calibration using 1 thread
21:34:52 Initializing from normalized Laplacian + noise
21:34:53 Commencing optimization for 500 epochs, with 105124 positive edges
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
21:35:03 Optimization finished

UMAP

https://satijalab.org/seurat/reference/runumap
RunUMAP 函數(shù)使用

RunUMAP(object, ...)

# S3 method for default
RunUMAP(
  object,
  reduction.key = "UMAP_",
  assay = NULL,
  reduction.model = NULL,
  return.model = FALSE,
  umap.method = "uwot",
  n.neighbors = 30L,
  n.components = 2L,
  metric = "cosine",
  n.epochs = NULL,
  learning.rate = 1,
  min.dist = 0.3,
  spread = 1,
  set.op.mix.ratio = 1,
  local.connectivity = 1L,
  repulsion.strength = 1,
  negative.sample.rate = 5,
  a = NULL,
  b = NULL,
  uwot.sgd = FALSE,
  seed.use = 42,###這個可以默認(rèn)下,防止運(yùn)行后圖片有點(diǎn)差異,但不影響整體結(jié)果展示
  metric.kwds = NULL,
  angular.rp.forest = FALSE,
  verbose = TRUE,
  ...
)
另外一種方法非線性降維
pbmc <- RunTSNE(pbmc, dims = 1:10,seed.use = 42, label = "T")
DimPlot(pbmc, reduction = "tsne")
RunTSNE

視覺上給人的感覺是UMAP更加緊湊點(diǎn),細(xì)胞的點(diǎn)更聚集一點(diǎn)

Finding differentially expressed features (cluster biomarkers)

怎么給上面的圖-聚類的clustering進(jìn)行標(biāo)注,也就是怎么找到對應(yīng)的Bio-Marker
Seurat can help you find markers that define clusters via differential expression. By default, it identifies positive and negative markers of a single cluster (specified in ident.1), compared to all other cells. FindAllMarkers() automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.

使用函數(shù)FindMarkers()

https://satijalab.org/seurat/reference/findmarkers
Finds markers (differentially expressed genes) for identity classes

FindMarkers(object, ...)

# S3 method for default
FindMarkers(
  object,
  slot = "data",
  counts = numeric(),
  cells.1 = NULL,
  cells.2 = NULL,
  features = NULL,
  logfc.threshold = 0.25,
  test.use = "wilcox",
  min.pct = 0.1,
  min.diff.pct = -Inf,
  verbose = TRUE,
  only.pos = FALSE,
  max.cells.per.ident = Inf,
  random.seed = 1,
  latent.vars = NULL,
  min.cells.feature = 3,
  min.cells.group = 3,
  pseudocount.use = 1,
  fc.results = NULL,
  ...
)

counts Count matrix if using scale.data for DE tests. This is used for computing pct.1 and pct.2 and for filtering features based on fraction expressing
logfc.threshold Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. Default is 0.25 Increasing logfc.threshold speeds up the function, but can miss weaker signals.
ident.1 Identity class to define markers for; pass an object of class phylo or 'clustertree' to find markers for a node in a cluster tree; passing 'clustertree' requires BuildClusterTree to have been run
min.pct only test genes that are detected in a minimum fraction of min.pct cells in either of the two populations. Meant to speed up the function by not testing genes that are very infrequently expressed. Default is 0.1
only.pos Only return positive markers (FALSE by default)

對于本例子,使用這個來找Marker
  • 比如可以找下cluster2 的marker
# find all markers of cluster 2
cluster2.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25)
head(cluster2.markers, n = 5)
    ##   p_val      avg_log2FC pct.1 pct.2   p_val_adj
##IL32 2.593535e-91  1.2154360 0.949 0.466 3.556774e-87
##LTB  7.994465e-87  1.2828597 0.981 0.644 1.096361e-82
##CD3D 3.922451e-70  0.9359210 0.922 0.433 5.379250e-66
##IL7R 1.130870e-66  1.1776027 0.748 0.327 1.550876e-62
##LDHB 4.082189e-65  0.8837324 0.953 0.614 5.598314e-61
     
  • 找到cluster5和cluster0~3之間的差異基因
# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
##|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=27s 
head(cluster5.markers, n = 5)
head(cluster5.markers, n = 5)
##                      p_val avg_log2FC pct.1 pct.2     p_val_adj
##FCGR3A        2.150929e-209   4.267579 0.975 0.039 2.949784e-205
##IFITM3        6.103366e-199   3.877105 0.975 0.048 8.370156e-195
##CFD           8.891428e-198   3.411039 0.938 0.037 1.219370e-193
##CD68          2.374425e-194   3.014535 0.926 0.035 3.256286e-190
##RP11-290F20.3 9.308287e-191   2.722684 0.840 0.016 1.276538e-186
  • 找所有的Clusters 的 Marker
# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)# Only return positive markers 
##Calculating cluster 0
##For a more efficient implementation of the Wilcoxon Rank Sum Test,
##(default method for FindMarkers) please install the limma package
--------------------------------------------
install.packages('BiocManager')
BiocManager::install('limma')
--------------------------------------------
##After installation of limma, Seurat will automatically use the more efficient implementation (no further action necessary).
##This message will be shown once per session
  ##|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=10s  
##Calculating cluster 1
##  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=21s  
##Calculating cluster 2
##  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=12s  
##Calculating cluster 3
##  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=09s  
##Calculating cluster 4
##  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=12s  
##Calculating cluster 5
 ## |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=30s  
##Calculating cluster 6
##  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=24s  
##Calculating cluster 7
##  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=32s  
##Calculating cluster 8
  ##|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=19s


對得出的pbmc.markers數(shù)據(jù)進(jìn)行查看

pbmc.markers
table(pbmc.markers$cluster)

#  0   1   2   3   4   5   6   7   8 
#163 391 171 147 171 608 363 633 242 

%>%數(shù)據(jù)傳遞

top_n(x, n, wt)
#x A data frame
#n Number of rows to return for top_n(), fraction of rows to return for top_frac(). 
#If n is positive, selects the top rows. If negative, selects the bottom rows. 
#If x is grouped, this is the number (or fraction) of rows per group. Will include more rows if there are ties.
#wt (Optional). The variable to use for ordering. If not specified, defaults to the last variable in the tbl.

首先pbmc.markers按照avg_log2FC對Features進(jìn)行排名,并提取前2行

pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)

得到的結(jié)果是這個


挑選出來的top2基因,每個cluster有2行,總共18行
對某些基因進(jìn)行查看
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
image.png
# you can plot raw counts as well
VlnPlot(pbmc, features = c("MS4A1", "CD79A"), slot = "counts", log = TRUE)
對其counts值畫圖
對于這個圖,"MS4A1"和 "CD79A"這兩個都聚在cluster3中,都能代表cluster3
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)

隨意取出的基因Features

對于NKG7這個基因主要存在cluster4,6,中;對于PF4這個feature是聚在cluster9中的,從上述的表格中18行基因第17個PF4中可以看出

FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", 
    "CD8A"))
對這9個基因進(jìn)行cluster映射

把每個cluster的top基因進(jìn)行擴(kuò)大到10個,而非上述的2個

top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)#和上述的比較類似,n=10
dim(top10)
#[1] 90  7
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
top10-cluster0-CCR7

top10-cluster1-CD14,LYZ

top10-cluster2-IL7R

top10-cluster3-MS4A1

top10-cluster4-CD8A

top10-cluster5-FCGR3A

top10-cluster6-GNLY

top10-cluster7-FCER1A

top10-cluster8-PPBP
Cluster ID Markers Cell Type
0 IL7R, CCR7 Naive CD4+ T
1 CD14, LYZ CD14+ Mono
2 IL7R, S100A4 Memory CD4+
3 MS4A1 B
4 CD8A CD8+ T
5 FCGR3A, MS4A7 FCGR3A+ Mono
6 GNLY, NKG7 NK
7 FCER1A, CST3 DC
8 PPBP Platelet

特別幸運(yùn)的是,上述9個cluster的top10 gene都能找到對應(yīng)的Cell Type
疑問是有同一個gene對應(yīng)2個cluster,但是僅僅代表一個 Cell type該怎么選擇呢

NKG7

DoHeatmap(pbmc, features = top10$gene) + NoLegend()
對top10挑選出的90個基因進(jìn)行熱圖繪制

最后一步,將對應(yīng)的Cluster和Cell Type對應(yīng)在圖上

DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
沒有對Cluster ID和細(xì)胞類型對映的圖
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono", 
    "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
Cluster ID和細(xì)胞類型對映的圖

https://www.cell.com/fulltext/S0092-8674(15)00549-8
https://satijalab.org/seurat/articles/pbmc3k_tutorial.html
https://blog.csdn.net/weixin_39549312/article/details/111391539

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容