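When analyzing single-cell data, most of us are not sure how many clusters is reasonable, so we just run the analysis with the default parameters. Today I'd like to share a method that helps you choose the best K. We'll go through the code and the explanation side by side. The paper is MultiK: an automated tool to determine optimal cluster numbers in single-cell RNA sequencing data, published in Genome Biology (impact factor 13).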
A quick look at how it works
1. First, MultiK takes a gene expression matrix as input, in which cells are the columns and genes are the rows. Each entry of the input matrix corresponds to the expression of a gene in a cell. MultiK subsamples 80% of the cells from the preprocessed input matrix and applies the standard Seurat pipeline to the subsampled matrix 100 times over 40 resolution parameters (from 0.05 to 2.00 with step size 0.05; thus 4000 clustering runs in total: 40 resolution parameters × 100 subsamples). Features are reselected in each run before the cells are clustered. Then, for each K, MultiK aggregates all the clustering runs that give rise to the same K groups, regardless of the resolution parameter, and computes a consensus matrix.
MultiK then evaluates the consensus of clustering using two metrics: (1) for each K, the frequency of runs where that K is observed, and (2) the relative Proportion of Ambiguous Clustering (rPAC) score for each K, which is a variation of the PAC score. (PAC quantifies the proportion of entries in the consensus matrix strictly between the lower and upper bounds that determine ambiguity.) The rPAC criterion addresses the upward bias of PAC towards higher K by better handling the proportion of zeros in the consensus matrix. Combining both measures, MultiK produces a scatter plot that shows the relationship between the frequency of K and (1 – rPAC) for each observed K.
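To make the two scores a bit more concrete, here is a small base-R sketch on a toy consensus matrix. The bounds 0.1 and 0.9 and the exact rPAC normalization (dividing PAC by the fraction of non-zero entries) are my assumptions for illustration; check the MultiK source for the definition it actually uses. The function names `pac_score` and `rpac_score` are hypothetical.

```r
# Toy consensus matrix for 4 cells: entry [i, j] is the fraction of runs
# in which cells i and j ended up in the same cluster.
consensus <- matrix(c(1.0, 1.0, 0.5, 0.0,
                      1.0, 1.0, 0.5, 0.0,
                      0.5, 0.5, 1.0, 0.0,
                      0.0, 0.0, 0.0, 1.0), nrow = 4, byrow = TRUE)

# PAC: share of off-diagonal entries strictly between the two bounds.
# lower/upper are assumed values, not taken from the paper.
pac_score <- function(m, lower = 0.1, upper = 0.9) {
  v <- m[lower.tri(m)]
  mean(v > lower & v < upper)
}

# rPAC (assumed form): PAC rescaled by the fraction of non-zero entries,
# so that a matrix dominated by zeros is not rewarded for it.
rpac_score <- function(m, lower = 0.1, upper = 0.9) {
  v <- m[lower.tri(m)]
  prop_zero <- mean(v == 0)
  if (prop_zero == 1) return(NA_real_)  # no co-clustering at all
  pac_score(m, lower, upper) / (1 - prop_zero)
}

pac  <- pac_score(consensus)
rpac <- rpac_score(consensus)
```

In this toy matrix two of the six cell pairs are ambiguous and three are zero, so the rescaling raises the ambiguity score compared with plain PAC.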

To determine several multi-scale optimal K candidates (mostly 2 and up to 3), MultiK applies a convex hull approach [24]. This is based on the upper right of the smallest convex polygon that encloses all the points. MultiK takes extreme points from this set and uses a frequency cutoff of 100 to select candidate Ks.
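The candidate selection can be sketched in base R with `grDevices::chull`. The numbers below are made up, and reading "upper right of the hull" as "hull vertices that no other point beats on both axes" is my interpretation, not necessarily how MultiK implements it.

```r
# Made-up summary per observed K: run frequency and (1 - rPAC).
ks             <- 2:7
freq           <- c(900, 1200, 300, 700, 150, 50)
one_minus_rpac <- c(0.95, 0.80, 0.60, 0.85, 0.50, 0.90)

# Vertices of the smallest convex polygon enclosing all points.
hull_idx <- grDevices::chull(freq, one_minus_rpac)
on_hull  <- seq_along(ks) %in% hull_idx

# "Upper right" (assumed meaning): no other point is at least as good on
# both axes and strictly better on one.
non_dominated <- sapply(seq_along(ks), function(i) {
  !any(freq >= freq[i] & one_minus_rpac >= one_minus_rpac[i] &
       (freq > freq[i] | one_minus_rpac > one_minus_rpac[i]))
})

# Frequency cutoff of 100, as described in the text.
keep       <- on_hull & non_dominated & freq >= 100
candidates <- ks[keep]
```

With these toy values the hull vertices are K = 2, 3, 6 and 7, and only K = 2 and K = 3 sit on the upper-right side and pass the frequency cutoff.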
2. Once candidate Ks are determined, MultiK performs a second step: it labels each cluster as either a class or a subclass using Statistical Significance of Clustering (SigClust).

MultiK first constructs a dendrogram of the cluster centroids using hierarchical clustering. Then, MultiK runs SigClust on each pair of terminal clusters. Significant terminal pairs in the dendrogram determine classes, and non-significant pairs are subclasses. For consistency of the whole dendrogram, when any split is significant, all parent splits are also considered to be significant. In this way, MultiK assigns class and subclass labels to each terminal cluster (i.e., the leaves of the dendrogram) based on the SigClust significance. This assessment of cluster significance, after deciding on the value of optimal K, helps elucidate the structural relationships between the identified clusters as well.
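The centroid dendrogram is easy to picture with a toy example in base R. This is only a sketch: the simulated data, the Euclidean distance and the average linkage are assumptions of mine, and MultiK may use different choices internally.

```r
set.seed(42)

# Toy data: 3 clusters of 20 "cells" in 5 dimensions (rows are cells here).
expr <- rbind(
  matrix(rnorm(100, mean = 0),   ncol = 5),
  matrix(rnorm(100, mean = 0.5), ncol = 5),
  matrix(rnorm(100, mean = 5),   ncol = 5)
)
labels <- rep(c("A", "B", "C"), each = 20)

# Cluster centroids: per-cluster column sums divided by cluster sizes.
centroids <- rowsum(expr, labels) / as.vector(table(labels))

# Hierarchical clustering of the centroids (distance and linkage assumed).
tree <- hclust(dist(centroids), method = "average")

# The first merge is the closest pair of terminal clusters; this is the
# kind of pair that would then be tested with SigClust.
first_pair <- rownames(centroids)[-tree$merge[1, ]]
```

Here clusters A and B are close together and C is far away, so A and B form the terminal pair. If SigClust found that pair non-significant, A and B would be subclasses of one class, while C would be a class of its own.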
First, load the R packages
library(Seurat)
library(sigclust)
# Install MultiK from GitHub first if needed:
# devtools::install_github("siyao-liu/MultiK")
library(MultiK)
MultiK() is the main function; it implements the subsampling and applies Seurat clustering over multiple resolution parameters.
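The main function MultiK() takes a Seurat object with a normalized expression matrix, plus other parameters that fall back to their default values if not specified. MultiK explores a range of resolution parameters in Seurat clustering (from 0.05 to 2.00 with step size 0.05), aggregates all the clustering runs that give rise to the same K groups regardless of the resolution parameter, and computes a consensus matrix for each K.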
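For intuition, here is how a consensus matrix can be built from a handful of subsampled runs that all produced the same K. This is a base-R sketch on invented labels, not MultiK's internal code; the `runs` structure is hypothetical.

```r
n_cells <- 6

# Three invented runs that all gave K = 2. Each run clustered only a
# subsample of the cells (cell indices) and returned one label per cell.
runs <- list(
  list(cells = c(1, 2, 3, 4, 5), labels = c(1, 1, 2, 2, 2)),
  list(cells = c(1, 2, 3, 5, 6), labels = c(1, 1, 1, 2, 2)),
  list(cells = c(2, 3, 4, 5, 6), labels = c(1, 2, 2, 2, 1))
)

together <- matrix(0, n_cells, n_cells)  # times a pair shared a cluster
sampled  <- matrix(0, n_cells, n_cells)  # times a pair was sampled together

for (run in runs) {
  same <- outer(run$labels, run$labels, "==")
  together[run$cells, run$cells] <- together[run$cells, run$cells] + same
  sampled[run$cells, run$cells]  <- sampled[run$cells, run$cells] + 1
}

# Consensus: co-clustering count divided by co-sampling count
# (0 for pairs that were never sampled together).
consensus <- ifelse(sampled > 0, together / sampled, 0)
```

Cells 1 and 2 share a cluster in every run where both were sampled, so their entry is 1. Cells 2 and 3 share a cluster in only one of three runs, which is exactly the kind of intermediate value that PAC counts as ambiguous.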

Note: MultiK reselects the highly variable genes in every subsampling run. Also, by default MultiK uses 30 principal components and 20 K-nearest neighbors in Seurat clustering.
Running the code
seu <- readRDS(sc_RDS)  # sc_RDS: path to your saved Seurat object (.rds)
Step 1: Run the MultiK main algorithm to determine optimal Ks
Run subsampling and consensus clustering to generate the output used for evaluation (this step can take a long time). For demonstration purposes we run only 10 reps here. For real data, at least 100 reps are recommended.
multik <- MultiK(seu, reps=10)
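To get a feel for why this step is slow, a quick back-of-the-envelope calculation helps. It assumes that `reps` is the number of subsamples and that each subsample is clustered at all 40 resolutions, as the method description suggests; the cell count is a hypothetical example.

```r
# Resolution grid described in the paper: 0.05 to 2.00 in steps of 0.05.
resolutions <- seq(0.05, 2.00, by = 0.05)

reps_demo <- 10    # value used in this walkthrough
reps_full <- 100   # value recommended for real data

# Total Seurat clustering runs (assumes one run per subsample per resolution).
runs_demo <- reps_demo * length(resolutions)
runs_full <- reps_full * length(resolutions)

# Each subsample keeps 80% of the cells; 5000 is a hypothetical dataset size.
n_cells       <- 5000
cells_per_run <- floor(0.8 * n_cells)
```

So the demo setting is roughly a tenth of the work of a full analysis, which is worth keeping in mind before launching 100 reps on a large dataset.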
Make MultiK diagnostic plots:
DiagMultiKPlot(multik$k, multik$consensus)

Step 2: Assign classes and subclasses
Get the clustering labels at optimal K level:
clusters <- getClusters(seu, 3)
Run SigClust at optimal K level:
pval <- CalcSigClust(seu, clusters$clusters)
Make diagnostic plots (these include a dendrogram of the cluster centroids with the pairwise SigClust p values mapped onto the nodes, and a heatmap of the pairwise SigClust p values):
PlotSigClust(seu, clusters$clusters, pval)

Yes, this is the result we were after.
Life is good, and even better with you.