Smartseq2 scRNA小鼠發(fā)育學(xué)習(xí)筆記-2-根據(jù)表達(dá)矩陣進(jìn)行分群

劉小澤寫于19.10.23
筆記目的:根據(jù)生信技能樹的單細(xì)胞轉(zhuǎn)錄組課程探索Smartseq2技術(shù)及發(fā)育相關(guān)的分析
課程鏈接在:http://jm.grazy.cn/index/mulitcourse/detail.html?cid=55
這次會(huì)介紹如何針對(duì)表達(dá)矩陣進(jìn)行分群,不一定需要包裝好的函數(shù)。對(duì)應(yīng)視頻第三單元5-6講

前言

目的是得到這個(gè)圖

將會(huì)用到來自作者的包裝好的analysis_functions.R代碼:

https://github.com/IStevant/XX-XY-mouse-gonad-scRNA-seq/blob/master/scripts/analysis_functions.R

這個(gè)代碼有1800多行,將會(huì)貫穿整個(gè)分析,正是這些DIY的代碼,才讓文章的圖顯得與眾不同

1 首先創(chuàng)造表達(dá)矩陣

load(file="female_rpkm.Robj")

## 去掉重復(fù)細(xì)胞
#(例如:同一個(gè)細(xì)胞建庫兩次,這里作者用“rep”進(jìn)行了標(biāo)記) 
grep("rep",colnames(female_rpkm))
colnames(female_rpkm)[256:257]
female_rpkm <- female_rpkm[,!colnames(female_rpkm) %in% grep("rep",colnames(female_rpkm), value=TRUE)]

## 只保留編碼基因(去掉類似:X5430419D17Rik、BC003331等)
prot_coding_genes <- read.csv(file="prot_coding.csv", row.names=1)
females <- female_rpkm[rownames(female_rpkm) %in% as.vector(prot_coding_genes$x),]
save(females,file = 'female_rpkm.Rdata')

2 然后使用包裝好的代碼進(jìn)行tSNE

2.1 對(duì)細(xì)胞操作=》細(xì)胞發(fā)育時(shí)期的獲取

細(xì)胞是從6個(gè)時(shí)間點(diǎn)取出的,于是先找到這6個(gè)時(shí)間點(diǎn)

load('../female_rpkm.Rdata')
> dim(females)
[1] 21083   563

> head(colnames(females))
[1] "E10.5_XX_20140505_C01_150331_1" "E10.5_XX_20140505_C02_150331_1"
[3] "E10.5_XX_20140505_C03_150331_1" "E10.5_XX_20140505_C04_150331_2"
[5] "E10.5_XX_20140505_C06_150331_2" "E10.5_XX_20140505_C07_150331_3"

## 取下劃線分隔的第一部分
female_stages <- sapply(strsplit(colnames(females), "_"), `[`, 1)
# 或者
female_stages <- sapply(strsplit(colnames(females), "_"), 
                        function(x)x[1])
# 再或者
female_stages <- stringr::str_split(colnames(females),'_', simplify = T)[,1]

names(female_stages) <- colnames(females)
> table(female_stages)
female_stages
E10.5 E11.5 E12.5 E13.5 E16.5    P6 
   68   100   103    99    85   108 

2.2 對(duì)基因操作=》基因過濾與統(tǒng)計(jì)

去掉在所有細(xì)胞都不表達(dá)的基因
> (dim(females))
[1] 21083   563
> females <- females[rowSums(females)>0,]
> (dim(females))
[1] 16765   563

可以看到去掉了4000多個(gè)

計(jì)算各種統(tǒng)計(jì)指標(biāo)
# 利用apply函數(shù)對(duì)每行(每個(gè)基因)進(jìn)行統(tǒng)計(jì)
mean_per_gene <- apply(females, 1, mean, na.rm = TRUE) 
sd_per_gene <- apply(females, 1, sd, na.rm = TRUE) 
mad_per_gene <- apply(females, 1, mad, na.rm = TRUE) 
cv = sd_per_gene/mean_per_gene
library(matrixStats)
var_per_gene <- rowVars(as.matrix(females))
cv2=var_per_gene/mean_per_gene^2
# 存儲(chǔ)統(tǒng)計(jì)結(jié)果
cv_per_gene <- data.frame(mean = mean_per_gene,
                          sd = sd_per_gene,
                          mad=mad_per_gene,
                          var=var_per_gene,
                          cv=cv,
                          cv2=cv2)
rownames(cv_per_gene) <- rownames(females)
head(cv_per_gene)
# 根據(jù)表達(dá)量過濾統(tǒng)計(jì)結(jié)果
cv_per_gene=cv_per_gene[cv_per_gene$mean>1,]
# 簡易的可視化
with(cv_per_gene,plot(log10(mean),log10(cv2)))

CV值,它表示變異系數(shù)(coefficient of variation)。變異系數(shù)又稱離散系數(shù)或相對(duì)偏差 ,我們肯定都知道標(biāo)準(zhǔn)偏差,也就是sd值,sd描述了數(shù)據(jù)值偏離算術(shù)平均值的程度。這個(gè)相對(duì)偏差CV描述的是標(biāo)準(zhǔn)偏差與平均值之比。

  • sd值,它和均值mean、方差var一樣,都是對(duì)一維數(shù)據(jù)進(jìn)行的分析,如果出現(xiàn)兩組數(shù)據(jù)測量尺度差別太大或數(shù)據(jù)量綱存在差異的話,直接用標(biāo)準(zhǔn)差就不合適了
  • CV變異系數(shù)就可以解決這個(gè)問題,它利用原始數(shù)據(jù)標(biāo)準(zhǔn)差和原始數(shù)據(jù)平均值的比值來各自消除尺度與量綱的差異。
cv-gene plot
復(fù)雜一點(diǎn)的統(tǒng)計(jì)可視化:

其實(shí)就是求每列之間的相關(guān)性

library(psych)
pairs.panels(cv_per_gene, 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
)

可以得到不同統(tǒng)計(jì)指標(biāo)的關(guān)系

不同統(tǒng)計(jì)指標(biāo)的關(guān)系
再用作者包裝的函數(shù):getMostVarGenes()
females_data <- getMostVarGenes(females, fitThr=2)
> dim(females_data)
[1] 822 563

這個(gè)函數(shù)也找了822個(gè)變化比較大的基因,用于下游分析,這其實(shí)也很像Seurat的FindVariableFeatures()做的事情

找了822個(gè)變化比較大的基因
females_data <- log(females_data+1)
> females_data[1:4,1:4]
         E10.5_XX_20140505_C01_150331_1 E10.5_XX_20140505_C02_150331_1
Ngfr                                  0                              0
Slc22a18                              0                              0
Tspan32                               0                              0
Gmpr                                  0                              0
         E10.5_XX_20140505_C03_150331_1 E10.5_XX_20140505_C04_150331_2
Ngfr                          0.4204863                       3.619946
Slc22a18                      0.0000000                       0.000000
Tspan32                       0.0000000                       0.000000
Gmpr                          0.0000000                       0.000000

save(females_data,file = 'females_hvg_matrix.Rdata')

2.3 6個(gè)發(fā)育時(shí)期RtSNE分析

先是PCA

針對(duì)上面的822個(gè)HVGs進(jìn)行操作

female_sub_pca <- FactoMineR::PCA(
  t(females_data), 
  ncp = ncol(females_data), 
  graph=FALSE
)

然后挑選最顯著的主成分,作為tSNE的輸入

記得在Seurat中是使用ElbowPlot() 關(guān)注肘部的PC,這里不需要觀察,直接返回最優(yōu)解

significant_pcs <- jackstraw::permutationPA(
  female_sub_pca$ind$coord, 
  B = 100, 
  threshold = 0.05, 
  verbose = TRUE, 
  seed = NULL
)$r
> significant_pcs
[1] 9
然后使用上面jackstraw挑出的顯著主成分進(jìn)行tSNE
# 6個(gè)時(shí)期給定6個(gè)顏色
female_stagePalette <-  c(
  "#2754b5", 
  "#8a00b0", 
  "#d20e0f", 
  "#f77f05", 
  "#f9db21",
  "#43f14b"
)
female_t_sne <- run_plot_tSNE(
  pca=female_sub_pca,
  pc=significant_pcs,
  iter=5000,
  conditions=female_stages,
  colours=female_stagePalette
)
自定義函數(shù)結(jié)果

2.4 根據(jù)PCA結(jié)果進(jìn)行層次聚類

采用的方法是:Hierarchical Clustering On Principle Components (HCPC)

# 使用9個(gè)顯著主成分重新跑PCA
res.pca <- FactoMineR::PCA(
  t(females_data), 
  ncp = significant_pcs, 
  graph=FALSE
)
# 作者根據(jù)經(jīng)驗(yàn)認(rèn)為分成4群比較好解釋,于是設(shè)置4
res.hcpc <- FactoMineR::HCPC(
  res.pca, 
  graph = FALSE,
  min=4
)
# 得到分群結(jié)果
female_clustering <- res.hcpc$data.clust$clust
> table(female_clustering)
female_clustering
  1   2   3   4 
 90 240 190  43 
# 重新命名
female_clustering <- paste("C", female_clustering, sep="")
names(female_clustering) <- rownames(res.hcpc$data.clust)
# 將C1和C2調(diào)換位置
female_clustering[female_clustering=="C1"] <- "C11"
female_clustering[female_clustering=="C2"] <- "C22"
female_clustering[female_clustering=="C22"] <- "C1"
female_clustering[female_clustering=="C11"] <- "C2"
> table(female_clustering)
female_clustering
 C1  C2  C3  C4 
240  90 190  43 

write.csv(female_clustering, file="female_clustering.csv")
還是基于之前tSNE坐標(biāo),對(duì)聚類得到的4個(gè)cluster可視化
# 為4種cluster設(shè)置顏色
female_clusterPalette <- c(
  "#560047", 
  "#a53bad", 
  "#eb6bac", 
  "#ffa8a0"
)

> head(female_t_sne)
                                  tSNE_1    tSNE_2  cond
E10.5_XX_20140505_C01_150331_1 -2.714291 -24.47912 E10.5
E10.5_XX_20140505_C02_150331_1 -1.580757 -26.45072 E10.5
E10.5_XX_20140505_C03_150331_1 -1.577123 -25.36753 E10.5
E10.5_XX_20140505_C04_150331_2 -6.677577 -20.00208 E10.5
E10.5_XX_20140505_C06_150331_2  3.442235 -23.32570 E10.5
E10.5_XX_20140505_C07_150331_3  3.793953 -23.33955 E10.5

# 作者包裝的函數(shù)
female_t_sne_new_clusters <- plot_tSNE(
  tsne=female_t_sne, 
  conditions=female_clustering, 
  colours= female_clusterPalette
)
ggsave('tSNE_cluster.pdf')
自定義函數(shù)得到的4個(gè)cluster

3 使用Seurat進(jìn)行tSNE

上面我們使用了RPKM矩陣,下面的Seurat將會(huì)使用原始表達(dá)矩陣。當(dāng)然也是推薦使用原始矩陣進(jìn)行分析的

3.1 下載原始表達(dá)矩陣

鏈接在:https://raw.githubusercontent.com/IStevant/XX-XY-mouse-gonad-scRNA-seq/master/data/female_count.Robj

load(file="../female_count.Robj")
load('../female_rpkm.Rdata')
# 直接對(duì)細(xì)胞和基因過濾
female_count <- female_count[rownames(female_count) %in% rownames(females),!colnames(female_count) %in% grep("rep",colnames(female_count), value=TRUE)]

> female_count[1:3,1:3]
      E10.5_XX_20140505_C01_150331_1 E10.5_XX_20140505_C02_150331_1
eGFP                           19582                            526
Gnai3                           2218                            122
Pbsn                               0                              0
      E10.5_XX_20140505_C03_150331_1
eGFP                            4786
Gnai3                              4
Pbsn                               0

save(female_count,file = '../female_count.Rdata')

3.2 對(duì)細(xì)胞操作=》細(xì)胞發(fā)育時(shí)期的獲取

load('../female_count.Rdata')
female_stages <- sapply(strsplit(colnames(female_count), "_"), `[`, 1)
names(female_stages) <- colnames(female_count)
> table(female_stages)
female_stages
E10.5 E11.5 E12.5 E13.5 E16.5    P6 
   68   100   103    99    85   108 

3.3 使用Seurat V3

構(gòu)建對(duì)象
sce_female <- CreateSeuratObject(counts = female_count, 
                           project = "sce_female", 
                           min.cells = 1, min.features = 0)
> sce_female
An object of class Seurat 
16765 features across 563 samples within 1 assay 
Active assay: RNA (16765 features)
添加樣本注釋信息
sce_female <- AddMetaData(object = sce_female, 
                      metadata = apply(female_count, 2, sum), 
                      col.name = 'nUMI_raw')
sce_female <- AddMetaData(object = sce_female, 
                      metadata = female_stages, 
                      col.name = 'female_stages')
數(shù)據(jù)歸一化
sce_female <- NormalizeData(sce_female)
sce_female[["RNA"]]@data[1:3,1:3]
找差異基因HVGs
sce_female <- FindVariableFeatures(sce_female, 
                             selection.method = "vst", 
                             nfeatures = 2000)
# HVGs可視化
VariableFeaturePlot(sce_female)
seurat3 HVGs可視化
seurat3_HVGs <- VariableFeatures(sce_female)
# 檢查與之前得到的HVGs重合度
load('females_hvg_matrix.Rdata')
load('seurat3_HVGs.Rdata')
length(intersect(rownames(females_data),seurat3_HVGs))
# 結(jié)果和之前822個(gè)HVGs有434個(gè)重合
數(shù)據(jù)標(biāo)準(zhǔn)化
# 默認(rèn)只對(duì)FindVariableFeatures得到的HVGs進(jìn)行操作
sce_female <- ScaleData(object = sce_female, 
                    vars.to.regress = c('nUMI_raw'), 
                    model.use = 'linear', 
                    use.umi = FALSE)
PCA降維
sce_female <- RunPCA(sce_female, 
                     features = VariableFeatures(object = sce_female))
降維后聚類
# 這里可以多選一些PCs
sce_female <- FindNeighbors(sce_female, dims = 1:20)
sce_female <- FindClusters(sce_female, resolution = 0.3)
進(jìn)行tSNE
ElbowPlot(sce_female)
sce_female_tsne <- RunTSNE(sce_female, dims = 1:9)
tSNE結(jié)果可視化
# 6個(gè)發(fā)育時(shí)間
DimPlot(object = sce_female_tsne, reduction = "tsne",
        group.by = 'female_stages')
# 4個(gè)cluster
DimPlot(sce_female_tsne, reduction = "tsne")
seurat結(jié)果
比較兩次的聚類結(jié)果
cluster1 <- read.csv('female_clustering.csv')
cluster2 <- as.data.frame(Idents(sce_female_tsne))
# 把它們放在一起比較,前提條件是它們的行名相同
> identical(cluster1[,1],rownames(cluster2))
[1] TRUE

> table(cluster1[,2],cluster2[,1])
    
       0   1   2   3
  C1 224   3  13   0
  C2   6   0  84   0
  C3  12 177   0   1
  C4   0   0   0  43

這也說明了,不同方法雖然選擇的HVGs數(shù)量不同,也不完全一樣,聚類的參數(shù)也不同,但最后真正的生物學(xué)意義是不會(huì)去掉的。只能說,最后選多少群是根據(jù)分析的人根據(jù)自己的理解去解釋,只要參數(shù)變化,就會(huì)有各種不同的結(jié)果

4 使用更簡單的函數(shù)去分群

rm(list = ls()) 
options(warn=-1) 
options(stringsAsFactors = F)
load('../female_rpkm.Rdata')

# 根據(jù)分群獲得顏色
cluster <- read.csv('female_clustering.csv')
color <- rainbow(4)[as.factor(cluster[,2])]
> table(color)
color
#00FFFFFF #8000FFFF #80FF00FF #FF0000FF 
      190        43        90       240 

# 取前1000個(gè)sd最大的基因作為HVGs
choosed_count <- females
# 表達(dá)矩陣過濾
choosed_count <- choosed_count[apply(choosed_count, 1, sd)>0,]
choosed_count <- choosed_count[names(head(sort(apply(choosed_count, 1, sd),decreasing = T),1000)),]
進(jìn)行PCA分析
pca_out <- prcomp(t(choosed_count),scale. = T)

>   pca_out$x[1:3,1:3]
                                    PC1        PC2        PC3
E10.5_XX_20140505_C01_150331_1 13.21660 -4.1600782  1.5287334
E10.5_XX_20140505_C02_150331_1 13.73109 -0.2848806 -0.8443587
E10.5_XX_20140505_C03_150331_1 10.89558 -0.2720221 -3.3839651

library(ggfortify)
autoplot(pca_out, col=color) +theme_classic()+ggtitle('PCA plot')
進(jìn)行tSNE
library(Rtsne)
# 依舊選前9個(gè)
tsne_out <- Rtsne(pca_out$x[,1:9], perplexity = 10,
                    pca = F, max_iter = 2000,
                    verbose = T)
tsnes_cord <- tsne_out$Y
colnames(tsnes_cord) <- c('tSNE1','tSNE2')
ggplot(tsnes_cord, aes(x=tSNE1, y = tSNE2)) + geom_point(col=color) +   theme_classic()+ggtitle('tSNE plot')

除了之前的HCPC和seurat分群,還可以利用DBSCAN、kmeans分群

# 這個(gè)運(yùn)行會(huì)非常慢!
if(T){
  library(Rtsne)
  N_tsne <- 10
  # 其實(shí)這里應(yīng)該多運(yùn)行一些,例如N_tsne <- 50,只不過時(shí)間太久,就當(dāng)做是示例吧
  tsne_out <- list(length = N_tsne)
  KL <- vector(length = N_tsne)
  set.seed(1234)
  for(k in 1:N_tsne)
  {
    tsne_out[[k]]<-Rtsne(t(log2(females+1)),initial_dims=30,verbose=FALSE,check_duplicates=FALSE,
                         perplexity=27, dims=2,max_iter=5000)
    KL[k]<-tail(tsne_out[[k]]$itercosts,1)
    print(paste0("FINISHED ",k," TSNE ITERATION"))
  }
  names(KL) <- c(1:N_tsne)
  opt_tsne <- tsne_out[[as.numeric(names(KL)[KL==min(KL)])]]$Y
}
# DBSCAN結(jié)果
library(dbscan)
plot(opt_tsne,  col=dbscan(opt_tsne,eps=3.1)$cluster, 
     pch=19, xlab="tSNE dim 1", ylab="tSNE dim 2")
DBSCAN結(jié)果
# kmeans結(jié)果
plot(opt_tsne,  col=kmeans(opt_tsne,centers = 4)$clust, 
     pch=19, xlab="tSNE dim 1", ylab="tSNE dim 2")
kmeans結(jié)果

比較它們的差異

# 其中kmeans是4群
> table(kmeans(opt_tsne,centers = 4)$clust,dbscan(opt_tsne,eps=3.5)$cluster)
   
      0   1   2   3   4
  1   2   0   0 206   0
  2   1 106   0   0   0
  3   0  93  10   0   0
  4   1 138   0   1   5
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

友情鏈接更多精彩內(nèi)容