系列回顧:
1.RNA velocity分析練習(xí)(一)文件下載以及預(yù)處理
2.RNA velocity分析練習(xí)(二)軟件安裝
3.RNA velocity分析練習(xí)(三)生成loom文件
這一篇文章就正式開(kāi)始進(jìn)行RNA velocity的練習(xí)。本篇文章代碼參考:
velocyto簡(jiǎn)單使用
(一)加載數(shù)據(jù)并過(guò)濾
> input_loom <- "MyTissue.loom"
> adata <- read.loom.matrices(input_loom)
# Use the spliced data as the input data
> spliced_adata <- adata$spliced
# seperate into E12.5 and E13.5
# 因?yàn)槲恼吕镏挥昧薊12.5天的數(shù)據(jù)來(lái)進(jìn)行RNA velocity的分析,所以這里我也把它們區(qū)分一下
> E12.5_data <- spliced_adata[,1:384]
> E13.5_data <- spliced_adata[,385:768]
看一下spliced mRNA count的分布(下面的分析都用E12.5數(shù)據(jù)):
# spliced expression magnitude distribution across genes:
> hist(log10(rowSums(E12.5_data)+1),col='wheat',xlab='log10[ number of reads + 1]',main='number of reads per gene')

過(guò)濾前看一下多少基因:
> dim(E12.5_data)
[1] 26610 384 #一共是26610個(gè)基因
把在所有細(xì)胞里都不表達(dá)的基因刪掉:
> E12.5_data <- E12.5_data[as.numeric(rowSums(E12.5_data)) != 0,]
> dim(E12.5_data)
[1] 23803 384 #還剩23803個(gè)基因
保留至少在10個(gè)以上的細(xì)胞里表達(dá)的基因:
> fdata_E12.5 = E12.5_data[apply(E12.5_data,1,function(x) sum(x>1) > 10),]
> dim(fdata_E12.5)
[1] 13767 384 #這里還剩13767個(gè)基因
現(xiàn)在再看一下reads的分布:
> hist(log10(rowSums(fdata_E12.5)+1),col='wheat',xlab='log10[ number of reads + 1]',main='number of reads per gene')

(二)標(biāo)準(zhǔn)化
上面我們?nèi)〉氖莝pliced的數(shù)據(jù)(我們平時(shí)做表達(dá)分析的時(shí)候,實(shí)際上用的都是spliced),現(xiàn)在用它來(lái)進(jìn)行數(shù)據(jù)的標(biāo)準(zhǔn)化,以便后面的細(xì)胞聚類(lèi)。
> library(pagoda2)
#這里要先過(guò)濾一下基因名,如果不過(guò)濾的話,后面標(biāo)準(zhǔn)化步驟會(huì)報(bào)錯(cuò),告訴你基因名有重復(fù),我也不知道怎么會(huì)有重復(fù),很奇怪
> fdata_E12.5 <- fdata_E12.5[duplicated(rownames(fdata_E12.5))=="FALSE",]
> dim(fdata_E12.5)
[1] 13765 384 #這里只少了2個(gè)基因,說(shuō)明有2個(gè)基因重復(fù)了
> fdata_E12.5_nor <- Pagoda2$new(fdata_E12.5,modelType='plain',trim=10,log.scale=T)
384 cells, 13765 genes; normalizing ... using plain model winsorizing ... log scale ... done.
#可視化
> fdata_E12.5_nor$adjustVariance(plot=T,do.par=T,gam.k=10)
calculating variance fit ... using gam 2396 overdispersed genes ... 2396persisting ... done.

(三)細(xì)胞聚類(lèi)
# clustering cells,embedding analysis
> fdata_E12.5_nor$calculatePcaReduction(nPcs=100,n.odgenes=3e3,maxit=300)
running PCA using 3000 OD genes .... done
> fdata_E12.5_nor$makeKnnGraph(k=30,type='PCA',center=T,distance='cosine')
creating space of type angular done
adding data ... done
building index ... done
querying ... done
> fdata_E12.5_nor$getKnnClusters(method=multilevel.community,type='PCA',name='multilevel')
> fdata_E12.5_nor$getEmbedding(type='PCA',embeddingType='tSNE',perplexity=50,verbose=T)
#make figure(clustering)
> par(mfrow=c(1,2))
> fdata_E12.5_nor$plotEmbedding(type='PCA',embeddingType='tSNE',show.legend=F,mark.clusters=T,min.group.size=10,shuffle.colors=F,mark.cluster.cex=1,alpha=0.7,main='cell clusters')
> fdata_E12.5_nor$plotEmbedding(type='PCA',embeddingType='tSNE',colors=fdata_E12.5_nor$depth,main='depth')

(四)RNA velocity分析
現(xiàn)在我們需要把spliced和unspliced的數(shù)據(jù)導(dǎo)出來(lái),和我們上面得到的標(biāo)準(zhǔn)化后的數(shù)據(jù)做個(gè)交集。然后把RNA velocity嵌合進(jìn)t-SNE的圖里。
#prepare spliced data and unspliced data
> emat <- adata$spliced
> nmat <- adata$unspliced
# filtered cells according to fdata_E12.5_nor$counts
> emat <- emat[,rownames(fdata_E12.5_nor$counts)]
> nmat <- nmat[,rownames(fdata_E12.5_nor$counts)]
#lable the clustered data對(duì)分類(lèi)數(shù)據(jù)進(jìn)行標(biāo)記
> cluster.label <- fdata_E12.5_nor$clusters$PCA$multilevel # take the cluster factor that was calculated by fdata_E12.5_nor
> cell.colors <- pagoda2:::fac2col(cluster.label)
# take embedding from fdata_E12.5_nor
> emb <- fdata_E12.5_nor$embeddings$PCA$tSNE
# Calculate the distance between cells
> cell.dist <- as.dist(1-armaCor(t(fdata_E12.5_nor$reductions$PCA)))
#基于最小平均表達(dá)量篩選基因(至少在一個(gè)簇中),輸出產(chǎn)生的有效基因數(shù)
> emat <- filter.genes.by.cluster.expression(emat,cluster.label,min.max.cluster.average = 0.2)
> nmat <- filter.genes.by.cluster.expression(nmat,cluster.label,min.max.cluster.average = 0.05)
> length(intersect(rownames(emat),rownames(nmat)))
[1] 13016
# Calculate RNA velocity
> fit.quantile <- 0.02
> rvel.cd <- gene.relative.velocity.estimates(emat,nmat,deltaT=1,kCells=25,cell.dist=cell.dist,fit.quantile=fit.quantile)
calculating cell knn ... done
calculating convolved matrices ... done
fitting gamma coefficients ... done. succesfful fit for 13016 genes
filtered out 2115 out of 13016 genes due to low nmat-emat correlation
filtered out 1543 out of 10901 genes due to low nmat-emat slope
calculating RNA velocity shift ... done
calculating extrapolated cell state ... done
可視化:
> show.velocity.on.embedding.cor(emb,rvel.cd,n=200,scale='sqrt',cell.colors=ac(cell.colors,alpha=0.5),cex=0.8,arrow.scale=3,show.grid.flow=TRUE,min.grid.cell.mass=0.5,grid.n=40,arrow.lwd=1,do.par=F,cell.border.alpha = 0.1)
delta projections ... sqrt knn ... transition probs ... done
calculating arrows ... done
grid estimates ... grid.sd= 0.3592013 min.arrow.size= 0.007184026 max.grid.arrow.length= 0.02704895 done

可視化指定的基因:
> gene <- "Serpine2"
> gene.relative.velocity.estimates(emat,nmat,deltaT=1,kCells = 100,kGenes=1,fit.quantile=fit.quantile,cell.emb=emb,cell.colors=cell.colors,cell.dist=cell.dist,show.gene=gene,old.fit=rvel.cd,do.par=T)
> gene <- "Chga"
> gene.relative.velocity.estimates(emat,nmat,deltaT=1,kCells = 100,kGenes=1,fit.quantile=fit.quantile,cell.emb=emb,cell.colors=cell.colors,cell.dist=cell.dist,show.gene=gene,old.fit=rvel.cd,do.par=T)


(五)與文獻(xiàn)原圖進(jìn)行比較
到這里就基本完成RNA velocity的分析了。然后我想把自己的圖跟文獻(xiàn)里的對(duì)比一下,于是就放在了PPT里做了一張組合圖:

可以看到,原文里的細(xì)胞分化軌跡是“藍(lán)>紅(黃)>綠”的一個(gè)順序,我的是“綠色> 粉(紅)>藍(lán)>黃”,顏色不要緊,主要是要有一個(gè)線性的方向。所以這里文獻(xiàn)里的SCP細(xì)胞在我的圖里就是綠色群,Differentiation bridge是紅色/粉色群,chromaffin就是我的黃色的群(或者藍(lán)色?)
原文里展示的兩個(gè)基因,Serpine2主要表達(dá)在SCP群里(b左),unspliced mRNA主要出現(xiàn)在SCP群里(b中,右),也就是說(shuō)在SCP里表達(dá)量很高,因?yàn)橛泻芏嘈罗D(zhuǎn)錄出來(lái)的unspliced mRNA;Chga基因主要表達(dá)在chromaffin群里,也就是分化末端里(c左),這個(gè)基因的unspliced mRNA主要在bridge群和一部分的chromaffin群里(c中,右),說(shuō)明這個(gè)基因在分化過(guò)渡階段和分化終端表達(dá)量高。大體上我的結(jié)果和文章里的還是差不多的。說(shuō)明分析過(guò)程沒(méi)有太大的問(wèn)題。