單細(xì)胞轉(zhuǎn)錄組分析---seurat 1例子

==================安裝seurant================

install.packages('Seurat')


=====測(cè)試?yán)樱簆bmc3k_filtered_gene_bc_matrices.tar.gz==

library(Seurat)

library(dplyr)

library(Matrix)

pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

dense.size <- object.size(x = as.matrix(x = pbmc.data))

sparse.size <- object.size(x = pbmc.data)

dense.size/sparse.size

# Initialize the Seurat objectwiththeraw(non-normalized data).

Keep all# genes expressed in>=3 cells(~0.1%of the data).Keep all cells with at# least 200 detected genes

pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)

pbmc

? ? ? ? ?#An object of class Seurat

? ? ? ? #13714 features across 2700 samples within 1 assay

? ? ? ? #Active assay: RNA (13714 features, 0 variable features)

=========QC======================

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern ="^MT-")

# Visualize QC metrics as a violin plot

VlnPlot(pbmc, features = c("nFeature_RNA","nCount_RNA","percent.mt"), ncol =3)

#nFeature_RNA:代表的是在該細(xì)胞中共檢測(cè)到的表達(dá)量大于0的基因個(gè)數(shù),nCount_RNA:代表的是該細(xì)胞中所有基因的表達(dá)量之和,percent.mt:代表的是線粒體基因表達(dá)量的百分比,通過(guò)小提琴圖來(lái)展示

圖中每個(gè)點(diǎn)代表的是一個(gè)細(xì)胞,反應(yīng)的是對(duì)應(yīng)特征在所有細(xì)胞的一個(gè)分布情況。通過(guò)觀察上圖,我們可以確定一個(gè)大概的篩選范圍。以nFeature_RNA為例,可以看到數(shù)值在2000以上的細(xì)胞是非常少的,可以看做是離群值,所以在篩選時(shí),如果一個(gè)細(xì)胞中檢測(cè)到的基因個(gè)數(shù)大于2000,就可以進(jìn)行過(guò)濾。

在過(guò)濾閾值時(shí),我們還需要考慮一個(gè)因素,就是這3個(gè)指標(biāo)之間的相互關(guān)系,可以通過(guò)以下方式得到:

plot1 <- FeatureScatter(pbmc, feature1 ="nCount_RNA", feature2 ="percent.mt")

plot2 <- FeatureScatter(pbmc, feature1 ="nCount_RNA", feature2 ="nFeature_RNA")

plot1 + plot2


以nCount和gene之間的關(guān)系圖為例,可以看到非常明顯的一個(gè)相關(guān)性,當(dāng)gene個(gè)數(shù)為2000時(shí)對(duì)應(yīng)的umi在10000左右,所以在設(shè)定閾值,我們想過(guò)濾掉gene大于2000的細(xì)胞,此時(shí)umi的閾值就應(yīng)該設(shè)置在10000左右。


結(jié)合以上兩種圖表,判斷出一個(gè)合適的閾值之后,就可以進(jìn)行過(guò)濾了,代碼如下: (其實(shí)我沒(méi)看懂這參數(shù)是怎么選擇的,總覺(jué)得2000就可以了)

pbmc <- subset(pbmc, subset = nFeature_RNA >200& nFeature_RNA <2500& percent.mt <5)

===============normalize=============

預(yù)處理之后,首先進(jìn)行歸一化:

pbmc <- NormalizeData(pbmc, normalization.method ="LogNormalize", scale.factor =10000)

默認(rèn)采用LogNormalize歸一化算法,首先將原始的表達(dá)量轉(zhuǎn)換成相對(duì)豐度,然后乘以scale.factor的值,在取對(duì)數(shù)。

========feature selection=============

歸一化之后,Seurat提取那些在細(xì)胞間變異系數(shù)較大的基因用于下游分析:

pbmc <- FindVariableFeatures(pbmc, selection.method ="vst", nfeatures =2000)

默認(rèn)的將返回2000個(gè)高變異的基因。

top10 <- head(VariableFeatures(pbmc), 10)? ? ? //返回最高變異的10個(gè)基因

plot1 <- VariableFeaturePlot(pbmc)

plot1

從圖中可以看出挑選的2000個(gè)高變異的基因明顯分布和誤差比低變異度的高。

plot2 <- LabelPoints(plot = plot1, points = top10)

plot2? ? ?//加上前10個(gè)高變異基因的label更加清晰


============scale=================

scale的目的,官方是如此描述的:

Shifts the expression of each gene, so that the mean expression across cells is 0? ?//保持同一個(gè)基因在不同的cell之間 均值為0

Scales the expression of each gene, so that the variance across cells is 1? ?//保持同一個(gè)基因在不同cell之間的方差為1? ?

This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate? ? //這樣避免高表達(dá)基因過(guò)度占據(jù)

注:我還在想normalization和scale的區(qū)別

all.genes <- rownames(pbmc)

pbmc <- ScaleData(pbmc, features = all.genes)

=============PCA================

聚類分析用于識(shí)別細(xì)胞亞型,在Seurat中,不是直接對(duì)所有細(xì)胞進(jìn)行聚類分析,而是首先進(jìn)行PCA主成分分析,然后挑選貢獻(xiàn)量最大的幾個(gè)主成分,用挑選出的主成分的值來(lái)進(jìn)行聚類分析

對(duì)PCA分析結(jié)果可以進(jìn)行一系列的可視化: print, VizDimLoadings, DimPlot, and DimHeatmap

print(pbmc[["pca"]], dims =1:5, nfeatures =5)

## PC_ 1

## Positive:? CST3, TYROBP, LST1, AIF1, FTL

## Negative:? MALAT1, LTB, IL32, IL7R, CD2

## PC_ 2

## Positive:? CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1

## Negative:? NKG7, PRF1, CST7, GZMB, GZMA

## PC_ 3

## Positive:? HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1

## Negative:? PPBP, PF4, SDPR, SPARC, GNG11

## PC_ 4

## Positive:? HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1

## Negative:? VIM, IL7R, S100A6, IL32, S100A8

## PC_ 5

## Positive:? GZMB, NKG7, S100A8, FGFBP2, GNLY

## Negative:? LTB, IL7R, CKB, VIM, MS4A7


VizDimLoadings(pbmc, dims =1:2, reduction ="pca")


DimPlot(pbmc, reduction ="pca")


DimHeatmap(pbmc, dims =1, cells =500, balanced =TRUE)


DimHeatmap(pbmc, dims =1:15, cells =500, balanced =TRUE)


=======找到有統(tǒng)計(jì)學(xué)顯著性的主成分========

主成分分析結(jié)束后需要確定哪些主成分所代表的基因可以進(jìn)入下游分析,這里可以使用JackStraw做重抽樣分析。用JackStrawPlot可視化看看哪些主成分可以進(jìn)行下游分析

pbmc <- JackStraw(pbmc, num.replicate =100)

pbmc <- ScoreJackStraw(pbmc, dims =1:20)

JackStrawPlot(pbmc, dims =1:15)


In this example, we can observe an 'elbow' around PC9-10, suggesting that the majority of true signal is captured in the first 10 PCs.??

當(dāng)然,也可以用最經(jīng)典的碎石圖來(lái)確定主成分。

ElbowPlot(pbmc)


//在這個(gè)例子中,我們可以觀察到在PC9-10附近趨于平緩,這表明大部分真實(shí)信號(hào)是在前10個(gè)PCs中捕獲的

這個(gè)確定主成分是非常有挑戰(zhàn)性的: - The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example. - The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff. - The third is a heuristic that is commonly used, and can be calculated instantly.

在本例子里面,幾種方法結(jié)果差異不大,可以在PC7~10直接挑選。

======cluster the cell==============

pbmc <- FindNeighbors(pbmc, dims =1:10)

pbmc <- FindClusters(pbmc, resolution =0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

## Number of nodes: 2638

## Number of edges: 96033

## Running Louvain algorithm...

## Maximum modularity in 10 random starts: 0.8720

## Number of communities: 9

## Elapsed time: 0 seconds

head(Idents(pbmc),5)

pbmc <- RunUMAP(pbmc, dims =1:10)

DimPlot(pbmc, reduction ="umap")


====鑒定cluster之間差異基因==========

cluster1.markers <- FindMarkers(pbmc, ident.1 =1, min.pct =0.25)? ?//鑒定cluster 1中的marker gene

head(cluster1.markers, n =5)


cluster5.markers <- FindMarkers(pbmc, ident.1 =5, ident.2 = c(0,3), min.pct =0.25)

//鑒定cluster 5區(qū)別于cluster 0和3的marker gene

head(cluster5.markers, n =5)



pbmc.markers <- FindAllMarkers(pbmc, only.pos =TRUE, min.pct =0.25, logfc.threshold =0.25)

//find markers for every cluster compared to all remaining cells,report only the positive ones

pbmc.markers %>% group_by(cluster) %>% top_n(n =2, wt = avg_logFC)


同時(shí),該包提供了一系列可視化方法來(lái)檢查差異分析的結(jié)果的可靠性:

VlnPlot (shows expression probability distributions across clusters)

FeaturePlot (visualizes gene expression on a tSNE or PCA plot) are our most commonly used visualizations

RidgePlot, CellScatter, and DotPlot


VlnPlot(pbmc, features = c("MS4A1","CD79A"))


VlnPlot(pbmc, features = c("NKG7","PF4"), slot ="counts", log =TRUE)


FeaturePlot(pbmc, features =c("MS4A1","GNLY","CD3E","CD14","FCER1A","FCGR3A","LYZ","PPBP","CD8A"))


top20 <- pbmc.markers %>% group_by(cluster) %>% top_n(n =20, wt = avg_logFC)

DoHeatmap(pbmc, features = top20$gene) + NoLegend()

======給每個(gè)cluster注釋=========

Assigning cell type identity to clusters

這個(gè)主要取決于生物學(xué)背景知識(shí):


new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")

names(new.cluster.ids) <- levels(pbmc)

pbmc <- RenameIdents(pbmc, new.cluster.ids)

DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()


saveRDS(pbmc, file = "pbmc3k_final.rds")

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

友情鏈接更多精彩內(nèi)容