==================安裝seurant================
install.packages('Seurat')
=====測(cè)試?yán)樱簆bmc3k_filtered_gene_bc_matrices.tar.gz==
library(Seurat)
library(dplyr)
library(Matrix)
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")
dense.size <- object.size(x = as.matrix(x = pbmc.data))
sparse.size <- object.size(x = pbmc.data)
dense.size/sparse.size
# Initialize the Seurat objectwiththeraw(non-normalized data).
Keep all# genes expressed in>=3 cells(~0.1%of the data).Keep all cells with at# least 200 detected genes
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
? ? ? ? ?#An object of class Seurat
? ? ? ? #13714 features across 2700 samples within 1 assay
? ? ? ? #Active assay: RNA (13714 features, 0 variable features)
=========QC======================
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern ="^MT-")
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA","nCount_RNA","percent.mt"), ncol =3)
#nFeature_RNA:代表的是在該細(xì)胞中共檢測(cè)到的表達(dá)量大于0的基因個(gè)數(shù),nCount_RNA:代表的是該細(xì)胞中所有基因的表達(dá)量之和,percent.mt:代表的是線粒體基因表達(dá)量的百分比,通過(guò)小提琴圖來(lái)展示

圖中每個(gè)點(diǎn)代表的是一個(gè)細(xì)胞,反應(yīng)的是對(duì)應(yīng)特征在所有細(xì)胞的一個(gè)分布情況。通過(guò)觀察上圖,我們可以確定一個(gè)大概的篩選范圍。以nFeature_RNA為例,可以看到數(shù)值在2000以上的細(xì)胞是非常少的,可以看做是離群值,所以在篩選時(shí),如果一個(gè)細(xì)胞中檢測(cè)到的基因個(gè)數(shù)大于2000,就可以進(jìn)行過(guò)濾。
在過(guò)濾閾值時(shí),我們還需要考慮一個(gè)因素,就是這3個(gè)指標(biāo)之間的相互關(guān)系,可以通過(guò)以下方式得到:
plot1 <- FeatureScatter(pbmc, feature1 ="nCount_RNA", feature2 ="percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 ="nCount_RNA", feature2 ="nFeature_RNA")
plot1 + plot2

以nCount和gene之間的關(guān)系圖為例,可以看到非常明顯的一個(gè)相關(guān)性,當(dāng)gene個(gè)數(shù)為2000時(shí)對(duì)應(yīng)的umi在10000左右,所以在設(shè)定閾值,我們想過(guò)濾掉gene大于2000的細(xì)胞,此時(shí)umi的閾值就應(yīng)該設(shè)置在10000左右。
結(jié)合以上兩種圖表,判斷出一個(gè)合適的閾值之后,就可以進(jìn)行過(guò)濾了,代碼如下: (其實(shí)我沒(méi)看懂這參數(shù)是怎么選擇的,總覺(jué)得2000就可以了)
pbmc <- subset(pbmc, subset = nFeature_RNA >200& nFeature_RNA <2500& percent.mt <5)
===============normalize=============
預(yù)處理之后,首先進(jìn)行歸一化:
pbmc <- NormalizeData(pbmc, normalization.method ="LogNormalize", scale.factor =10000)
默認(rèn)采用LogNormalize歸一化算法,首先將原始的表達(dá)量轉(zhuǎn)換成相對(duì)豐度,然后乘以scale.factor的值,在取對(duì)數(shù)。
========feature selection=============
歸一化之后,Seurat提取那些在細(xì)胞間變異系數(shù)較大的基因用于下游分析:
pbmc <- FindVariableFeatures(pbmc, selection.method ="vst", nfeatures =2000)
默認(rèn)的將返回2000個(gè)高變異的基因。
top10 <- head(VariableFeatures(pbmc), 10)? ? ? //返回最高變異的10個(gè)基因
plot1 <- VariableFeaturePlot(pbmc)
plot1

從圖中可以看出挑選的2000個(gè)高變異的基因明顯分布和誤差比低變異度的高。
plot2 <- LabelPoints(plot = plot1, points = top10)
plot2? ? ?//加上前10個(gè)高變異基因的label更加清晰

============scale=================
scale的目的,官方是如此描述的:
Shifts the expression of each gene, so that the mean expression across cells is 0? ?//保持同一個(gè)基因在不同的cell之間 均值為0
Scales the expression of each gene, so that the variance across cells is 1? ?//保持同一個(gè)基因在不同cell之間的方差為1? ?
This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate? ? //這樣避免高表達(dá)基因過(guò)度占據(jù)
注:我還在想normalization和scale的區(qū)別
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
=============PCA================
聚類分析用于識(shí)別細(xì)胞亞型,在Seurat中,不是直接對(duì)所有細(xì)胞進(jìn)行聚類分析,而是首先進(jìn)行PCA主成分分析,然后挑選貢獻(xiàn)量最大的幾個(gè)主成分,用挑選出的主成分的值來(lái)進(jìn)行聚類分析
對(duì)PCA分析結(jié)果可以進(jìn)行一系列的可視化: print, VizDimLoadings, DimPlot, and DimHeatmap
print(pbmc[["pca"]], dims =1:5, nfeatures =5)
## PC_ 1
## Positive:? CST3, TYROBP, LST1, AIF1, FTL
## Negative:? MALAT1, LTB, IL32, IL7R, CD2
## PC_ 2
## Positive:? CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
## Negative:? NKG7, PRF1, CST7, GZMB, GZMA
## PC_ 3
## Positive:? HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1
## Negative:? PPBP, PF4, SDPR, SPARC, GNG11
## PC_ 4
## Positive:? HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1
## Negative:? VIM, IL7R, S100A6, IL32, S100A8
## PC_ 5
## Positive:? GZMB, NKG7, S100A8, FGFBP2, GNLY
## Negative:? LTB, IL7R, CKB, VIM, MS4A7
VizDimLoadings(pbmc, dims =1:2, reduction ="pca")

DimPlot(pbmc, reduction ="pca")

DimHeatmap(pbmc, dims =1, cells =500, balanced =TRUE)

DimHeatmap(pbmc, dims =1:15, cells =500, balanced =TRUE)

=======找到有統(tǒng)計(jì)學(xué)顯著性的主成分========
主成分分析結(jié)束后需要確定哪些主成分所代表的基因可以進(jìn)入下游分析,這里可以使用JackStraw做重抽樣分析。用JackStrawPlot可視化看看哪些主成分可以進(jìn)行下游分析
pbmc <- JackStraw(pbmc, num.replicate =100)
pbmc <- ScoreJackStraw(pbmc, dims =1:20)
JackStrawPlot(pbmc, dims =1:15)

In this example, we can observe an 'elbow' around PC9-10, suggesting that the majority of true signal is captured in the first 10 PCs.??
當(dāng)然,也可以用最經(jīng)典的碎石圖來(lái)確定主成分。
ElbowPlot(pbmc)

//在這個(gè)例子中,我們可以觀察到在PC9-10附近趨于平緩,這表明大部分真實(shí)信號(hào)是在前10個(gè)PCs中捕獲的
這個(gè)確定主成分是非常有挑戰(zhàn)性的: - The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example. - The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff. - The third is a heuristic that is commonly used, and can be calculated instantly.
在本例子里面,幾種方法結(jié)果差異不大,可以在PC7~10直接挑選。
======cluster the cell==============
pbmc <- FindNeighbors(pbmc, dims =1:10)
pbmc <- FindClusters(pbmc, resolution =0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## Number of nodes: 2638
## Number of edges: 96033
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8720
## Number of communities: 9
## Elapsed time: 0 seconds
head(Idents(pbmc),5)
pbmc <- RunUMAP(pbmc, dims =1:10)
DimPlot(pbmc, reduction ="umap")

====鑒定cluster之間差異基因==========
cluster1.markers <- FindMarkers(pbmc, ident.1 =1, min.pct =0.25)? ?//鑒定cluster 1中的marker gene
head(cluster1.markers, n =5)

cluster5.markers <- FindMarkers(pbmc, ident.1 =5, ident.2 = c(0,3), min.pct =0.25)
//鑒定cluster 5區(qū)別于cluster 0和3的marker gene
head(cluster5.markers, n =5)

pbmc.markers <- FindAllMarkers(pbmc, only.pos =TRUE, min.pct =0.25, logfc.threshold =0.25)
//find markers for every cluster compared to all remaining cells,report only the positive ones
pbmc.markers %>% group_by(cluster) %>% top_n(n =2, wt = avg_logFC)

同時(shí),該包提供了一系列可視化方法來(lái)檢查差異分析的結(jié)果的可靠性:
VlnPlot (shows expression probability distributions across clusters)
FeaturePlot (visualizes gene expression on a tSNE or PCA plot) are our most commonly used visualizations
RidgePlot, CellScatter, and DotPlot
VlnPlot(pbmc, features = c("MS4A1","CD79A"))

VlnPlot(pbmc, features = c("NKG7","PF4"), slot ="counts", log =TRUE)

FeaturePlot(pbmc, features =c("MS4A1","GNLY","CD3E","CD14","FCER1A","FCGR3A","LYZ","PPBP","CD8A"))

top20 <- pbmc.markers %>% group_by(cluster) %>% top_n(n =20, wt = avg_logFC)
DoHeatmap(pbmc, features = top20$gene) + NoLegend()

======給每個(gè)cluster注釋=========
Assigning cell type identity to clusters
這個(gè)主要取決于生物學(xué)背景知識(shí):

new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

saveRDS(pbmc, file = "pbmc3k_final.rds")