monocle實(shí)操
1 軟件安裝
#所需軟件
tidyverse
monocle(建議使用2.22.0,2.4為測試版)
Seurat
ggsci
1.1 conda安裝——Linux
https://anaconda.org/bioconda/bioconductor-monocle
1.2 BiocManager安裝——R
https://bioconductor.org/packages/release/bioc/html/monocle.html
2 環(huán)境激活及軟件包加載
conda activate monocle
library(tidyverse)
library(monocle)
library(Seurat)
library(ggsci)
3 利用monocle2構(gòu)建S4對象(monocle2指定的數(shù)據(jù)結(jié)構(gòu))
#測試數(shù)據(jù)
pro <- readRDS("/mnt/sdd/singleron_training_class/resources/monocle/pbmc_add_cluster.rds")
3.1 表達(dá)矩陣——行名是基因,列名是細(xì)胞編號,建議使用count矩陣
count <- GetAssayData(object = pro[["RNA"]], slot = "counts")
3.2 基因注釋信息——第一列是基因編號,第一列列名必須為“geneshortname”,隨著分析軟件會(huì)自動(dòng)為該數(shù)據(jù)框添加列信息
fdata <- data.frame(gene_short_name = row.names(count), row.names = row.names(count))
3.3 細(xì)胞的表型信息phenoData: 第一列是細(xì)胞編號(barcode),其他列是細(xì)胞的相關(guān)信息(樣本名稱、細(xì)胞類型等)
pdata <- pro@meta.data
3.4 構(gòu)建monocle對象
my_cds <- newCellDataSet(count,
featureData = new("AnnotatedDataFrame", data =fdata),
phenoData = new("AnnotatedDataFrame", data = pdata),
lowerDetectionLimit = 0.5,
expressionFamily = negbinomial.size())
1、negbinomial.size()/ negbinomial()參數(shù)使用UMI表達(dá)矩陣,negbinomial()結(jié)果更加準(zhǔn)確,但耗時(shí);稀疏矩陣使用negbinomial.size(); 2、tobit():適用于輸入表達(dá)矩陣為FPKM/ TPM,構(gòu)建monocle對象時(shí),會(huì)自動(dòng)取log; 3、gaussianff():適用于輸入表達(dá)矩陣為取log后的FPKM/ TPM
4 標(biāo)準(zhǔn)化、scale及篩選
#標(biāo)準(zhǔn)化細(xì)胞之間的mRNA的差異
my_cds <- estimateSizeFactors(my_cds)
#離散度值可以幫助我們進(jìn)行后續(xù)的差異分析
my_cds <- estimateDispersions(my_cds)
#在fData(my_cds)中添加“num_cell_expressed”列,可用于過濾細(xì)胞
my_cds <- detectGenes(my_cds, min_expr = 0.1)
#像pData(my_cds)中添加“UMI”列
pData(my_cds)$UMI <- Matrix::colSums(exprs(my_cds))
5 篩選高變基因
為啥需要此步? monocle2軌跡構(gòu)建分為無監(jiān)督與半監(jiān)督過程, 無監(jiān)督:軟件自行在數(shù)據(jù)中選擇發(fā)育差異表達(dá)、cluster差異表達(dá)或離散程度較高的基因進(jìn)行軌跡構(gòu)建,不受人為干預(yù); 半監(jiān)督:利用marker基因定義生物學(xué)進(jìn)程,可以使用先驗(yàn)知識輔助構(gòu)建軌跡;此處我們使用高變基因輔助軌跡構(gòu)建
pro <- FindVariableFeatures(pro, selection.method = "vst", nfeatures = 2000)
ordergene <- head(VariableFeatures(pro), 2000)
6 標(biāo)記排序基因——將高變基因嵌入cds,后續(xù)分析對此依賴
my_cds <- setOrderingFilter(my_cds,ordergene)
p <- plot_ordering_genes(my_cds)
pdf("/mnt/sdd/singleron_training_class/resources/monocle/ordering_genes.pdf")
print(p)
dev.off()
黑色點(diǎn)表示軌跡構(gòu)建使用的基因,灰色為背景基因,紅線為基因大小和離散程度分布趨勢
7 降維
my_cds <- reduceDimension(my_cds, reduction_method = "DDRTree")
8 擬時(shí)間軸軌跡構(gòu)建和在擬時(shí)間內(nèi)排列細(xì)胞
my_cds <- orderCells(my_cds)
9 可視化
9.1 cluster在軌跡上的展示
p <- plot_cell_trajectory(my_cds, color_by = "cluster")
pdf("/mnt/sdd/singleron_training_class/resources/monocle/ti_cluster.pdf")
print(p)
dev.off()
9.2 state在軌跡上的展示
p <- plot_cell_trajectory(my_cds, color_by = "State")
pdf("/mnt/sdd/singleron_training_class/resources/monocle/ti_satate.pdf")
print(p)
dev.off()
9.3 時(shí)序在軌跡上的展示
p <- plot_cell_trajectory(my_cds, color_by = "Pseudotime")
pdf("/mnt/sdd/singleron_training_class/resources/monocle/ti_pseudotime.pdf")
print(p)
dev.off()
9.3 分面展示
p <- plot_cell_trajectory(my_cds, color_by = "State") +
facet_wrap(~State)
pdf("/mnt/sdd/singleron_training_class/resources/monocle/ti_facet_satate.pdf")
print(p)
dev.off()
#p <- plot_cell_trajectory(my_cds, color_by = "cluster") +
#facet_wrap(~cluster)
#p <- plot_cell_trajectory(my_cds, color_by = "sample") +
#facet_wrap(~sample)
#p <- plot_cell_trajectory(my_cds, color_by = "group") +
#facet_wrap(~group)
monocle繪圖函數(shù)基于ggplot2,所以可以使用ggplot語法對圖片進(jìn)行個(gè)性化修改:顏色、點(diǎn)大小、圖例等
10 monocle對象
#使用'plot_cell_trajectory()'函數(shù)繪圖時(shí),'color_by'參數(shù)指定的字符必須在數(shù)據(jù)框my_cds@phenoData@data中,如果沒有需要添加
11 關(guān)注基因在軌跡圖譜上的表達(dá)可視化
pData(my_cds)[['NOC2L']] = log2(exprs(my_cds)['NOC2L',] + 1)
p <-plot_cell_trajectory(my_cds, color_by = 'NOC2L') + scale_color_gsea()
pdf("/mnt/sdd/singleron_training_class/resources/monocle/ti_gene.pdf")
print(p)
dev.off()
細(xì)胞功能富集分析
1 簡介

1.1 富集分析的基本定義
在進(jìn)行單細(xì)胞分析時(shí)我們往往會(huì)得到許多差異基因列表,而在列表中,我們往往很難得到一些有用的信息。
富集分析:基于差異基因列表,結(jié)合相應(yīng)數(shù)據(jù)庫信息和一定的統(tǒng)計(jì)學(xué)方法,將差異基因信息整合轉(zhuǎn)化為具有生物學(xué)意義的通路信息。
常用分析:GO,KEGG,GSEA,GSVA等。其中GO和KEGG是兩個(gè)數(shù)據(jù)庫,里面有每個(gè)基因相關(guān)的功能信息,而富集分析就是把這些功能進(jìn)行進(jìn)行整合計(jì)算的算法。
1.2 富集分析常用方法
ORA: Over-expression Analysis(過表達(dá)分析)
FCS: Functional Class Scoring(功能集打分)
-
PT: Pathway Topology(通路拓?fù)鋵W(xué))
image.png
1.2.1 ORA(Over-expression Analysis):
又稱為“2X2法”
具體步驟:
1.獲得一組感興趣的基因(一般是差異表達(dá)基因)。
2.給定的基因列表與某個(gè)通路中的基因集做交集,找出其中共同的基因并進(jìn)行計(jì)數(shù)(統(tǒng)計(jì)值)。
3.利用統(tǒng)計(jì)檢驗(yàn)的方式來評估觀察的計(jì)數(shù)值是否顯著高于隨機(jī),即待測功能集在基因列表中是否顯著富集。
(最常用的統(tǒng)計(jì)檢驗(yàn)包括:超幾何分布、卡方檢驗(yàn)、二項(xiàng)分布。)
優(yōu)點(diǎn):
統(tǒng)計(jì)學(xué)理論較為完備,結(jié)果穩(wěn)健可。
缺點(diǎn):
1.人為因素:設(shè)定閾值獲取輸入基因集。
2.靈敏性低。
3.假設(shè)基因獨(dú)立,忽視通路內(nèi)部生物學(xué)意義的影響。
4.假設(shè)通路間獨(dú)立。
1.2.2 FCS(Functional Class Scoring):
1.對基因表達(dá)譜中基因表達(dá)水平差異值進(jìn)行打分或者排序(或者輸入排序過的基因表達(dá)譜)。
2.依據(jù)特定統(tǒng)計(jì)模型,將待測功能基因集中基因的分?jǐn)?shù)或統(tǒng)計(jì)值轉(zhuǎn)換為基因集的分?jǐn)?shù)或統(tǒng)計(jì)值。
3.通過多次隨機(jī)抽樣獲得不同的抽樣條件下待測功能基因集分?jǐn)?shù)的背景分布,并基于背景分布檢驗(yàn)實(shí)際觀測值的顯著水平,以此判斷待測基因集在實(shí)驗(yàn)對照條件下是否發(fā)生了統(tǒng)計(jì)學(xué)上的顯著變化。
優(yōu)點(diǎn):
相較于ORA,將基因表達(dá)值信息加入考慮范圍,并以待測基因集為對象進(jìn)行檢驗(yàn),更靈敏。
缺點(diǎn):
與ORA類似,假設(shè)通路與基因獨(dú)立,忽略了基因的生物學(xué)屬性與基因間復(fù)雜的相互作用關(guān)系。
1.2.3 PT(Pathway Topology):
把基因在通路中的位置(上下游關(guān)系),與其他基因的連接度和調(diào)控作用類型等信息綜合在一起來評估每個(gè)基因?qū)ν返呢暙I(xiàn)并給予相應(yīng)的權(quán)重,然后再把基因的權(quán)重整合入功能富集分析。不同的PT方法在具體的權(quán)重打分時(shí),采用了不同的方式。
優(yōu)點(diǎn):
對于研究較完善、拓?fù)浣Y(jié)構(gòu)完整的通路,基于PT的基因功能富集算法會(huì)有更強(qiáng)的顯著性。
缺點(diǎn):
對于通路拓?fù)浣Y(jié)構(gòu)存在依賴性,該類方法對于研究較少、信息不完善的通路穩(wěn)健性較差,因此目前通路注釋的不完善也是限制基于PT的基因功能富集分析方法進(jìn)一步發(fā)展的重要因素。
2 實(shí)操部分:
2.1 clusterprofiler安裝方法
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
#BiocManager::install("clusterProfiler")
#BiocManager::install("org.Hs.eg.db")

2.2 獲取差異基因
利用官方數(shù)據(jù)pbmc3k_final.rds,計(jì)算CD14+ Mono和FCGR3A+ Mono的差異基因
pbmc <- readRDS("D:/singleron/項(xiàng)目/其他/2021生信培訓(xùn)/富集分析/pbmc3k_final.rds")
pbmc@meta.data$celltype <- pbmc@active.ident
#獲取細(xì)胞間的差異基因
cells1 <- subset(pbmc@meta.data, celltype %in% c("CD14+ Mono")) %>% rownames()
cells2 <- subset(pbmc@meta.data, celltype %in% c("FCGR3A+ Mono")) %>% rownames()
deg <- FindMarkers(pbmc, ident.1 = cells1, ident.2 = cells2)
deg <- data.frame(gene = rownames(deg), deg)
head(deg)
## gene p_val avg_logFC pct.1 pct.2 p_val_adj
## FCGR3A FCGR3A 1.193617e-101 -2.617707 0.131 0.975 1.636926e-97
## LYZ LYZ 8.134552e-75 1.812078 1.000 0.988 1.115572e-70
## RHOC RHOC 4.479768e-68 -1.611576 0.162 0.864 6.143554e-64
## S100A8 S100A8 7.471811e-65 2.610696 0.975 0.500 1.024684e-60
## S100A9 S100A9 1.318422e-64 2.286734 0.996 0.870 1.808084e-60
## IFITM2 IFITM2 4.821669e-64 -1.445772 0.677 1.000 6.612437e-60
deg1 <- deg
logFC_t=0.5
P.Value_t = 0.05
k1 = (deg1$p_val_adj < P.Value_t)&(deg1$avg_logFC < -logFC_t)
k2 = (deg1$p_val_adj < P.Value_t)&(deg1$avg_logFC > logFC_t)
table(k1)
## k1
## FALSE TRUE
## 569 94
table(k2)
## k2
## FALSE TRUE
## 620 43
#分組 上調(diào)/下調(diào)
change = ifelse(k1,"down",ifelse(k2,"up","stable"))
deg1$change <- change
head(deg1)
## gene p_val avg_logFC pct.1 pct.2 p_val_adj change
## FCGR3A FCGR3A 1.193617e-101 -2.617707 0.131 0.975 1.636926e-97 down
## LYZ LYZ 8.134552e-75 1.812078 1.000 0.988 1.115572e-70 up
## RHOC RHOC 4.479768e-68 -1.611576 0.162 0.864 6.143554e-64 down
## S100A8 S100A8 7.471811e-65 2.610696 0.975 0.500 1.024684e-60 up
## S100A9 S100A9 1.318422e-64 2.286734 0.996 0.870 1.808084e-60 up
## IFITM2 IFITM2 4.821669e-64 -1.445772 0.677 1.000 6.612437e-60 down
#基因名稱轉(zhuǎn)換
s2e <- bitr(deg1$gene,
fromType = "SYMBOL",
toType = "ENTREZID",
OrgDb = org.Hs.eg.db)#人類 轉(zhuǎn)換成ENTREZID
## 'select()' returned 1:1 mapping between keys and columns
## Warning in bitr(deg1$gene, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Hs.eg.db): 5.73% of input
## gene IDs are fail to map...
deg1 <- inner_join(deg1,s2e,by=c("gene"="SYMBOL"))
head(deg1)
## gene p_val avg_logFC pct.1 pct.2 p_val_adj change ENTREZID
## 1 FCGR3A 1.193617e-101 -2.617707 0.131 0.975 1.636926e-97 down 2214
## 2 LYZ 8.134552e-75 1.812078 1.000 0.988 1.115572e-70 up 4069
## 3 RHOC 4.479768e-68 -1.611576 0.162 0.864 6.143554e-64 down 389
## 4 S100A8 7.471811e-65 2.610696 0.975 0.500 1.024684e-60 up 6279
## 5 S100A9 1.318422e-64 2.286734 0.996 0.870 1.808084e-60 up 6280
## 6 IFITM2 4.821669e-64 -1.445772 0.677 1.000 6.612437e-60 down 10581
#GO富集分析差異基因列表[Symbol]
gene_up = deg1[deg1$change == 'up','gene']
gene_down = deg1[deg1$change == 'down','gene']
gene_diff = c(gene_up,gene_down)
#KEGG富集分析差異基因列表[ENTREZID]
gene_all = deg1[,'ENTREZID']
gene_up_KEGG = deg1[deg1$change == 'up','ENTREZID']
gene_down_KEGG = deg1[deg1$change == 'down','ENTREZID']
gene_diff_KEGG = c(gene_up_KEGG,gene_down_KEGG)
2.3 GO富集分析
富集分析的本質(zhì)是什么呢?富集表示差異基因中注釋到某個(gè)代謝通路的基因數(shù)目在所有差異基因中的比例顯著大于背景基因中注釋到某個(gè)代謝通路的基因數(shù)目在所有背景基因中的比例。
從統(tǒng)計(jì)上來講富集上就是一個(gè)超幾何分布,超幾何分布是統(tǒng)計(jì)學(xué)上一種離散概率分布。它描述了從有限N個(gè)物件(其中包含M個(gè)指定種類的物件)中抽出n個(gè)物件,成功抽出該指定種類的物件的次數(shù)(不放回)。最常用的統(tǒng)計(jì)檢驗(yàn)包括:超幾何分布、卡方檢驗(yàn)、二項(xiàng)分布。

N:所有通路總基因數(shù)目
n:差異基因列表中基因數(shù)目
M:指定通路的基因數(shù)目
i(k):差異基因列表中富集在指定通路的基因數(shù)目
GO數(shù)據(jù)庫,全稱是Gene Ontology(基因本體),他們把基因的功能分成了三個(gè)部分分別是:細(xì)胞組分(cellular component, CC)、分子功能(molecular function, MF)、生物過程(biological process, BP)。利用GO數(shù)據(jù)庫,我們就可以得到我們的目標(biāo)基因在CC, MF和BP三個(gè)層面上,主要和什么有關(guān)。
Gene Ontology:
BP:通過多種分子活動(dòng)完成的生物學(xué)過程。
MF:單個(gè)的基因產(chǎn)物(包括蛋白質(zhì)和RNA)或多個(gè)基因產(chǎn)物的復(fù)合物在分子水平上的活動(dòng)。
CC:基因產(chǎn)物在執(zhí)行功能時(shí)所處的細(xì)胞結(jié)構(gòu)位置。
[圖片上傳中...(image-da7175-1651394285751-11)]
2.3.1 GO富集分析代碼示例
#以上調(diào)基因的富集分析為例子
#細(xì)胞組分
ego_CC <- enrichGO(gene = gene_up,
keyType = 'SYMBOL', #基因ID的類型
OrgDb = org.Hs.eg.db, #包含人注釋信息的數(shù)據(jù)庫
ont = "CC",
pAdjustMethod = "BH", #指定多重假設(shè)檢驗(yàn)矯正的方法
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
head(data.frame(ego_CC))
## ID Description GeneRatio BgRatio pvalue p.adjust qvalue
## GO:0034774 GO:0034774 secretory granule lumen 10/42 321/19717 1.055125e-09 5.596029e-08 4.386585e-08
## GO:0060205 GO:0060205 cytoplasmic vesicle lumen 10/42 338/19717 1.735874e-09 5.596029e-08 4.386585e-08
## GO:0031983 GO:0031983 vesicle lumen 10/42 339/19717 1.785967e-09 5.596029e-08 4.386585e-08
## GO:1904724 GO:1904724 tertiary granule lumen 3/42 55/19717 2.182913e-04 5.129845e-03 4.021155e-03
## GO:0035580 GO:0035580 specific granule lumen 3/42 62/19717 3.114424e-04 5.855118e-03 4.589678e-03
## geneID Count
## GO:0034774 LYZ/S100A8/S100A9/GSTP1/GRN/FCN1/FOLR3/S100A12/QPCT/APRT 10
## GO:0060205 LYZ/S100A8/S100A9/GSTP1/GRN/FCN1/FOLR3/S100A12/QPCT/APRT 10
## GO:0031983 LYZ/S100A8/S100A9/GSTP1/GRN/FCN1/FOLR3/S100A12/QPCT/APRT 10
## GO:1904724 LYZ/FOLR3/QPCT 3
## GO:0035580 LYZ/FOLR3/QPCT 3
#生物過程
ego_BP <- enrichGO(gene = gene_up,
OrgDb = org.Hs.eg.db,
keyType = 'SYMBOL',
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
#分子功能
ego_MF <- enrichGO(gene = gene_up,
OrgDb = org.Hs.eg.db,
keyType = 'SYMBOL',
ont = "MF",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
save(ego_CC,ego_BP,ego_MF,file = "GO.Rdata")
#細(xì)胞組分、分子功能、生物學(xué)過程
go <- enrichGO(gene = gene_up, OrgDb = "org.Hs.eg.db", keyType = 'SYMBOL',ont="all")
2.3.2 GO富集結(jié)果可視化
dotplot(ego_CC, showCategory=30)

barplot(ego_CC)

p <- dotplot(go, split="ONTOLOGY") +facet_grid(ONTOLOGY~., scale="free")
p

2.4 KEGG富集分析
KEGG(Kyoto Encyclopedia of Genes and Genomes)數(shù)據(jù)庫是系統(tǒng)地分析基因功能、鏈接基因組信息和功能信息的數(shù)據(jù)庫,包括代謝通路(pathway)數(shù)據(jù)庫、分層分類數(shù)據(jù)庫、基因數(shù)據(jù)庫、基因組數(shù)據(jù)庫等。KEGG的pathway數(shù)據(jù)庫是應(yīng)用最廣泛的代謝通路公共數(shù)據(jù)庫。顯著性計(jì)算方法為超幾何檢驗(yàn)。
#KEGG全部的物種及其簡寫:https://www.genome.jp/kegg/catalog/org_list.html
#pathway對應(yīng)的描述信息,比如人的:http://rest.kegg.jp/list/pathway/hsa
[圖片上傳中...(image-8cda06-1651394285751-10)]
2.4.1 KEGG富集分析代碼示例
#上調(diào)基因富集
kk.up <- enrichKEGG(gene = gene_up_KEGG, #注意這里只能用 entrzeid
organism = 'hsa',
universe = gene_all, ##背景基因集,可省
pvalueCutoff = 0.9, ##指定 p 值閾值,不顯著的值將不顯示在結(jié)果中
qvalueCutoff = 0.9)
#下調(diào)基因富集
kk.down <- enrichKEGG(gene = gene_down_KEGG,
organism = 'hsa',
universe = gene_all,
pvalueCutoff = 0.9,
qvalueCutoff =0.9)
kk.diff <- enrichKEGG(gene = gene_diff_KEGG,
organism = 'hsa',
pvalueCutoff = 0.9)
save(kk.diff,kk.down,kk.up,file = "GSE4107kegg.Rdata")
#從富集結(jié)果中提取結(jié)果數(shù)據(jù)框
ekegg <- setReadable(kk.up, OrgDb = org.Hs.eg.db, keyType="ENTREZID")
kegg_diff_dt <- data.frame(ekegg)
head(kegg_diff_dt)
## ID Description GeneRatio BgRatio pvalue p.adjust qvalue
## hsa04640 hsa04640 Hematopoietic cell lineage 4/28 14/364 0.01647484 0.6593479 0.6593479
## hsa05140 hsa05140 Leishmaniasis 4/28 16/364 0.02685361 0.6593479 0.6593479
## hsa01100 hsa01100 Metabolic pathways 8/28 60/364 0.06925488 0.6593479 0.6593479
## hsa04380 hsa04380 Osteoclast differentiation 4/28 22/364 0.07783745 0.6593479 0.6593479
## hsa04514 hsa04514 Cell adhesion molecules 3/28 14/364 0.08361902 0.6593479 0.6593479
## hsa05152 hsa05152 Tuberculosis 4/28 23/364 0.08924392 0.6593479 0.6593479
## geneID Count
## hsa04640 CD14/CSF3R/FCGR1A/HLA-DMA 4
## hsa05140 NCF1/NFKBIA/FCGR1A/HLA-DMA 4
## hsa01100 GPX1/GSTP1/GAPDH/BLVRB/ALDH2/TALDO1/APRT/SAT2 8
## hsa04380 NCF1/NFKBIA/FOSB/FCGR1A 4
## hsa04514 CD99/VCAN/HLA-DMA 3
## hsa05152 CD14/FCGR1A/CLEC4E/HLA-DMA 4
2.4.2 KEGG結(jié)果可視化
p1 <- barplot(ekegg, showCategory=10)
p2 <- dotplot(ekegg, showCategory=10)
plotc = p1/p2
plotc

up_kegg <- kk.up@result %>%
filter(pvalue<0.01) %>%
mutate(group=1)
down_kegg <- kk.down@result %>%
filter(pvalue<0.05) %>% #篩選行
mutate(group=-1) #新增列
#可視化
#barplot(up_kegg, showCategory = 10)
#g_kegg <- kegg_plot(up_kegg,down_kegg)
#g_kegg
#g_kegg +scale_y_continuous(labels = c(15,10,5,0,5,10,15,20,25,30))
2.5 GSEA富集分析
GSEA(Gene Set Enrichment Analysis):基因集富集分析,由Broad Institute研究所提出的一種富集方法。
GSEA的輸入是一個(gè)基因表達(dá)量矩陣,其中的樣本分成了A和B兩組,首先對所有基因進(jìn)行排序,簡單理解就是根據(jù)處理后的差異倍數(shù)值進(jìn)行從大到小排序,用來表示基因在兩組間的表達(dá)量變化趨勢。
排序之后的基因列表其頂部可看做是上調(diào)的差異基因,其底部是下調(diào)的差異基因。

GSEA分析的是一個(gè)基因集下的所有基因是富集在這個(gè)排序列表的頂部還是底部,如果在頂部富集,可以說,從總體上看,該基因集是上調(diào)趨勢,反之,如果在底部富集,則是下調(diào)趨勢。
GSEA富集分析的優(yōu)點(diǎn):
1.傳統(tǒng)的GO和KEGG是針對有差異的基因進(jìn)行富集分析,GSEA不需要對基因進(jìn)行顯著差異的篩選,可以保留變化不大但功能重要的基因信息
2.分析的是單個(gè)基因集合。
2.5.1 GSEA代碼
library(dplyr)
library(GSEABase)
library(clusterProfiler)
library(DOSE)
library(org.Hs.eg.db)
library(ggplot2)
library(stringr)
library(enrichplot)
options(stringsAsFactors = F)
geneList = deg1$avg_logFC
names(geneList) = deg1$gene
geneList = sort(geneList,decreasing = T)
geneList[1:10]
## S100A8 S100A9 LGALS2 LYZ MS4A6A CD14 CCL3 S100A12 GPX1 FOLR3
## 2.610696 2.286734 2.049431 1.812078 1.645181 1.630356 1.531812 1.249115 1.238310 1.120697
#準(zhǔn)備gmt文件
#構(gòu)建gmt文件方法:http://www.itdecent.cn/p/9a6a90697c25
geneset <- read.gmt("D:/singleron/項(xiàng)目/其他/2021生信培訓(xùn)/富集分析/h.all.v7.4.symbols.gmt")
geneset$ont = str_remove(geneset$ont,"HALLMARK_")
head(geneset)
## ont gene
## 1 TNFA_SIGNALING_VIA_NFKB JUNB
## 2 TNFA_SIGNALING_VIA_NFKB CXCL2
## 3 TNFA_SIGNALING_VIA_NFKB ATF3
## 4 TNFA_SIGNALING_VIA_NFKB NFKBIA
## 5 TNFA_SIGNALING_VIA_NFKB TNFAIP3
## 6 TNFA_SIGNALING_VIA_NFKB PTGS2
egmt <- GSEA(geneList, TERM2GENE=geneset,verbose=F,pvalueCutoff = 0.5)
#結(jié)果轉(zhuǎn)化
y=data.frame(egmt)
head(y)
## ID Description setSize enrichmentScore
## INTERFERON_ALPHA_RESPONSE INTERFERON_ALPHA_RESPONSE INTERFERON_ALPHA_RESPONSE 14 -0.5434808
## NES pvalue p.adjust qvalues rank leading_edge
## INTERFERON_ALPHA_RESPONSE -1.923206 0.005597015 0.1287313 0.1287313 266 tags=93%, list=43%, signal=55%
## core_enrichment
## INTERFERON_ALPHA_RESPONSE HLA-C/PSMB9/CD47/ADAR/LY6E/IFI30/GBP2/OAS1/B2M/CASP1/IFITM3/IFITM1/IFITM2
2.5.2 GSEA結(jié)果可視化
#經(jīng)典gseaplot
gseaplot2(egmt, geneSetID = 1, title = egmt$Description[1])

2.6 GSVA富集分析
GSVA(gene set variation analysis),通過將基因在不同樣品間的表達(dá)矩陣轉(zhuǎn)化成基因集在樣品間的表達(dá)矩陣,從而來評估不同的代謝通路在不同樣品間是否富集。

富集過程主要包含一下四步:
1.評估基因i在樣品j中是高表達(dá)還是低表達(dá):累積密度函數(shù)的核估計(jì),得到每個(gè)樣本的表達(dá)水平統(tǒng)計(jì)
2.對每個(gè)樣本的表達(dá)水平統(tǒng)計(jì)進(jìn)行排序和標(biāo)準(zhǔn)化
3.計(jì)算每個(gè)基因集的類KS隨機(jī)游走統(tǒng)計(jì)量
4.將類KS隨機(jī)游走統(tǒng)計(jì)量轉(zhuǎn)化為ES:最大偏離量(雙峰),最大正負(fù)偏離量之差(近似正態(tài)分布)
2.6.1 GSVA富集示例
library(GSVA)
expr <- as.data.frame(pbmc@assays$RNA@data)
expr[1:10,1:10]
## AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG AAACCGTGTATGCG AAACGCACTGGTAC
## AL627309.1 0 0 0 0 0 0
## AP006222.2 0 0 0 0 0 0
## RP11-206L10.2 0 0 0 0 0 0
## RP11-206L10.9 0 0 0 0 0 0
## LINC00115 0 0 0 0 0 0
## NOC2L 0 0 0 0 0 0
## KLHL17 0 0 0 0 0 0
## PLEKHN1 0 0 0 0 0 0
## RP11-54O7.17 0 0 0 0 0 0
## HES4 0 0 0 0 0 0
## AAACGCTGACCAGT AAACGCTGGTTCTT AAACGCTGTAGCCA AAACGCTGTTTCTG
## AL627309.1 0 0 0 0
## AP006222.2 0 0 0 0
## RP11-206L10.2 0 0 0 0
## RP11-206L10.9 0 0 0 0
## LINC00115 0 0 0 0
## NOC2L 0 0 0 0
## KLHL17 0 0 0 0
## PLEKHN1 0 0 0 0
## RP11-54O7.17 0 0 0 0
## HES4 0 0 0 0
expr=as.matrix(expr)
kegg_list = split(geneset$gene, geneset$ont)
kegg_list[1:2]
## $ADIPOGENESIS
## [1] "FABP4" "ADIPOQ" "PPARG" "LIPE" "DGAT1" "LPL" "CPT2" "CD36" "GPAM"
## [10] "ADIPOR2" "ACAA2" "ETFB" "ACOX1" "ACADM" "HADH" "IDH1" "SORBS1" "ACADS"
## [19] "UCK1" "SCP2" "DECR1" "CDKN2C" "TALDO1" "TST" "MCCC1" "PGM1" "REEP5"
## [28] "BCL2L13" "SLC25A10" "ME1" "PHYH" "PIM3" "YWHAG" "NDUFAB1" "GPD2" "ADIG"
## [37] "ECHS1" "QDPR" "CS" "ECH1" "SLC25A1" "ACADL" "TOB1" "GRPEL1" "CRAT"
## [46] "GBE1" "CAVIN2" "SCARB1" "PEMT" "CHCHD10" "AK2" "APOE" "UQCR10" "TANK"
## [55] "ANGPTL4" "ACO2" "FAH" "ACLY" "IFNGR1" "SLC5A6" "JAGN1" "EPHX2" "IDH3G"
## [64] "GPX3" "ELMOD3" "ORM1" "RETSAT" "ESRRA" "HIBCH" "SUCLG1" "STAT5A" "ITGA7"
## [73] "MRAP" "PLIN2" "CYC1" "ALDH2" "RNF11" "ALDOA" "SULT1A1" "DDT" "SDHB"
## [82] "CD151" "SLC27A1" "BCKDHA" "C3" "LEP" "ADCY6" "ELOVL6" "LTC4S" "SPARCL1"
## [91] "RMDN3" "MTCH2" "SOWAHC" "SLC1A5" "CMPK1" "REEP6" "NDUFA5" "FZD4" "DRAM2"
## [100] "MGST3" "ATP1B3" "RETN" "STOM" "ESYT1" "GHITM" "DNAJC15" "GADD45A" "VEGFB"
## [109] "PFKL" "COQ3" "NABP1" "CYP4B1" "PPM1B" "ARAF" "CAVIN1" "COL4A1" "IMMT"
## [118] "DHRS7" "COL15A1" "NMT1" "COQ5" "LAMA4" "AGPAT3" "BAZ2A" "IDH3A" "LIFR"
## [127] "PREB" "PTGER3" "GPHN" "PFKFB3" "GPX4" "SSPN" "SQOR" "MTARC2" "DLD"
## [136] "ITIH5" "CD302" "ATL2" "GPAT4" "LPCAT3" "TKT" "UQCRC1" "CAT" "OMD"
## [145] "DLAT" "MRPL15" "RIOK3" "RTN3" "CHUK" "G3BP2" "SDHC" "SAMM50" "ARL4A"
## [154] "SNCG" "PDCD4" "COQ9" "APLP2" "SOD1" "PTCD3" "PHLDB1" "ENPP2" "HSPB8"
## [163] "AIFM1" "CCNG2" "PPP1R15B" "MDH2" "ABCA1" "COX7B" "MYLK" "COX8A" "DHRS7B"
## [172] "MIGA2" "MGLL" "ITSN1" "DHCR7" "RREB1" "CMBL" "UBC" "ATP5PO" "PRDX3"
## [181] "DBT" "NDUFS3" "NKIRAS1" "RAB34" "CIDEA" "UQCRQ" "PEX14" "BCL6" "COX6A1"
## [190] "DNAJB9" "MAP4K3" "ANGPT1" "UBQLN1" "NDUFB7" "SLC19A1" "ABCB8" "SLC66A3" "POR"
## [199] "UCP2" "UQCR11"
##
## $ALLOGRAFT_REJECTION
## [1] "PTPRC" "IL12B" "TGFB1" "IL12A" "CD3E" "CD3D" "CD28" "LYN" "HCLS1"
## [10] "IL18" "CRTAM" "IFNG" "CD3G" "CD86" "IL10" "UBE2N" "BCL10" "CD4"
## [19] "LCK" "NCK1" "C2" "HLA-A" "ITGB2" "HLA-DQA1" "CD1D" "CD80" "HLA-DRA"
## [28] "THY1" "TLR1" "HLA-G" "HLA-DMB" "IL7" "IL4" "TNF" "CD247" "IL2"
## [37] "HLA-DMA" "STAT1" "IRF4" "SRGN" "INHBA" "TLR3" "ZAP70" "CD74" "CD40"
## [46] "TRAF2" "B2M" "BCL3" "LTB" "IFNGR1" "CCR5" "CD40LG" "HLA-DOA" "GLMN"
## [55] "IL6" "HLA-E" "CD2" "CCL5" "FAS" "FASLG" "TLR6" "PF4" "TGFB2"
## [64] "CD79A" "INHBB" "ELANE" "SPI1" "MAP3K7" "IL15" "CTSS" "CD47" "PRF1"
## [73] "IL12RB1" "LCP2" "SOCS1" "CDKN2A" "STAT4" "CD7" "HLA-DOB" "CD8A" "ICAM1"
## [82] "CCL4" "GZMB" "CSF1" "IL11" "STAB1" "IL2RA" "NLRP3" "CCND3" "EIF3A"
## [91] "SIT1" "IFNAR2" "HDAC9" "CARTPT" "TRAT1" "CCL22" "APBB1" "FYB1" "IL1B"
## [100] "TIMP1" "RPS19" "JAK2" "KRT1" "WARS1" "IFNGR2" "CCR2" "EREG" "MMP9"
## [109] "EGFR" "IL16" "CFP" "WAS" "ITGAL" "KLRD1" "RARS1" "TLR2" "CCND2"
## [118] "IL2RG" "ETS1" "ITK" "NCR1" "MAP4K1" "CCL19" "PSMB10" "RPL39" "EIF3J"
## [127] "ABCE1" "CD8B" "F2" "ELF4" "LY86" "FCGR2B" "GBP2" "PRKCG" "RPS9"
## [136] "MTIF2" "GZMA" "AARS1" "CD96" "CSK" "HIF1A" "CCL2" "ICOSLG" "NPM1"
## [145] "IL4R" "CCL11" "NME1" "FLNA" "GPR65" "ACHE" "EIF3D" "IGSF6" "F2R"
## [154] "IL13" "TAP1" "DARS1" "IRF7" "ACVR2A" "CXCR3" "PRKCB" "CXCL9" "PTPN6"
## [163] "NCF4" "UBE2D1" "LIF" "CCR1" "MBL2" "DEGS1" "TPD52" "AKT1" "RIPK2"
## [172] "IKBKB" "GCNT1" "SOCS5" "IRF8" "TAP2" "EIF4G3" "ABI1" "CCL7" "IL2RB"
## [181] "BRCA1" "FGR" "IL18RAP" "MRPL3" "CXCL13" "CAPG" "EIF5A" "RPS3A" "GALNT1"
## [190] "ST8SIA4" "CCL13" "RPL3L" "LY75" "TAPBP" "NOS2" "RPL9" "BCAT1" "IL9"
## [199] "IL27RA" "DYRK3"
kegg2 <- gsva(expr, kegg_list, kcdf="Gaussian",method = "gsva",parallel.sz=12)
## Warning in .local(expr, gset.idx.list, ...): 1 genes with constant expression values throuhgout the samples.
## Warning in .local(expr, gset.idx.list, ...): Since argument method!="ssgsea", genes with constant expression
## values are discarded.
## Estimating GSVA scores for 50 gene sets.
## Computing observed enrichment scores
## Estimating ECDFs with Gaussian kernels
## Using parallel with 8 cores
##
|
| | 0%
|
|============== | 14%
|
|============================= | 29%
|
|=========================================== | 43%
|
|========================================================= | 57%
|
|======================================================================= | 71%
|
|====================================================================================== | 86%
|
|====================================================================================================| 100%
write.table(kegg2,"D:/singleron/項(xiàng)目/其他/2021生信培訓(xùn)/富集分析/kegg2.txt",sep="\t",quote = F)
data <- read.table('D:/singleron/項(xiàng)目/其他/2021生信培訓(xùn)/富集分析/kegg2.txt',header = TRUE)
data[1:10,1:10]
## AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG AAACCGTGTATGCG
## ADIPOGENESIS -0.31805820 -0.18462905 -0.15811209 -0.154243884 -0.37283740
## ALLOGRAFT_REJECTION -0.19507167 -0.03794476 -0.18071016 -0.165990361 -0.20320035
## ANDROGEN_RESPONSE -0.23005329 -0.16156074 -0.04563213 -0.235821291 -0.15594740
## ANGIOGENESIS -0.15038280 -0.19228819 -0.22088828 0.246071801 0.21906782
## APICAL_JUNCTION -0.03969773 0.10783847 0.03563322 0.066345815 0.10212727
## APICAL_SURFACE -0.03905613 -0.21580357 -0.01430062 -0.339748444 0.14688488
## APOPTOSIS -0.20717708 -0.38186444 -0.03548629 -0.121050533 -0.33615891
## BILE_ACID_METABOLISM 0.06506092 0.09057561 0.17393321 0.001260199 -0.11349065
## CHOLESTEROL_HOMEOSTASIS -0.23527893 -0.30489279 -0.18848634 -0.078379978 -0.22117629
## COAGULATION -0.12148690 -0.04065053 -0.08133865 -0.016734377 0.06458631
## AAACGCACTGGTAC AAACGCTGACCAGT AAACGCTGGTTCTT AAACGCTGTAGCCA AAACGCTGTTTCTG
## ADIPOGENESIS -0.180867145 -0.215086765 -0.27301188 -0.29392780 -0.22786709
## ALLOGRAFT_REJECTION -0.270354626 -0.175938344 -0.11045626 -0.34202594 -0.32641029
## ANDROGEN_RESPONSE -0.088974281 -0.179532791 -0.11313698 -0.24368755 -0.21325196
## ANGIOGENESIS 0.058400640 -0.050501579 -0.02987406 0.12116168 0.33073463
## APICAL_JUNCTION 0.100274333 -0.002669347 -0.09365991 0.06416936 0.03224153
## APICAL_SURFACE -0.029738394 -0.162577894 -0.15939289 -0.09816950 -0.24598109
## APOPTOSIS -0.256041490 -0.336932260 -0.15765883 -0.27774790 -0.22567893
## BILE_ACID_METABOLISM 0.123196649 -0.049568414 0.01197374 0.05770083 -0.03440201
## CHOLESTEROL_HOMEOSTASIS -0.158308258 -0.327274194 -0.18463169 -0.27325823 -0.13552362
## COAGULATION -0.008076284 0.021259688 0.08345908 0.08178405 0.18399532
2.6.2 GSVA結(jié)果可視化
meta <- as.data.frame(pbmc@meta.data[,c('orig.ident',"celltype")])
#細(xì)胞按照細(xì)胞類型排序
meta <- meta %>% arrange(meta$celltype)
data <- data[,rownames(meta)]
#取各細(xì)胞類型對應(yīng)的通路score的均值
identical(colnames(data),rownames(meta))
## [1] TRUE
data$CD4_T <- apply(data[,1:697], 1, mean)
data$Memory_CD4_T <- apply(data[,698:1181], 1, mean)
data$CD14_Mono <- apply(data[,1182:1662], 1, mean)
data$B <- apply(data[,1663:2007], 1, mean)
data$CD8_T <- apply(data[,2007:2279], 1, mean)
table(meta$celltype3)
## < table of extent 0 >
test <- data[,c("CD4_T","Memory_CD4_T","CD14_Mono","B","CD8_T")]
pathway <- c("COAGULATION","COMPLEMENT","DNA_REPAIR","E2F_TARGETS","EPITHELIAL_MESENCHYMAL_TRANSITION","ESTROGEN_RESPONSE_EARLY","ESTROGEN_RESPONSE_LATE","FATTY_ACID_METABOLISM","G2M_CHECKPOINT","GLYCOLYSIS","HEDGEHOG_SIGNALING")
test1 <- test[pathway,]
result_plot<- t(scale(t(test1)))
library(pheatmap)
G1 <- pheatmap(result_plot,
cluster_rows = F,
cluster_cols = F,
show_rownames = T,
show_colnames = T,
color =colorRampPalette(c("blue", "white","red"))(100),
cellwidth = 10, cellheight = 15,
fontsize = 10)

pdf(("D:/singleron/項(xiàng)目/其他/2021生信培訓(xùn)/富集分析/G1.pdf"),width = 7,height = 7)
參考文獻(xiàn)
1、Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375\. doi: 10.1371/journal.pcbi.1002375\. Epub 2012 Feb 23\. PMID: 22383865; PMCID: PMC3285573.
