實在懶得找meme了
GOrilla, clusterProfiler, topGO 分析 ranked gene list (GSEA) 對比
Genelist preparation
保證了三個工具輸入的基因是一致的。
# genelist preparation
library(hgu133plus2.db)
rankedprobe <- row.names(DEGdf)
load("GSE76275_DEGs.Rdata")
## ranked entrezid by pval
rankedentrezt <- AnnotationDbi::select(hgu133plus2.db,
rankedprobe,
"ENTREZID",
"PROBEID")
rankedentrez <- unique(na.omit(rankedentrezt$ENTREZID))
for GOrilla
糾正一個誤解,之前以為 GOrilla 要求的 gene list 是按表達量從高到低排序。
This is particularly useful in many typical cases where genomic data may be naturally represented as a ranked list of genes (e.g. by level of expression or of differential expression).
但實際上是依然是基于t檢驗,也就是基于p值。
All genes were ranked according to how well they differentiate between the two groups using a simple t-test. The top of the list contained the genes that were the best separators between the two groups.
GOrilla 的優(yōu)勢主要體現(xiàn)在:
- 與基于超幾何分布的工具相比——閾值靈活
- 與其他 閾值靈活/類似GSEA/利用Kolmogorov-Smirnov檢驗 的工具相比:
- 專注于GO,可直接生成DAG
- 利用 "mHG " 可得到精確的 p.val
- 速度快
- (似乎放在現(xiàn)在這些都不是什么問題,畢竟 GSEA 2005年發(fā)布,GOrilla 2008年發(fā)布...不過一布拿到圖+表也是很爽的額 (。﹏。)
## ranked symbol (increasing by pval)
rankedsymbol <- AnnotationDbi::select(hgu133plus2.db,
rankedentrez,
"SYMBOL",
"ENTREZID")[,2]
save(rankedsymbol, file = "rankedsymbol_pvbased_forGOrilla.Rdata")
write.table(rankedsymbol, file =
"rankedsymbol_pvbased_forGOrilla.txt",
quote = F, col.names = F, row.names = F)
手動選擇 pvalue cutoff

for clusterProfiler
## ranked entrezid with pval (increasing by pval)
matchedprobe <- match(rankedentrez,
rankedentrezt$ENTREZID)
matchedprobe2 <- rankedentrezt[matchedprobe,][,"PROBEID"]
respon.pval <- DEGdf[matchedprobe2,]$P.Value
entrezlist_pval <- respon.pval
names(entrezlist_pval) <- rankedentrez
save(entrezlist_pval, file = "genelist_ENTREZID_pVal.Rdata")
### decreasing entrezid list by pval
entrezlist_dcrpv <- sort(entrezlist_pval, decreasing = T)
save(entrezlist_dcrpv, file = "genelist_ENTREZID_Decr_pVal.Rdata")
### decreasing entrezid list by FC
entrezlist_FC <- DEGdf[matchedprobe2,]$logFC
names(entrezlist_FC) <- rankedentrez
entrezlist_FC <- sort(entrezlist_FC, decreasing = T)
save(entrezlist_FC, file = "genelist_ENTREZID_FC.Rdata")
for topGO
## probeid with p.val list (increasing by pval)
probelist_pval <- respon.pval
names(probelist_pval) <- matchedprobe2
save(probelist_pval, file = "genelist_PROBEID_pVal.Rdata")
GOrilla
Running mode: Single ranked list of genes
很酷炫的一張大圖


重復(fù)基因會被自動刪除

但并不知道哪來的這12個 duplicate genes (+_+)?
table(table(rankedsymbol))
#
# 1
# 22012
Enrichment (N, B, n, b) is defined as follows:
N - is the total number of genes
B - is the total number of genes associated with a specific GO term
n - is the number of genes in the top of the user's input list or in the target set when appropriate
b - is the number of genes in the intersection
Enrichment = (b/n) / (B/N)

缺點是并沒有顯示富集到了多少個GO term, 強行扒下這個表后,小小操作:
GOrillares <- read.csv("GOrillatable.txt", header = T, sep = "\t")
save(GOrillares, file = "GOrillatable.Rdata")

所以就是有100個 term.
nrow(GOrillares)
# [1] 100
clusterProfiler
GSEA GO-BP
標準的 GSEA.
關(guān)于輸入的 gene list, y叔是這么說的:
The
geneListcontains three features:
- numeric vector: fold change or other type of numerical variable
- named vector: every number was named by the corresponding gene ID
- sorted vector: number should be sorted in decreasing order
但用 p.val 的效果似乎一般。
## clusterProfiler GSEA GO-BP
library(clusterProfiler)
library(hgu133plus2.db)
load("genelist_ENTREZID_Decr_pVal.Rdata")
### with decreasing p.val
gseaGO_BP <- gseGO(geneList = entrezlist_dcrpv,
OrgDb = hgu133plus2.db,
keyType = "ENTREZID",
ont = "BP",
nPerm = 1000, ## 排列數(shù)
minGSSize = 5,
maxGSSize = 500,
pvalueCutoff = 0.95,
verbose = TRUE) ## 不輸出結(jié)果
gseaGO_BPresult <- gseaGO_BP@result
save(gseaGO_BPresult, file = "cp_gseaGO_BPresult.Rdata")
可能和網(wǎng)絡(luò)也有關(guān)系?幾次輸入同樣的參數(shù),得到的結(jié)果并不一樣....有時是報錯 no term, 迷惑desu


也沒什么顯著的 term

還是像示例里那樣用 decreasing FC
### with decreasing FC
load("genelist_ENTREZID_FC.Rdata")
gseaGO_BP2 <- gseGO(geneList = entrezlist_FC,
OrgDb = hgu133plus2.db,
keyType = "ENTREZID",
ont = "BP",
nPerm = 1000, ## 排列數(shù)
minGSSize = 5,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = TRUE) ## 不輸出結(jié)果
gseaGO_BPresult2 <- gseaGO_BP2@result
save(gseaGO_BPresult2, file = "cp_gseaGO_BPresult2.Rdata")
結(jié)果果然正常多了

topGO
GSEA-like GO-BP (Using the genes score )
準確地說,也許不能直接叫做 GSEA 法。topGO 說明書里的說法是:
We will use two types of test statistics: Fisher's exact test which is based on gene counts, and a Kolmogorov-Smirnov like test which computes enrichment based on gene scores.
并且:
We can use both these tests since each gene has a score (representing how di erentially expressed a gene is) and by the means of topDiffGenes functions the genes are categorized into di erentially expressed or not di erentially expressed genes.
正因為有 topDiffGenes 這個函數(shù)的存在,相當于無形中給了 target list, 而 allGenes = 相當于 universe gene list. 所以這也許是 topGO 無論用類似 ORA 或 類似 GSEA 法構(gòu)建的 topGOdata ,算法和檢驗方法可以各種混用的原因。(maybe <@_@>

## topGO GSEA GO-BP
library(topGO)
library(hgu133plus2.db)
load("genelist_PROBEID_pVal.Rdata")
## function: topDiffGenes
topDiffGenes <- function(allScore) {
return(allScore < 0.01) ## p.val < 0.01
}
sum(topDiffGenes(probelist_pval))
# [1] 12210
## construct a topGOdata object
GSEAGOdata_BP <- new("topGOdata",
ontology = "BP",
allGenes = probelist_pval,
geneSel = topDiffGenes,
nodeSize = 5,
annot = annFUN.db,
affyLib = "hgu133plus2.db")
## Kolmogorov-Smirnov like test - classic
KSres <- runTest(GSEAGOdata_BP, algorithm = "weight01", statistic = "ks")
KSres
#
# Description:
# Ontology: BP
# 'weight01' algorithm with the 'ks' test
# 9106 GO terms scored: 203 terms with p < 0.01
# Annotation data:
# Annotated genes: 16454
# Significant genes: 9542
# Min. no. of genes annotated to a GO: 5
# Nontrivial nodes: 9106
GSEA_BPres <- GenTable(GSEAGOdata_BP,
weight01KS = KSres,
orderBy = "weight01KS",
ranksOf = "weight01KS",
topNodes = 203)
save(GSEA_BPres, file = "topGO_gseaGO_BPresult.Rdata")

結(jié)果對比
先找出三種工具結(jié)果中共有的 GO term:
nrow(GOrillares) ## GOrilla
# [1] 100
nrow(gseaGO_BPresult2) ## clusterProfiler
# [1] 457
nrow(GSEA_BPres) ## topGO
# [1] 203
GOrillavscp <- intersect(GOrillares$GO.term, gseaGO_BPresult2$ID)
GOrillavstop <- intersect(GOrillares$GO.term, GSEA_BPres$GO.ID)
triple <- intersect(GOrillavscp, GOrillavstop)
triple
# [1] "GO:0050867"
只有一個??
row.names(GOrillares) <- GOrillares$GO.term
row.names(GSEA_BPres) <- GSEA_BPres$GO.ID
theonly <- cbind(GOrillares[triple,],
gseaGO_BPresult2[triple,],
GSEA_BPres[triple,])
finetable <- subset(theonly, select = c("Description",
"Enrichment..N..B..n..b.", "P.value","FDR.q.value",
"setSize","enrichmentScore","pvalue", "p.adjust","qvalues",
"Annotated", "Significant","weight01KS"))
finetable
# Description Enrichment..N..B..n..b.
# GO:0050867 positive regulation of cell activation 35.01 (17224,328,12,8)
# P.value FDR.q.value setSize enrichmentScore pvalue
# GO:0050867 7.67e-11 6.28e-08 312 0.2948657 0.00203252
# p.adjust qvalues Annotated Significant weight01KS
# GO:0050867 0.04044123 0.03411746 313 174 0.00383

看起來, GOrilla 的結(jié)果和常用的兩個R包的結(jié)果差的還是挺多的??
出于好奇看一下 topGO 和 clusterProfiler 結(jié)果的差別
topvscp <- intersect(GSEA_BPres$GO.ID, gseaGO_BPresult2$ID)
length(topvscp)
# [1] 19

大概就是這個亞子。
最后,向大家隆重推薦生信技能樹的一系列干貨!
- 生信技能樹全球公益巡講:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
- B站公益74小時生信工程師教學(xué)視頻合輯:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
- 招學(xué)徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw
