GO富集分析-對比Gorilla, clusterProfiler, topGO三種工具part1

實在懶得找meme了


GOrilla, clusterProfiler, topGO 分析 ranked gene list (GSEA) 對比

Genelist preparation

保證了三個工具輸入的基因是一致的。

# genelist preparation
library(hgu133plus2.db)
rankedprobe <- row.names(DEGdf)
load("GSE76275_DEGs.Rdata")
## ranked entrezid by pval
rankedentrezt <- AnnotationDbi::select(hgu133plus2.db,
                                      rankedprobe,
                                      "ENTREZID",
                                      "PROBEID")
rankedentrez <- unique(na.omit(rankedentrezt$ENTREZID))

for GOrilla

糾正一個誤解,之前以為 GOrilla 要求的 gene list 是按表達量從高到低排序。

This is particularly useful in many typical cases where genomic data may be naturally represented as a ranked list of genes (e.g. by level of expression or of differential expression).

但實際上是依然是基于t檢驗,也就是基于p值。

All genes were ranked according to how well they differentiate between the two groups using a simple t-test. The top of the list contained the genes that were the best separators between the two groups.

GOrilla 的優(yōu)勢主要體現(xiàn)在:

  • 與基于超幾何分布的工具相比——閾值靈活
  • 與其他 閾值靈活/類似GSEA/利用Kolmogorov-Smirnov檢驗 的工具相比:
    • 專注于GO,可直接生成DAG
    • 利用 "mHG " 可得到精確的 p.val
    • 速度快
    • (似乎放在現(xiàn)在這些都不是什么問題,畢竟 GSEA 2005年發(fā)布,GOrilla 2008年發(fā)布...不過一布拿到圖+表也是很爽的額 (。﹏。)
## ranked symbol (increasing by pval)
rankedsymbol <- AnnotationDbi::select(hgu133plus2.db,
                                       rankedentrez,
                                       "SYMBOL",
                                       "ENTREZID")[,2]
save(rankedsymbol, file = "rankedsymbol_pvbased_forGOrilla.Rdata")
write.table(rankedsymbol, file = 
              "rankedsymbol_pvbased_forGOrilla.txt",
            quote = F, col.names = F, row.names = F)

手動選擇 pvalue cutoff

for clusterProfiler

## ranked entrezid with pval (increasing by pval)
matchedprobe <- match(rankedentrez,
                      rankedentrezt$ENTREZID)
matchedprobe2 <- rankedentrezt[matchedprobe,][,"PROBEID"]
respon.pval <- DEGdf[matchedprobe2,]$P.Value
entrezlist_pval <- respon.pval
names(entrezlist_pval) <- rankedentrez
save(entrezlist_pval, file = "genelist_ENTREZID_pVal.Rdata")

### decreasing entrezid list by pval
entrezlist_dcrpv <- sort(entrezlist_pval, decreasing = T)
save(entrezlist_dcrpv, file = "genelist_ENTREZID_Decr_pVal.Rdata")

### decreasing entrezid list by FC
entrezlist_FC <- DEGdf[matchedprobe2,]$logFC
names(entrezlist_FC) <- rankedentrez
entrezlist_FC <- sort(entrezlist_FC, decreasing = T)
save(entrezlist_FC, file = "genelist_ENTREZID_FC.Rdata")

for topGO

## probeid with p.val list (increasing by pval)
probelist_pval <- respon.pval
names(probelist_pval) <- matchedprobe2
save(probelist_pval, file = "genelist_PROBEID_pVal.Rdata")

GOrilla

Running mode: Single ranked list of genes

很酷炫的一張

重復(fù)基因會被自動刪除

但并不知道哪來的這12個 duplicate genes (+_+)?

table(table(rankedsymbol))
# 
#     1 
# 22012 

Enrichment (N, B, n, b) is defined as follows:
N - is the total number of genes
B - is the total number of genes associated with a specific GO term
n - is the number of genes in the top of the user's input list or in the target set when appropriate
b - is the number of genes in the intersection
Enrichment = (b/n) / (B/N)

缺點是并沒有顯示富集到了多少個GO term, 強行扒下這個表后,小小操作:

GOrillares <- read.csv("GOrillatable.txt", header = T, sep = "\t")
save(GOrillares, file = "GOrillatable.Rdata")

所以就是有100個 term.

nrow(GOrillares)
# [1] 100

clusterProfiler

GSEA GO-BP

標準的 GSEA.

關(guān)于輸入的 gene list, y叔是這么說的:

The geneList contains three features:

  1. numeric vector: fold change or other type of numerical variable
  2. named vector: every number was named by the corresponding gene ID
  3. sorted vector: number should be sorted in decreasing order

但用 p.val 的效果似乎一般。

## clusterProfiler GSEA GO-BP
library(clusterProfiler)
library(hgu133plus2.db)
load("genelist_ENTREZID_Decr_pVal.Rdata")

### with decreasing p.val
gseaGO_BP <- gseGO(geneList     = entrezlist_dcrpv,
                   OrgDb        = hgu133plus2.db,
                   keyType      = "ENTREZID",
                   ont          = "BP",
                   nPerm        = 1000,   ## 排列數(shù)
                   minGSSize    = 5,
                   maxGSSize    = 500,
                   pvalueCutoff = 0.95,
                   verbose      = TRUE)  ## 不輸出結(jié)果
gseaGO_BPresult <- gseaGO_BP@result
save(gseaGO_BPresult, file = "cp_gseaGO_BPresult.Rdata")

可能和網(wǎng)絡(luò)也有關(guān)系?幾次輸入同樣的參數(shù),得到的結(jié)果并不一樣....有時是報錯 no term, 迷惑desu


也沒什么顯著的 term

還是像示例里那樣用 decreasing FC

### with decreasing FC
load("genelist_ENTREZID_FC.Rdata")
gseaGO_BP2 <- gseGO(geneList     = entrezlist_FC,
                    OrgDb        = hgu133plus2.db,
                    keyType      = "ENTREZID",
                    ont          = "BP",
                    nPerm        = 1000,   ## 排列數(shù)
                    minGSSize    = 5,
                    maxGSSize    = 500,
                    pvalueCutoff = 0.05,
                    verbose      = TRUE)  ## 不輸出結(jié)果
gseaGO_BPresult2 <- gseaGO_BP2@result
save(gseaGO_BPresult2, file = "cp_gseaGO_BPresult2.Rdata")

結(jié)果果然正常多了

topGO

GSEA-like GO-BP (Using the genes score )

準確地說,也許不能直接叫做 GSEA 法。topGO 說明書里的說法是:

We will use two types of test statistics: Fisher's exact test which is based on gene counts, and a Kolmogorov-Smirnov like test which computes enrichment based on gene scores.

并且:

We can use both these tests since each gene has a score (representing how di erentially expressed a gene is) and by the means of topDiffGenes functions the genes are categorized into di erentially expressed or not di erentially expressed genes.

正因為有 topDiffGenes 這個函數(shù)的存在,相當于無形中給了 target list, 而 allGenes = 相當于 universe gene list. 所以這也許是 topGO 無論用類似 ORA 或 類似 GSEA 法構(gòu)建的 topGOdata ,算法和檢驗方法可以各種混用的原因。(maybe <@_@>

## topGO GSEA GO-BP
library(topGO)
library(hgu133plus2.db)
load("genelist_PROBEID_pVal.Rdata")

## function: topDiffGenes
topDiffGenes <- function(allScore) {
  return(allScore < 0.01)   ## p.val < 0.01
}
sum(topDiffGenes(probelist_pval))
# [1] 12210

## construct a topGOdata object
GSEAGOdata_BP <- new("topGOdata",
                     ontology = "BP",
                     allGenes = probelist_pval,
                     geneSel = topDiffGenes,
                     nodeSize = 5,
                     annot = annFUN.db,
                     affyLib = "hgu133plus2.db")
## Kolmogorov-Smirnov like test - classic
KSres <- runTest(GSEAGOdata_BP, algorithm = "weight01", statistic = "ks") 
KSres
# 
# Description:  
# Ontology: BP 
# 'weight01' algorithm with the 'ks' test
# 9106 GO terms scored: 203 terms with p < 0.01
# Annotation data:
#     Annotated genes: 16454 
#     Significant genes: 9542 
#     Min. no. of genes annotated to a GO: 5 
#     Nontrivial nodes: 9106 
GSEA_BPres <- GenTable(GSEAGOdata_BP,
                       weight01KS = KSres,
                       orderBy = "weight01KS",
                       ranksOf = "weight01KS",
                       topNodes = 203)
save(GSEA_BPres, file = "topGO_gseaGO_BPresult.Rdata")

結(jié)果對比

先找出三種工具結(jié)果中共有的 GO term:

nrow(GOrillares)  ## GOrilla
# [1] 100
nrow(gseaGO_BPresult2)  ## clusterProfiler
# [1] 457
nrow(GSEA_BPres)  ## topGO
# [1] 203
GOrillavscp <- intersect(GOrillares$GO.term, gseaGO_BPresult2$ID)
GOrillavstop <- intersect(GOrillares$GO.term, GSEA_BPres$GO.ID)
triple <- intersect(GOrillavscp, GOrillavstop)
triple
# [1] "GO:0050867"

只有一個??

row.names(GOrillares) <- GOrillares$GO.term
row.names(GSEA_BPres) <- GSEA_BPres$GO.ID
theonly <- cbind(GOrillares[triple,],
                 gseaGO_BPresult2[triple,],
                 GSEA_BPres[triple,])
finetable <- subset(theonly, select = c("Description",
                    "Enrichment..N..B..n..b.", "P.value","FDR.q.value",
                    "setSize","enrichmentScore","pvalue", "p.adjust","qvalues", 
                    "Annotated", "Significant","weight01KS"))
finetable
#                                       Description Enrichment..N..B..n..b.
# GO:0050867 positive regulation of cell activation  35.01 (17224,328,12,8)
#             P.value FDR.q.value setSize enrichmentScore     pvalue
# GO:0050867 7.67e-11    6.28e-08     312       0.2948657 0.00203252
#              p.adjust    qvalues Annotated Significant weight01KS
# GO:0050867 0.04044123 0.03411746       313         174    0.00383

看起來, GOrilla 的結(jié)果和常用的兩個R包的結(jié)果差的還是挺多的??

出于好奇看一下 topGO 和 clusterProfiler 結(jié)果的差別

topvscp <- intersect(GSEA_BPres$GO.ID, gseaGO_BPresult2$ID)
length(topvscp)
# [1] 19

大概就是這個亞子。


最后,向大家隆重推薦生信技能樹的一系列干貨!

  1. 生信技能樹全球公益巡講:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
  2. B站公益74小時生信工程師教學(xué)視頻合輯:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
  3. 招學(xué)徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容