上一期提到使用exomePeak2進(jìn)行m6A的Peak Calling,結(jié)果如下,每個(gè)樣本一個(gè)結(jié)果:
├── ADDInfo
│ ├── ADDInfo_GLM_allDesigns.csv:Peak識別過程中的一些模型統(tǒng)計(jì)參數(shù)值
│ ├── ADDInfo_ReadsCount.csv:每個(gè)Peak的count值
│ └── ADDInfo_RPKM.csv:每個(gè)peak的RPKM值
├── Mod.bed :peak的bed格式
├── Mod.csv:peak的csv格式
├── Mod.rds:peak的Rdata格式
└── RunInfo.txt:運(yùn)行過程中的一些參數(shù)文件
其中Mod.csv與Mod.bed前12列相同,bed為bed12,具體每一列如下,exomePeak2的結(jié)果不僅直接對每一個(gè)peak進(jìn)行了基因注釋,還輸出了對應(yīng)Peak的IP與Input樣本的count值與RPKM值。篩選Peak主要根據(jù)pvalue或者padj。默認(rèn)為pvalue<1e-05

- 第1列 chr:染色體編號
- 第2列 chromStart:Peak起始位點(diǎn)
- 第3列 chromEnd:Peak終止位點(diǎn)
- 第4列 name:Peak名字
- 第5列 score:the -log2 p value of the peak
- 第6列 strand:Peak在參考基因組上的鏈信息,+表示正鏈,-表示負(fù)鏈
- 第7列 thickStart:同第二列
- 第8列 thickEnd: 同第三列
- 第9列 itemRgb: the column for the RGB encoded color in BED file visualization.
- 第10列 blockCount: the block (exon) number within the peak.
- 第11列 blockSizes: the widths of the blocks.
- 第12列 blockStarts: the start positions of the blocks.
- 第13列 geneID: Peak區(qū)間內(nèi)注釋到的基因ID
- 第14列 ReadsCount.input:the total read count of input samples.
- 第15列 ReadsCount.IP: the total read count of IP samples.
這一列作為重點(diǎn)給予以下說明:
- 第16列 log2FoldChange: the estimate of IP over input log2 fold change (coefficient estimates of βi,1βi,1 in GLM), the Bayesian estimation implemented in the bioconductor package
apeglm[3] will be returned if the regularized NB GLMs are fitted using DESeq2.
這一列很多人會(huì)有個(gè)誤解,主要跟作者給的注釋也有一定關(guān)系
包中的注釋是這個(gè)樣子的:
log2FC_cutoff a numeric value for the cutoff on log2 IP over input fold changes in peak calling; default = 0.
看著就是每一個(gè)Peak的IP樣本中count/ Input樣本中的count之后的log2值,即倍數(shù)關(guān)系
但是,當(dāng)用戶設(shè)置了這個(gè)參數(shù),比如log2FC_cutoff=1的時(shí)候,發(fā)現(xiàn)結(jié)果中依然會(huì)有小于1的結(jié)果,后來給作者寫信,給出答復(fù)如下,供大家參考:
回復(fù)1:
函數(shù)中的p_cutoff和log2FC_cutoff其實(shí)是peak calling時(shí)設(shè)定的閾值,exomePeak2的算法是這樣的:
先在外顯子組上生成劃窗,對每個(gè)劃窗收集到的IP / input 樣本count做統(tǒng)計(jì)檢驗(yàn),并對其p-value (one-sided) 和 log2FC進(jìn)行cut (即p_cutoff和log2FC_cutoff所指定的),顯著的劃窗會(huì)被merge成為peak region,并再次count和計(jì)算統(tǒng)計(jì)值。
所以在exomePeak2最終輸出的表格中,其實(shí)是基于merge的peak所計(jì)算的統(tǒng)計(jì)值,本身已經(jīng)包含了一些positive bias了。而差異甲基化也是基于所有樣本peak calling后得到的peak進(jìn)行的統(tǒng)計(jì)。
回復(fù)2:
那個(gè)log2FC是peak calling時(shí)針對sliding window用的filter,而輸出的結(jié)果是在peak 層面上從新計(jì)算的統(tǒng)計(jì)值,并未和sliding window共用filter。為了避免給后來使用者造成困惑,我在更新的版本中將默認(rèn)的peak calling log2FC cutoff改成0了 (在peak calling中l(wèi)og2FC并不影響很大,主要的影響還是p value)。
- 第17列*pvalue: the Wald test p value on the βi,1βi,1 coefficient in the GLM.
- 第18列*padj: 對pvalue進(jìn)行BH矯正后的fdr值
這個(gè)結(jié)果中已經(jīng)有了Peak區(qū)間的相關(guān)基因注釋,但是沒有功能區(qū)域注釋,即所有文章中最常見的這幅圖:

對于左圖,我們這里給大家介紹繪制m6A修飾位點(diǎn)在基因組上的分布特征圖可視化包:Guitar包
Guitar包與exomePeak2開發(fā)者來自同一個(gè)課題組,熟悉這個(gè)課題組的應(yīng)該知道他們在關(guān)于m6A的分析上開發(fā)了許多的包。
這個(gè)包目前維護(hù)比較少,為了避免踩坑,建議安裝"2.6.0"的版本。
Guitar包出來的結(jié)果特征如下:

Figure 3: m6 A sites on mRNA and lncRNA. In mRNA, the strongest binding sites (“IP/input ratio” larger than 8) are highly enriched near stop codon side of 3?? UTR and deficient on TSS (transcription starting site) side of 5?? UTR and the phenomena are more prominent than lowly methylated sites. In contrast, the m6 A sites are almost uniformly distributed on lncRNA despite the “IP/input ratio” specified. Please note that, in this figure, the size of 5?? UTR, CDS, and 3?? UTR reflects their true width within the transcriptome, so the 5?? UTR region is much shorter compared with the other two components. This result is based on peaks called on human HepG2 dataset [10].
參考:Biomed Research International, 28 Apr 2016, 2016:8367534
代碼如下:
rm(list=ls())
options(stringsAsFactors = F)
# load package
library(Guitar)
package.version("Guitar")
# [1] "2.6.0"
stBedFiles <- list("data/KO1/Mod.bed")
txdb <- makeTxDbFromGFF(file = "data/KO1/Mus_musculus.GRCm38.101.gtf.gz",
format="gtf",
dataSource="Ensembl",
organism="Mus musculus")
p <- GuitarPlot(txTxdb = txdb,
stBedFiles = stBedFiles,
headOrtail = FALSE,
enableCI = FALSE,
mapFilterTranscript = TRUE,
pltTxType = c("mrna"),
stGroupName = "KO1")
p <- p + ggtitle(label = "Distribution on mRNA") +
theme(plot.title = element_text(hjust = 0.5)) + theme_bw()
png(file="data/KO1_guitarPlot_mRNA.pdf",width=8,height=6)
print(p)
dev.off()
png(filename="data/KO1_guitarPlot_mRNA.png",width = 1000,height = 800, res = 180)
print(p)
dev.off()
結(jié)果圖如下:Peaks主要富集在終止密碼子以及3'UTR上。

對于右圖,一般用ChIPseeker軟件來進(jìn)行區(qū)間注釋。
rm(list=ls())
options(stringsAsFactors = F)
#BiocManager::install("ChIPseeker",ask = F)
library(ChIPseeker)
library(org.Hs.eg.db)
library(clusterProfiler)
library(GenomicFeatures)
# annotation file gtf
txdb <- makeTxDbFromGFF(file = "data/KO1/Mus_musculus.GRCm38.101.gtf.gz", format="gtf",
dataSource="Ensembl", organism="Mus musculus")
# overlap
peak_file <- readPeakFile("data/KO1/Mod.bed")
## peak annotation
#region select:"Promoter", "5UTR", "3UTR", "Exon", "Intron", "Downstream", "Intergenic"
peak_anno <- annotatePeak(peak_file,
tssRegion = c(-3000, 3000),
TxDb = txdb,
assignGenomicAnnotation = TRUE,
genomicAnnotationPriority = c("5UTR", "3UTR", "Exon","Intron","Intergenic"),
addFlankGeneInfo = TRUE,
flankDistance = 5000)
pdf(file = "KO1.Anno.Pie.pdf",width = 6,height = 5)
plotAnnoPie(peak_anno)
dev.off()
png(filename="KO1.Anno.Pie.png" ,width=1000, height=850, res=200)
plotAnnoPie(peak_anno)
dev.off()
注釋結(jié)果如下:

如果不喜歡這種注釋,當(dāng)然還可以直接用bedtools進(jìn)行注釋~
下一期分享另類可視化方法,如何與文獻(xiàn)中的圖片一樣好看!