之前寫過ChIP-seq的基礎(chǔ)分析:ChIP-seq數(shù)據(jù)分析(一):從raw reads到peaks,和此文有很多相似之處。
簡介
ATAC-seq的目的是鑒定DNA的可觸及區(qū)域,即未被組蛋白保護(hù)的開放區(qū)域,相當(dāng)于DNase I 能夠起作用的區(qū)域。用來探究轉(zhuǎn)錄因子反式調(diào)控機(jī)制。
參考:
ATAC-seq測序的原理及應(yīng)用http://www.360doc.com/content/18/1221/00/52645714_803259507.shtml
0. 參考基因組建立索引
bowtie2-build -f Arabidopsis_thaliana.TAIR10.dna.toplevel.fa TAIR10.bybowtie2
1. 下載數(shù)據(jù)并解壓
for sample in SRR58746{57..62}
do
prefetch $sample
done
ls *sra | while read id; do fastq-dump --gzip --split-3 --defline-qual '+' --defline-seq '@$ac-$si/$ri' ${id} & done
2. 數(shù)據(jù)質(zhì)控
cd ~/ATAC_seq/raw
mkdir fastqc_report
fastqc *fastq.gz -o ./fastqc_report/
cd fastqc_report/
~/miniconda3/bin/multiqc ./
安裝multiqc
git clone https://github.com/ewels/MultiQC.git
~/miniconda3/bin/python3 setup.py install
~/miniconda3/bin/multiqc
從文獻(xiàn)中可以看到測的是paired-end 36 nt reads。在經(jīng)過fastqc檢測之后發(fā)現(xiàn),不含接頭且測序質(zhì)量很高,有一個樣本的一端reads中含有1/1000的overrepresented sequences,沒有做處理。
3. 比對
cd ~/ATAC_seq/align
for i in SRR58746{57..62}
do
bowtie2 -x /ifs1/Grp3/huangsiyuan/Ref/Arabidopsis/TAIR10.bybowtie2 -X 1000 -1 /ifs1/Grp3/huangsiyuan/ATAC_seq/raw/${i}_1.fastq.gz -2 /ifs1/Grp3/huangsiyuan/ATAC_seq/raw/${i}_2.fastq.gz | samtools view -F 4 -bS - > ./${i}.bam &
done
在上述比對命令中,
-F 4表示:只要flag中(是一個累加值)含有4,就不要這一個reads比對記錄;4表示:read unmapped (0x4)
下文中有-f的用法,
-f 2表示:只要flag中(是一個累加值)含有2,就保留這個比對記錄;2表示:read mapped in proper pair (0x2)
關(guān)于sam文件的flag值可以在這里查詢到,
http://broadinstitute.github.io/picard/explain-flags.html
4. 比對之后的質(zhì)控
cd ~/ATAC_seq/align
for i in SRR58746{57..62}
do
samtools view -f 2 -q 30 -o ${i}.f2.q30.bam ${i}.bam #注1
samtools sort -@ 4 -o ${i}.f2.q30.sort.bam ${i}.f2.q30.bam
java -jar /ifs1/Software/biosoft/picard/picard.jar MarkDuplicates VERBOSITY=ERROR QUIET=true \
CREATE_INDEX=false REMOVE_DUPLICATES=true \
INPUT=${i}.f2.q30.sort.bam \
OUTPUT=${i}.f2.q30.sort.rmdup.bam \
M=${i}.marked_dup.log
samtools view -h ${i}.f2.q30.sort.rmdup.bam | grep "Mt" -v | grep "Pt" -v | samtools view -bS -o ${i}.final.bam
done
注1
此時bam文件的第二列flag值只有4種
99: 0x1; 0x2; 0x20; 0x40(都是16進(jìn)制)
147: 0x1; 0x2; 0x10; 0x80
83: 0x1; 0x2; 0x10; 0x40
163: 0x1; 0x2; 0x20; 0x80
我的理解
0x1:read paired,表示雙端測序,并且兩條reads都在比對記錄中
0x2:read mapped in proper pair,表示這條read比對上了,它的另一條read也比對上,并且它倆距離合適
0x10:read reverse strand,這條read比對到負(fù)鏈
0x20:mate reverse strand,這條read對應(yīng)的另一條read比對到負(fù)鏈(一般都是一條比對到正鏈,另一條比對到負(fù)鏈)
0x40:first in pair,一對reads中的第一條
0x80:second in pair,一對reads中的第二條(根據(jù)read的名稱來判斷,如fastq文件中reads名稱的/1、/2,這個標(biāo)識會在比對之后省略)
5. 插入片段長度分布
- 什么叫insert size?參考《一篇文章說清楚什么是“插入片段”》——公眾號“堿基礦工”
- 如何從sam文件中求得insert size:在“read mapped in proper pair”的前提下,取第九列
cd ~/ATAC_seq/align
for i in SRR58746{57..62}
do
samtools view -f 64 ${i}.final.bam | perl -ane 'print abs($F[8]);print "\n";' | sort -n > ${i}.insert_size.txt
done
這里-f 64的64就是0x40,表示提取一半的比對記錄,因?yàn)閮蓷lreads對應(yīng)一個insert size。
在RStudio上運(yùn)行
df <- read.table("SRR5874659.insert_size.txt",header = F)
hist(df$V1,main = "insert size distribution",ylab = "read count",xlab = "insert size",breaks = seq(0,max(df$V1),by = 1), col = "lightblue",border="lightblue")
axis(side = 1,at=seq(0,max(df$V1),by = 100),labels = seq(0,max(df$V1),by = 100))

6. Peak Calling
Chip-seq與ATAC-seq call peaks的區(qū)別,前者peaks是代表抗體結(jié)合轉(zhuǎn)錄因子的位點(diǎn),后者peaks是代表Tn5轉(zhuǎn)座酶切開染色質(zhì)開放區(qū)的兩端,因此在一個位置,前者peaks有一個,后者有兩個

6.1 bamtobed
for i in SRR58746{57..62}
do
bedtools bamtobed -i ${i}.final.bam > ${i}.final.bed
done
這里需要留意一下bam和bed兩種文件的起始坐標(biāo),分別是以1、0開始的
1 0 35 SRR5874657-27764514/2 42 +
1 13 48 SRR5874657-32600261/2 42 +
1 18 54 SRR5874657-18292360/1 42 +
6.2 call peaks
cd ~/ATAC_seq/peak/
for i in SRR58746{57..62}
do
macs2 callpeak -t ~/ATAC_seq/align/${i}.final.bed -n ${i} --shift -75 --extsize 150 --nomodel -B --SPMR -g 110000000 --keep-dup all
done
關(guān)于結(jié)果的解讀,參考:http://www.itdecent.cn/p/e83a7e10ea2e
7. bedGraphToBigWig
之前ChIP-seq里面講過deeptools來轉(zhuǎn)格式bam2bw,現(xiàn)在學(xué)的是bdg2bw,是一回事嗎?
這一步使用到的工具:bedSort bedClip bedGraphToBigWig。下載:http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedSort
http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedClip
http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedGraphToBigWig
下載之后,賦予權(quán)限,再轉(zhuǎn)移到一個PATH包含的目錄下,
mv bedSort bedClip bedGraphToBigWig ~/local/bin/
#求染色體序列大小
samtools view -H SRR5874657.final.bam | perl -ne 'if(/SN:(\S+)\s+LN:(\d+)/){print "$1\t$2\n"}' > chr.size
for i in SRR58746{57..62}
do
#對treat_pileup.bdg文件的前三列排序,原文件的染色體順序不定
~/local/bin/bedSort ${i}_treat_pileup.bdg ${i}_treat_pileup.sort.bdg
#根據(jù)染色體的最大坐標(biāo),修正部分記錄的第三列(原文件的這些記錄的第三列大于max_chr_pos,原因是前面在call peaks的時候reads有偏移);
#這樣一來,這些記錄的第二列就比第三列大了,需要過濾掉
~/local/bin/bedClip -truncate ${i}_treat_pileup.sort.bdg ~/ATAC_seq/align/chr.size stdout | perl -ane 'print if($F[1] < $F[2])' > ${i}_treat_pileup.bedgrafh
#以上兩步的目的是生成一個“可讀性”良好的bedgrafh文件,與原bedgrafh文件有區(qū)別
#bedgraph to bigwig
~/local/bin/bedGraphToBigWig ${i}_treat_pileup.bedgrafh ~/ATAC_seq/align/chr.size ${i}_treat_pileup.bw
done
8. 可視化
8.1 TSS Enrichment
TSS富集結(jié)果依賴于基因注釋結(jié)果。這個圖在學(xué)習(xí)ChIP-seq的時候也畫過,方法一樣
cd ~/ATAC_seq/visual
for i in SRR58746{57..62}
do
~/miniconda3/bin/computeMatrix reference-point -S ~/ATAC_seq/peak/${i}_treat_pileup.bw -R ~/Ref/Arabidopsis/Arabidopsis_thaliana.TAIR10.44.gtf -a 3000 -b 3000 -p 1 -o ${i}.matrix.TSS.3k.gz
~/miniconda3/bin/plotHeatmap -m ${i}.matrix.TSS.3k.gz -o ${i}.TSS.3k.png --colorMap Reds
done
8.2 IGV中查看數(shù)據(jù)
導(dǎo)入bigwig文件和narrowPeak文件到IGV中(可以顯示peaks的名稱),即可查看ATAC-Seq數(shù)據(jù)的結(jié)果。這一步之前也做過
9. 多樣品數(shù)據(jù)可重復(fù)評估
9.1 用bedtools計(jì)算overlap
wc -l SRR587465{7..9}_peaks.narrowPeak
30123 SRR5874657_peaks.narrowPeak
30644 SRR5874658_peaks.narrowPeak
30806 SRR5874659_peaks.narrowPeak
bedtools intersect -a SRR5874657_peaks.narrowPeak -b SRR5874658_peaks.narrowPeak -wa > tmp
bedtools intersect -a tmp -b SRR5874659_peaks.narrowPeak -wa | wc -l && rm -f tmp
29310
wc -l SRR587466{0..2}_peaks.narrowPeak
23965 SRR5874660_peaks.narrowPeak
26038 SRR5874661_peaks.narrowPeak
24812 SRR5874662_peaks.narrowPeak
bedtools intersect -a SRR5874660_peaks.narrowPeak -b SRR5874661_peaks.narrowPeak -wa > tmp
bedtools intersect -a tmp -b SRR5874662_peaks.narrowPeak -wa | wc -l && rm -f tmp
22773
由以上可知,重復(fù)性不錯。
9.2 IDR評估
同時考慮peaks間的overlap和富集倍數(shù)的一致性。
使用和結(jié)果解讀參考:http://www.itdecent.cn/p/d8a7056b4294
#安裝
~/miniconda3/bin/conda install idr
#先根據(jù)-log10(p-value)進(jìn)行排序
for i in SRR58746{57..62}
do
sort -k8,8nr ~/ATAC_seq/peak/${i}_peaks.narrowPeak > ~/ATAC_seq/idr/${i}_peaks_sort.narrowPeak
done
#使用示例
~/miniconda3/bin/idr --samples SRR5874657_peaks_sort.narrowPeak SRR5874658_peaks_sort.narrowPeak -o SRR5874657_SRR5874658_idr --plot --input-file-type narrowP
eak --rank p.value &
會生成兩個文件:
SRR5874657_SRR5874658_idr.png
SRR5874657_SRR5874658_idr

沒有通過特定IDR閾值的peaks顯示為紅色
wc -l SRR5874657_SRR5874658_idr
28433 #共有的部分
awk '{if($5 >= 540) print $0}' SRR5874657_SRR5874658_idr | wc -l
19317 #滿足IDR的部分
19317/28433=67.9%
10. 注釋
這一步前面ChIP-seq也做了,感覺差不多
library(ChIPseeker)
library(GenomicFeatures)
ath <- makeTxDbFromGFF("E:/Computational_Biologist/生信積累/測序類/表觀遺傳/ChIP-seq/擬南芥ChIP-seq_BMC基因組/ref/Arabidopsis_thaliana.TAIR10.44.gff3")
peaksfile <- "SRR5874658_peaks.narrowPeak"
peak <- readPeakFile(peaksfile,header=F)
outname="SRR5874658"
peakAnno <- annotatePeak(peak,TxDb = ath,assignGenomicAnnotation = TRUE)
pdf(file = paste0(outname,"peakAnnotation.pdf"))
plotAnnoPie(peakAnno,main=paste0(outname,"\nDistribution of Peaks"),line=-8)
dev.off()
#保存peaks的注釋文件方便改圖
write.table(as.data.frame(peakAnno@anno),file = paste0(outname,"peakAnnotation.txt"),sep = "\t",row.names = FALSE)
