如果用Hista2將原始的reads 比對(duì)到參考基因組上,會(huì)產(chǎn)生一個(gè)log, 這個(gè)log 會(huì)顯示你比對(duì)的結(jié)果,我的結(jié)果如下:
[will@200server Arb-hisat]$ hisat2 -t -x E_coli -1 SRR1770413_1.fastq.gz -2 SRR1770413_2.fastq.gz -S RNA.sam
Time loading forward index: 00:00:00
Time loading reference: 00:00:00
Multiseed full-index search: 00:01:14
643253 reads; of these:
643253 (100.00%) were paired; of these:
240418 (37.38%) aligned concordantly 0 times
400101 (62.20%) aligned concordantly exactly 1 time
2734 (0.43%) aligned concordantly >1 times
----
240418 pairs aligned concordantly 0 times; of these:
1872 (0.78%) aligned discordantly 1 time
----
238546 pairs aligned 0 times concordantly or discordantly; of these:
477092 mates make up the pairs; of these:
432923 (90.74%) aligned 0 times
43549 (9.13%) aligned exactly 1 time
620 (0.13%) aligned >1 times
66.35% overall alignment rate #比對(duì)率
Time searching: 00:01:14
Overall time: 00:01:14
我們發(fā)現(xiàn)比對(duì)率是 66.35%
接著使用 samtools 的 flagstat 進(jìn)行統(tǒng)計(jì),也能得到比對(duì)率
[will@200server Arb-hisat]$ samtools flagstat RNA.bam
1299533 + 0 in total (QC-passed reads + QC-failed reads)
13027 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
866610 + 0 mapped (66.69% : N/A)
1286506 + 0 paired in sequencing
643253 + 0 read1
643253 + 0 read2
805670 + 0 properly paired (62.62% : N/A)
812960 + 0 with itself and mate mapped
40623 + 0 singletons (3.16% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
我們發(fā)現(xiàn)這里的比對(duì)率是 66.69%
差別不是很大,但是為什么同樣的文件,統(tǒng)計(jì)的比對(duì)率是不一樣的呢?
經(jīng)過研究發(fā)現(xiàn),是兩者的計(jì)算方式不同
我們先來看第一個(gè)hisat2的結(jié)果
643253 reads; of these:
643253 (100.00%) were paired; of these:
240418 (37.38%) aligned concordantly 0 times
400101 (62.20%) aligned concordantly exactly 1 time
2734 (0.43%) aligned concordantly >1 times
----
240418 pairs aligned concordantly 0 times; of these:
1872 (0.78%) aligned discordantly 1 time
----
238546 pairs aligned 0 times concordantly or discordantly; of these:
477092 mates make up the pairs; of these:
432923 (90.74%) aligned 0 times
43549 (9.13%) aligned exactly 1 time
620 (0.13%) aligned >1 times
66.35% overall alignment rate
它是以pair作為基數(shù)的 總共有 643253 pair
所以
((400101+2734+1872)X2 + 43549 + 620) / (643253*2) = 66.35%
hista2 計(jì)算正確!
然后我們來看 samtools 的統(tǒng)計(jì)結(jié)果
1299533 + 0 in total (QC-passed reads + QC-failed reads)
13027 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
866610 + 0 mapped (66.69% : N/A)
1286506 + 0 paired in sequencing
643253 + 0 read1
643253 + 0 read2
805670 + 0 properly paired (62.62% : N/A)
812960 + 0 with itself and mate mapped
40623 + 0 singletons (3.16% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
它的map 率計(jì)算就是 以map 上的reads 除以total 的reads
866610 / 1299533 = 66.69%
計(jì)算也完全正確
那么大家可能覺得奇怪,怎么計(jì)算都一樣的呢
其實(shí)差別的根源在于, alignment 有兩次, 一次是 primarily 一個(gè)是 secondary 。
samtools output
1299533 + 0 in total (QC-passed reads + QC-failed reads)
13027 + 0 secondary
如果我們把總的reads 減去 secondary 的
1299533 - 13027 = 1286506
我們發(fā)現(xiàn), 這個(gè)數(shù)字就是 hista2結(jié)果pair的兩倍
hisat2 result
643253 reads; of these:
643253 (100.00%) were paired; of these:
240418 (37.38%) aligned concordantly 0 times
400101 (62.20%) aligned concordantly exactly 1 time
2734 (0.43%) aligned concordantly >1 times
643253 X 2 = 1286506
1299533 - 13027 = 1286506 # samtools
643253 X 2 = 1286506 # hisat2
這樣兩個(gè)結(jié)果就可以聯(lián)系起來了
兩個(gè)結(jié)果都沒有錯(cuò),只是統(tǒng)計(jì)方法和策略不一樣
samtools 用了 secondary 的結(jié)果,所以比對(duì)率會(huì)稍微高一點(diǎn)
PS. 有疑問真的應(yīng)該多去國外論壇搜索,我中文找了一圈也沒找到答案,多虧洲更發(fā)我samtools的源碼。