2018-12-04 Hisat2 map 結(jié)果與 samtools flagstat 結(jié)果不一致

如果用Hista2將原始的reads 比對(duì)到參考基因組上,會(huì)產(chǎn)生一個(gè)log, 這個(gè)log 會(huì)顯示你比對(duì)的結(jié)果,我的結(jié)果如下:

[will@200server Arb-hisat]$ hisat2 -t -x E_coli -1 SRR1770413_1.fastq.gz -2 SRR1770413_2.fastq.gz -S RNA.sam
Time loading forward index: 00:00:00
Time loading reference: 00:00:00
Multiseed full-index search: 00:01:14
643253 reads; of these:
  643253 (100.00%) were paired; of these:
    240418 (37.38%) aligned concordantly 0 times
    400101 (62.20%) aligned concordantly exactly 1 time
    2734 (0.43%) aligned concordantly >1 times
    ----
    240418 pairs aligned concordantly 0 times; of these:
      1872 (0.78%) aligned discordantly 1 time
    ----
    238546 pairs aligned 0 times concordantly or discordantly; of these:
      477092 mates make up the pairs; of these:
        432923 (90.74%) aligned 0 times
        43549 (9.13%) aligned exactly 1 time
        620 (0.13%) aligned >1 times
66.35% overall alignment rate      #比對(duì)率
Time searching: 00:01:14
Overall time: 00:01:14

我們發(fā)現(xiàn)比對(duì)率是 66.35%

接著使用 samtools 的 flagstat 進(jìn)行統(tǒng)計(jì),也能得到比對(duì)率

[will@200server Arb-hisat]$ samtools flagstat RNA.bam
1299533 + 0 in total (QC-passed reads + QC-failed reads)
13027 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
866610 + 0 mapped (66.69% : N/A)
1286506 + 0 paired in sequencing
643253 + 0 read1
643253 + 0 read2
805670 + 0 properly paired (62.62% : N/A)
812960 + 0 with itself and mate mapped
40623 + 0 singletons (3.16% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

我們發(fā)現(xiàn)這里的比對(duì)率是 66.69%

差別不是很大,但是為什么同樣的文件,統(tǒng)計(jì)的比對(duì)率是不一樣的呢?

經(jīng)過研究發(fā)現(xiàn),是兩者的計(jì)算方式不同
我們先來看第一個(gè)hisat2的結(jié)果

643253 reads; of these:
  643253 (100.00%) were paired; of these:
    240418 (37.38%) aligned concordantly 0 times
    400101 (62.20%) aligned concordantly exactly 1 time
    2734 (0.43%) aligned concordantly >1 times
    ----
    240418 pairs aligned concordantly 0 times; of these:
      1872 (0.78%) aligned discordantly 1 time
    ----
    238546 pairs aligned 0 times concordantly or discordantly; of these:
      477092 mates make up the pairs; of these:
        432923 (90.74%) aligned 0 times
        43549 (9.13%) aligned exactly 1 time
        620 (0.13%) aligned >1 times
66.35% overall alignment rate

它是以pair作為基數(shù)的 總共有 643253 pair
所以
((400101+2734+1872)X2 + 43549 + 620) / (643253*2) = 66.35%
hista2 計(jì)算正確!

然后我們來看 samtools 的統(tǒng)計(jì)結(jié)果

1299533 + 0 in total (QC-passed reads + QC-failed reads)
13027 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
866610 + 0 mapped (66.69% : N/A)
1286506 + 0 paired in sequencing
643253 + 0 read1
643253 + 0 read2
805670 + 0 properly paired (62.62% : N/A)
812960 + 0 with itself and mate mapped
40623 + 0 singletons (3.16% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

它的map 率計(jì)算就是 以map 上的reads 除以total 的reads

866610 / 1299533 = 66.69%

計(jì)算也完全正確

那么大家可能覺得奇怪,怎么計(jì)算都一樣的呢

其實(shí)差別的根源在于, alignment 有兩次, 一次是 primarily 一個(gè)是 secondary 。

samtools output
1299533 + 0 in total (QC-passed reads + QC-failed reads)
13027 + 0 secondary

如果我們把總的reads 減去 secondary 的

1299533 - 13027 = 1286506
我們發(fā)現(xiàn), 這個(gè)數(shù)字就是 hista2結(jié)果pair的兩倍

hisat2 result
643253 reads; of these:
  643253 (100.00%) were paired; of these:
    240418 (37.38%) aligned concordantly 0 times
    400101 (62.20%) aligned concordantly exactly 1 time
    2734 (0.43%) aligned concordantly >1 times

643253 X 2 = 1286506

1299533 - 13027 = 1286506 # samtools

643253 X 2 = 1286506 # hisat2

這樣兩個(gè)結(jié)果就可以聯(lián)系起來了

兩個(gè)結(jié)果都沒有錯(cuò),只是統(tǒng)計(jì)方法和策略不一樣

samtools 用了 secondary 的結(jié)果,所以比對(duì)率會(huì)稍微高一點(diǎn)

PS. 有疑問真的應(yīng)該多去國外論壇搜索,我中文找了一圈也沒找到答案,多虧洲更發(fā)我samtools的源碼。

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

友情鏈接更多精彩內(nèi)容