Genome alignment with Bowtie 2

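These are notes on the tools I use for ChIP-seq / CUT&Tag analysis. This post only gives a brief introduction to how the tools are used.

Bowtie 2 is a widely used genome aligner. I won't go into how it works here; interested readers can consult the official documentation and the original paper (https://doi.org/10.1038/nmeth.1923). Below is a brief introduction to the Bowtie 2 indexing and alignment commands and the parameters I commonly use.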

Usage

Index

bowtie2-build [options]* <reference_in> <bt2_base>

<reference_in>: with -f (the default), a comma-separated list of reference FASTA files to index; with -c, the reference sequences themselves given on the command line, e.g. GGTCATCCT,ACGGGTCGT,CCGTTCTATGCGGCTTA.
<bt2_base>: the prefix of the index files to write. By default, bowtie2-build writes NAME.1.bt2, NAME.2.bt2, NAME.3.bt2, NAME.4.bt2, NAME.rev.1.bt2, and NAME.rev.2.bt2, where NAME is <bt2_base>.
--threads: number of threads to use

Example

bowtie2-build -f /public/Reference/GRCh38.primary_assembly.genome.fa --threads 24 GRCh38

The command above builds index files with the prefix GRCh38 in the current directory from the FASTA file /public/Reference/GRCh38.primary_assembly.genome.fa.
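As a quick sanity check after indexing, the file names listed for <bt2_base> can be generated and checked from Python. This is a minimal sketch: the helper names `expected_index_files` and `missing_index_files` are my own, and it only covers the six .bt2 files named above (very large genomes get a "large" index with a different extension, which is not handled here).

```python
from pathlib import Path

# Suffixes bowtie2-build writes for a (small) index, as listed above.
BT2_SUFFIXES = (".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2")


def expected_index_files(bt2_base):
    """Return the index file paths expected for a given <bt2_base> prefix."""
    return [Path(f"{bt2_base}{suffix}") for suffix in BT2_SUFFIXES]


def missing_index_files(bt2_base):
    """Return the expected index files that are not present on disk."""
    return [path for path in expected_index_files(bt2_base) if not path.exists()]


if __name__ == "__main__":
    # Prefix taken from the example command above.
    for path in expected_index_files("GRCh38"):
        print(path)
```

An empty list from `missing_index_files("GRCh38")` suggests the build finished writing all six files.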

Alignment

Single-end alignment

bowtie2 [options]* -x <bt2-idx> -U <fq> -S <sam_output> -p <threads> 2>Align.summary

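-x: prefix of the reference genome index files (including the path)
-U: FASTQ file of single-end reads
-S: output SAM file containing the alignments
-p: number of threads to use
"2>Align.summary": redirects standard error, which would otherwise be printed to the screen, to the file "Align.summary". Bowtie 2 writes its alignment summary to standard error; the format is typically as follows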

## Single-end
20000 reads; of these:
  20000 (100.00%) were unpaired; of these:
    1247 (6.24%) aligned 0 times
    18739 (93.69%) aligned exactly 1 time
    14 (0.07%) aligned >1 times
93.77% overall alignment rate

## Paired-end
10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
    ----
    650 pairs aligned concordantly 0 times; of these:
      34 (5.23%) aligned discordantly 1 time
    ----
    616 pairs aligned 0 times concordantly or discordantly; of these:
      1232 mates make up the pairs; of these:
        660 (53.57%) aligned 0 times
        571 (46.35%) aligned exactly 1 time
        1 (0.08%) aligned >1 times
96.70% overall alignment rate
The indentation indicates how subtotals relate to totals.
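To pull numbers out of Align.summary programmatically, a small parser is enough. The sketch below is illustrative only: `parse_summary` is a hypothetical helper of mine, the regular expressions are written against the single-end example shown above, and other Bowtie 2 versions or options may word the summary differently.

```python
import re

# The single-end summary from the example above.
SINGLE_END_SUMMARY = """\
20000 reads; of these:
  20000 (100.00%) were unpaired; of these:
    1247 (6.24%) aligned 0 times
    18739 (93.69%) aligned exactly 1 time
    14 (0.07%) aligned >1 times
93.77% overall alignment rate
"""


def parse_summary(text):
    """Extract the total read count, per-category counts and the overall rate."""
    result = {"counts": {}}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = re.match(r"^(\d+) reads; of these:$", line)
        if match:
            result["total"] = int(match.group(1))
            continue
        # Lines such as "18739 (93.69%) aligned exactly 1 time".
        match = re.match(r"^(\d+) \([\d.]+%\) (.+?)(?:; of these:)?$", line)
        if match:
            result["counts"][match.group(2)] = int(match.group(1))
            continue
        match = re.match(r"^([\d.]+)% overall alignment rate$", line)
        if match:
            result["overall_rate"] = float(match.group(1))
    return result


if __name__ == "__main__":
    summary = parse_summary(SINGLE_END_SUMMARY)
    counts = summary["counts"]
    aligned = counts["aligned exactly 1 time"] + counts["aligned >1 times"]
    # Recompute the overall rate from the subtotals.
    print(100 * aligned / summary["total"], summary["overall_rate"])
```

The subtotals add up to the reported rate: (18739 + 14) / 20000 is 93.765%, reported as 93.77%. The paired-end example is consistent with counting mates rather than pairs: (2 × (8823 + 527 + 34) + 571 + 1) / (2 × 10000) = 19340 / 20000 = 96.70%.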

Paired-end alignment

bowtie2 [options]* -x <bt2-idx> -1 <fq1> -2 <fq2> -S <sam_output> -p <threads> 2>Align.summary

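Paired-end mode is essentially the same as single-end mode; only the arguments that pass the FASTQ files change
-1: FASTQ file containing the mate 1 reads
-2: FASTQ file containing the mate 2 reads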
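When alignments are launched from a script, the two invocations differ only in how the FASTQ files are passed. The following is a rough sketch under my own assumptions: `build_bowtie2_cmd` and `run_bowtie2` are hypothetical helper names, the file names in the usage example are placeholders, and only the options described above (-x, -U, -1, -2, -S, -p) are used.

```python
import subprocess


def build_bowtie2_cmd(index_prefix, sam_output, threads,
                      unpaired=None, mate1=None, mate2=None):
    """Build a bowtie2 argument list for either single-end or paired-end input."""
    cmd = ["bowtie2", "-x", index_prefix]
    if unpaired is not None and mate1 is None and mate2 is None:
        cmd += ["-U", unpaired]            # single-end reads
    elif unpaired is None and mate1 is not None and mate2 is not None:
        cmd += ["-1", mate1, "-2", mate2]  # paired-end reads
    else:
        raise ValueError("pass either unpaired, or both mate1 and mate2")
    cmd += ["-S", sam_output, "-p", str(threads)]
    return cmd


def run_bowtie2(cmd, summary_path="Align.summary"):
    """Run the command and write standard error to summary_path (the 2> part)."""
    with open(summary_path, "w") as summary:
        subprocess.run(cmd, stderr=summary, check=True)


if __name__ == "__main__":
    # Placeholder file names; nothing is executed here, the command is only printed.
    print(" ".join(build_bowtie2_cmd("GRCh38", "sample.sam", 24,
                                     mate1="sample_R1.fastq", mate2="sample_R2.fastq")))
```

Passing an argument list to `subprocess.run` avoids shell quoting problems with unusual file names, which is why the redirection is done by opening the summary file in Python rather than with `2>`.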

Bowtie 2 has many more alignment parameters that can be tuned; I won't cover them all here. Next, the meaning of each column in the SAM file it outputs.

SAM OUTPUT

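Apart from header lines (which begin with @), each line of the SAM file describes the alignment of one read and contains at least 12 tab-separated columns. From left to right, the columns are:

  1. Read name
  2. Sum of the flags

In Bowtie 2, the flags have the following meanings: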
1: The read is one of a pair
2: The alignment is one end of a proper paired-end alignment
4: The read has no reported alignments
8: The read is one of a pair and has no reported alignments
16: The alignment is to the reverse reference strand
32: The other mate in the paired-end alignment is aligned to the reverse reference strand
64: The read is mate 1 in a pair
128: The read is mate 2 in a pair
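Note that the flag bits themselves are defined by the SAM specification, but aligners can differ in exactly when they set some of them (for example, what counts as a proper pair).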

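  3. Name of the reference sequence (chromosome) the read aligned to
  4. 1-based offset into the forward reference strand where the leftmost character of the alignment occurs
  5. Mapping quality
  6. CIGAR string describing the alignment
  7. For paired-end reads, the name of the reference sequence where the mate's alignment occurs: = if it is the same as this alignment's, * if there is no mate
  8. For paired-end reads, the 1-based offset into the forward reference strand where the leftmost character of the mate's alignment occurs; 0 if there is no mate
  9. Inferred fragment length. A negative value means the mate's alignment lies upstream of this alignment. The value is 0 if the mates did not align concordantly, with one exception: it is non-zero if the mates aligned discordantly to the same chromosome
  10. Read sequence (reverse-complemented if aligned to the reverse strand)
  11. ASCII-encoded read base qualities
  12. Optional fields, including the following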
AS:i:<N> Alignment score. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if SAM record is for an aligned read. 
XS:i:<N> Alignment score for the best-scoring alignment found other than the alignment reported. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if the SAM record is for an aligned read and more than one alignment was found for the read. Note that, when the read is part of a concordantly-aligned pair, this score could be greater than AS:i. 
YS:i:<N> Alignment score for opposite mate in the paired-end alignment. Only present if the SAM record is for a read that aligned as part of a paired-end alignment. 
XN:i:<N> The number of ambiguous bases in the reference covering this alignment. Only present if SAM record is for an aligned read. 
XM:i:<N> The number of mismatches in the alignment. Only present if SAM record is for an aligned read. 
XO:i:<N> The number of gap opens, for both read and reference gaps, in the alignment. Only present if SAM record is for an aligned read. 
XG:i:<N> The number of gap extensions, for both read and reference gaps, in the alignment. Only present if SAM record is for an aligned read. 
NM:i:<N> The edit distance; that is, the minimal number of one-nucleotide edits (substitutions, insertions and deletions) needed to transform the read string into the reference string. Only present if SAM record is for an aligned read. 
YF:Z:<S> String indicating reason why the read was filtered out. See also: Filtering. Only appears for reads that were filtered out. 
YT:Z:<S> Value of UU indicates the read was not part of a pair. Value of CP indicates the read was part of a pair and the pair aligned concordantly. Value of DP indicates the read was part of a pair and the pair aligned discordantly. Value of UP indicates the read was part of a pair but the pair failed to align either concordantly or discordantly.
MD:Z:<S> A string representation of the mismatched reference bases in the alignment.
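To make the column layout and the flag sum concrete, here is a small sketch that splits one SAM line into its 11 mandatory fields plus optional fields and decodes the flag sum using the table above. The SAM record in it is made up for illustration, and `parse_sam_line` and `decode_flags` are hypothetical helper names; for real work a dedicated library is likely a better choice.

```python
# Flag meanings as listed in the table above.
FLAG_MEANINGS = {
    1: "The read is one of a pair",
    2: "The alignment is one end of a proper paired-end alignment",
    4: "The read has no reported alignments",
    8: "The read is one of a pair and has no reported alignments",
    16: "The alignment is to the reverse reference strand",
    32: "The other mate in the paired-end alignment is aligned to the reverse reference strand",
    64: "The read is mate 1 in a pair",
    128: "The read is mate 2 in a pair",
}

# Names of the 11 mandatory columns, following the SAM specification.
MANDATORY = ("qname", "flag", "rname", "pos", "mapq", "cigar",
             "rnext", "pnext", "tlen", "seq", "qual")


def decode_flags(flag_sum):
    """Return the meanings of all flag bits set in flag_sum."""
    return [meaning for bit, meaning in FLAG_MEANINGS.items() if flag_sum & bit]


def parse_sam_line(line):
    """Split one alignment line into mandatory fields and optional TAG:TYPE:VALUE fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(MANDATORY, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    tags = {}
    for field in fields[11:]:
        tag, value_type, value = field.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    record["tags"] = tags
    return record


# A made-up record for illustration only (not real Bowtie 2 output).
EXAMPLE_LINE = "\t".join([
    "read1", "99", "chr1", "10000", "42", "10M", "=", "10150", "160",
    "ACGTACGTAC", "IIIIIIIIII", "AS:i:0", "NM:i:0", "MD:Z:10", "YS:i:0", "YT:Z:CP",
])

if __name__ == "__main__":
    record = parse_sam_line(EXAMPLE_LINE)
    print(record["rname"], record["pos"], record["tags"]["YT"])
    for meaning in decode_flags(record["flag"]):
        print(meaning)
```

In this example the flag sum 99 is 1 + 2 + 32 + 64: a paired read, in a proper pair, whose mate aligned to the reverse strand, and which is mate 1.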

That wraps up my notes on genome alignment with Bowtie 2; I'll add more as I learn more.

ref:
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#how-is-bowtie-2-different-from-bowtie-1

The end.
