These are my notes on the tools I use for ChIP-seq / CUT&Tag analysis. This post gives only a brief introduction to how each tool is used.

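Bowtie 2 is a widely used genome aligner. I will not go into how it works here; interested readers can consult the official documentation and the published paper (https://doi.org/10.1038/nmeth.1923). Below is a brief introduction to the Bowtie 2 indexing and alignment commands, along with the parameters I usually use.
Usage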
Index
bowtie2-build [options]* <reference_in> <bt2_base>
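<reference_in>: with -f (the default), a comma-separated list of FASTA files containing the reference sequences to index; with -c, the reference sequences themselves given on the command line, e.g. GGTCATCCT,ACGGGTCGT,CCGTTCTATGCGGCTTA.
<bt2_base>: the prefix of the generated index files. By default, bowtie2-build produces NAME.1.bt2, NAME.2.bt2, NAME.3.bt2, NAME.4.bt2, NAME.rev.1.bt2, and NAME.rev.2.bt2, where NAME is <bt2_base>.
--threads: the number of threads to use
Example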
bowtie2-build -f /public/Reference/GRCh38.primary_assembly.genome.fa --threads 24 GRCh38
The command above builds an index from the FASTA file /public/Reference/GRCh38.primary_assembly.genome.fa and writes index files with the prefix GRCh38 to the current directory.
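After building an index, it can be handy to confirm that all six files listed above are present before starting an alignment. The following is a minimal sketch, not part of Bowtie 2 itself: the function name `missing_index_files` is hypothetical, and it assumes a small index with the `.bt2` suffixes shown above (very large references produce a large index with `.bt2l` suffixes instead, which this sketch does not check).

```python
from pathlib import Path

# Suffixes of a small Bowtie 2 index, as listed in the text above.
INDEX_SUFFIXES = (
    ".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2",
)


def missing_index_files(bt2_base, suffixes=INDEX_SUFFIXES):
    """Return the expected index files for this prefix that are not on disk."""
    # Append each suffix to the full prefix string. Path.with_suffix is not
    # used because an index prefix may itself contain dots.
    expected = [Path(str(bt2_base) + suffix) for suffix in suffixes]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # "GRCh38" is the prefix used in the example command above.
    missing = missing_index_files("GRCh38")
    if missing:
        print("Missing index files:", ", ".join(str(p) for p in missing))
    else:
        print("All six index files found.")
```

An empty list means the prefix can be passed to `bowtie2 -x` as is.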

Alignment
Single-end alignment
bowtie2 [options]* -x <bt2-idx> -U <fq> -S <sam_output> -p <threads> 2>Align.summary
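-x: the prefix of the reference genome index files (including the path)
-U: the FASTQ file of single-end reads
-S: the output SAM file containing the alignments
-p: the number of threads to use
"2>Align.summary": redirects standard error, which would otherwise be printed to the screen, to the file "Align.summary". Its format is typically as follows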
## Single-end
20000 reads; of these:
  20000 (100.00%) were unpaired; of these:
    1247 (6.24%) aligned 0 times
    18739 (93.69%) aligned exactly 1 time
    14 (0.07%) aligned >1 times
93.77% overall alignment rate
## Paired-end
10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
    ----
    650 pairs aligned concordantly 0 times; of these:
      34 (5.23%) aligned discordantly 1 time
    ----
    616 pairs aligned 0 times concordantly or discordantly; of these:
      1232 mates make up the pairs; of these:
        660 (53.57%) aligned 0 times
        571 (46.35%) aligned exactly 1 time
        1 (0.08%) aligned >1 times
96.70% overall alignment rate
The indentation indicates how subtotals relate to totals.
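When many samples are processed, the key numbers can be pulled out of each "Align.summary" file instead of being read by eye. The sketch below is one possible way to do this with regular expressions; the function name `parse_alignment_summary` and the dictionary keys are my own choices, and it assumes the summary follows the layout shown above.

```python
import re

# The single-end summary shown above, used here as sample input.
SAMPLE_SUMMARY = """20000 reads; of these:
  20000 (100.00%) were unpaired; of these:
    1247 (6.24%) aligned 0 times
    18739 (93.69%) aligned exactly 1 time
    14 (0.07%) aligned >1 times
93.77% overall alignment rate
"""


def parse_alignment_summary(text):
    """Extract headline numbers from a Bowtie 2 alignment summary."""
    total = re.search(r"^\s*(\d+) reads; of these:", text, flags=re.MULTILINE)
    rate = re.search(
        r"^\s*([\d.]+)% overall alignment rate", text, flags=re.MULTILINE
    )
    # First "exactly 1 time" line: unpaired reads in single-end mode,
    # concordant pairs in paired-end mode.
    unique = re.search(
        r"^\s*(\d+) \([\d.]+%\) aligned (?:concordantly )?exactly 1 time",
        text,
        flags=re.MULTILINE,
    )
    if total is None or rate is None:
        raise ValueError("text does not look like a Bowtie 2 summary")
    return {
        "total_reads": int(total.group(1)),
        "uniquely_aligned": int(unique.group(1)) if unique else None,
        "overall_rate_pct": float(rate.group(1)),
    }


if __name__ == "__main__":
    print(parse_alignment_summary(SAMPLE_SUMMARY))
```

For a real run, the text would come from reading the "Align.summary" file written by the `2>` redirection.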
Paired-end alignment
bowtie2 [options]* -x <bt2-idx> -1 <fq1> -2 <fq2> -S <sam_output> -p <threads> 2>Align.summary
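Paired-end mode is essentially the same as single-end mode; only the parameters that pass in the FASTQ files change.
-1: the FASTQ file of mate 1 reads
-2: the FASTQ file of mate 2 reads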
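Since the single-end and paired-end commands differ only in how the FASTQ files are passed, a small helper can assemble either one. This is a sketch under assumptions: the function names `bowtie2_command` and `run_alignment` and all file names are hypothetical, only the options described above (-x, -U, -1, -2, -S, -p) are used, and `run_alignment` requires bowtie2 to be installed and on the PATH.

```python
import shlex
import subprocess


def bowtie2_command(index_prefix, sam_output, threads, reads, mates=None):
    """Build the bowtie2 argument list for single-end or paired-end reads."""
    cmd = ["bowtie2", "-x", index_prefix, "-p", str(threads)]
    if mates is None:
        cmd += ["-U", reads]  # single-end: one FASTQ file
    else:
        cmd += ["-1", reads, "-2", mates]  # paired-end: mate 1 and mate 2
    cmd += ["-S", sam_output]
    return cmd


def run_alignment(cmd, summary_path="Align.summary"):
    """Run bowtie2 and write standard error (the summary) to a file."""
    with open(summary_path, "w") as handle:
        subprocess.run(cmd, stderr=handle, check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    single = bowtie2_command("GRCh38", "sample.sam", 8, "sample.fq")
    paired = bowtie2_command(
        "GRCh38", "sample.sam", 8, "sample_1.fq", "sample_2.fq"
    )
    print(" ".join(shlex.quote(arg) for arg in single))
    print(" ".join(shlex.quote(arg) for arg in paired))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names; the `stderr=handle` argument plays the role of `2>Align.summary`.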
Bowtie 2 has many more alignment parameters that can be tuned; I will not go through them one by one here. Next, the meaning of each column in the SAM output.
SAM OUTPUT
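Each alignment line in a SAM file (that is, each line after the @ header lines) describes the alignment of one read and contains at least 12 tab-separated columns. From left to right, the columns are:
- Name of the read
- Sum of the flags
In Bowtie 2, the flags have the following meanings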
  1: The read is one of a pair
  2: The alignment is one end of a proper paired-end alignment
  4: The read has no reported alignments
  8: The read is one of a pair and has no reported alignments
  16: The alignment is to the reverse reference strand
  32: The other mate in the paired-end alignment is aligned to the reverse reference strand
  64: The read is mate 1 in a pair
  128: The read is mate 2 in a pair
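Note that the bit values themselves are defined by the SAM specification, but aligners can differ in which flags they set and under what conditions.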
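- Name of the reference sequence (chromosome) the read aligned to
- 1-based coordinate on the forward strand of the reference where the leftmost aligned base of the read falls (for a read aligned to the reverse strand, this is not the read's 5' end)
- Mapping quality
- CIGAR string describing the alignment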
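- For paired-end reads, the name of the reference sequence the mate aligned to; = if it is the same as this read's, * if there is no mate
- For paired-end reads, the 1-based coordinate on the forward strand of the reference where the mate's alignment starts; 0 if there is no mate
- Inferred fragment length between the two mates. The value is negative if the mate's alignment lies upstream of this alignment. The value is 0 if the mates did not align concordantly; it is still non-zero, however, if the mates aligned discordantly to the same chromosome.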
- Read sequence
- ASCII-encoded base qualities of the read
- Optional fields, including the following
AS:i:<N> Alignment score. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if SAM record is for an aligned read.
XS:i:<N> Alignment score for the best-scoring alignment found other than the alignment reported. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if the SAM record is for an aligned read and more than one alignment was found for the read. Note that, when the read is part of a concordantly-aligned pair, this score could be greater than AS:i.
YS:i:<N> Alignment score for opposite mate in the paired-end alignment. Only present if the SAM record is for a read that aligned as part of a paired-end alignment.
XN:i:<N> The number of ambiguous bases in the reference covering this alignment. Only present if SAM record is for an aligned read.
XM:i:<N> The number of mismatches in the alignment. Only present if SAM record is for an aligned read.
XO:i:<N> The number of gap opens, for both read and reference gaps, in the alignment. Only present if SAM record is for an aligned read.
XG:i:<N> The number of gap extensions, for both read and reference gaps, in the alignment. Only present if SAM record is for an aligned read.
NM:i:<N> The edit distance; that is, the minimal number of one-nucleotide edits (substitutions, insertions and deletions) needed to transform the read string into the reference string. Only present if SAM record is for an aligned read.
YF:Z:<S> String indicating reason why the read was filtered out. See also: Filtering. Only appears for reads that were filtered out.
YT:Z:<S> Value of UU indicates the read was not part of a pair. Value of CP indicates the read was part of a pair and the pair aligned concordantly. Value of DP indicates the read was part of a pair and the pair aligned discordantly. Value of UP indicates the read was part of a pair but the pair failed to align either concordantly or discordantly.
MD:Z:<S> A string representation of the mismatched reference bases in the alignment.
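The columns and optional fields described above can be read with a few lines of code. The sketch below splits one alignment line into named columns, decodes the flag sum using the eight flags listed earlier, and collects the optional TAG:TYPE:VALUE fields into a dictionary. The function names, dictionary keys and the example record are all made up for illustration; for real work a dedicated library such as pysam is the usual choice.

```python
# Flag bits and meanings as listed in the text above.
FLAG_MEANINGS = {
    1: "read is one of a pair",
    2: "alignment is one end of a proper paired-end alignment",
    4: "read has no reported alignments",
    8: "read is one of a pair and has no reported alignments",
    16: "alignment is to the reverse reference strand",
    32: "other mate is aligned to the reverse reference strand",
    64: "read is mate 1 in a pair",
    128: "read is mate 2 in a pair",
}

# A made-up record: mate 1 of a concordant pair, hypothetical coordinates.
EXAMPLE_LINE = (
    "read1\t99\tchr1\t10001\t42\t10M\t=\t10151\t160\tACGTACGTAC\tIIIIIIIIII"
    "\tAS:i:0\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:10\tYS:i:0\tYT:Z:CP"
)


def decode_flag(flag):
    """Return the meanings of all bits set in a SAM flag sum."""
    return [text for bit, text in FLAG_MEANINGS.items() if flag & bit]


def parse_optional_fields(fields):
    """Turn TAG:TYPE:VALUE strings into a dict; type 'i' becomes int."""
    tags = {}
    for field in fields:
        tag, type_code, value = field.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    return tags


def parse_sam_line(line):
    """Split one SAM alignment line into named columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("expected at least 11 tab-separated columns")
    return {
        "qname": cols[0], "flag": int(cols[1]), "rname": cols[2],
        "pos": int(cols[3]), "mapq": int(cols[4]), "cigar": cols[5],
        "rnext": cols[6], "pnext": int(cols[7]), "tlen": int(cols[8]),
        "seq": cols[9], "qual": cols[10],
        "tags": parse_optional_fields(cols[11:]),
    }


if __name__ == "__main__":
    record = parse_sam_line(EXAMPLE_LINE)
    print(record["qname"], record["rname"], record["pos"], record["tlen"])
    print(decode_flag(record["flag"]))
    print(record["tags"]["YT"])
```

In the example record the flag sum 99 is 1 + 2 + 32 + 64, so the read is mate 1 of a properly paired alignment whose mate maps to the reverse strand, which agrees with the YT:Z:CP field.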
That concludes this summary of genome alignment with Bowtie 2. I will add to it as I learn more.
ref:
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#how-is-bowtie-2-different-from-bowtie-1
The end.