Chapter_13 Short Read Aligners
Background
Definition: Short read aligners are commonly used software tools in bioinformatics, designed to align a very large number (billions) of short reads.
Short read aligners grew out of the needs created by the second-generation sequencing revolution of 2005. Before that, sequencing was expensive, so aligners of the time prioritised high accuracy (near-optimal alignments). Since the high-throughput revolution, the volume of biological data has grown exponentially. Faced with short reads (50-300 bp) at very high throughput, researchers began developing algorithms, and the software to go with them, that map short reads back to a genome both quickly and accurately. High-throughput aligners then sprang up like mushrooms after rain.
Difference between Mapping and Alignment
Mapping:
- A mapping is a region where a read sequence is placed
- A mapping is regarded to be correct if it overlaps the true region
Alignment:
- An alignment is the detailed placement of each base in a read.
- An alignment is regarded to be correct if each base is placed correctly.
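Mapping is about placing short reads quickly and accurately at a position on the genome: what matters is the location, not exact agreement between the sequences. Alignment is about finding the best match for every base of the query sequence against the target sequence. For example, detection of SNPs and structural variation (indels and so on) depends on alignment, whereas RNA-seq quantifies the reads that land on each gene (a coarser view) and depends on mapping.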
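The "overlaps the true region" criterion for a correct mapping can be sketched in a few lines of shell. This is only an illustration: `mapping_overlaps` is a hypothetical helper, and the coordinates are made-up 1-based inclusive intervals.

```shell
# Hypothetical helper: succeed if the mapped interval [$1,$2] overlaps
# the true interval [$3,$4] (1-based, inclusive coordinates assumed).
mapping_overlaps() {
    [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

# A read that truly comes from 140-190 but was placed at 100-150
# still overlaps the true region, so the mapping counts as correct.
if mapping_overlaps 100 150 140 190; then
    echo "correct mapping"
else
    echo "wrong mapping"
fi
```

A correct alignment is a stricter requirement: every base would have to be compared position by position, not just the interval ends.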
How to choose an aligner
It depends on the application: resequencing mostly uses bwa, while transcriptome work uses HISAT2, Bowtie, STAR and similar tools.
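BWA and Bowtie
BWA was developed by Heng Li and is one of the most widely used aligners. Its newest algorithm is mem (named for maximal exact matches). aln handles reads shorter than 100 bp; mem handles reads longer than 70 bp.
Bowtie was the first short-read aligner based on the Burrows-Wheeler algorithm. It comes in two versions, bowtie and bowtie2, for reads shorter than 50 bp and longer than 50 bp respectively.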
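As a rough illustration of the read-length rule of thumb above, the sketch below measures the mean read length of a FASTQ file and picks a bwa algorithm from it. Both helper functions are hypothetical, the 70 bp cut-off is an assumption taken from the text, and `demo.fq` is a made-up two-read file.

```shell
# Hypothetical helper: mean read length of a FASTQ file
# (assumes the plain 4-lines-per-record layout).
mean_read_length() {
    awk 'NR % 4 == 2 { total += length($0); n++ }
         END { if (n) printf "%d\n", total / n; else print 0 }' "$1"
}

# Hypothetical helper: choose a bwa algorithm from a read length.
# The 70 bp threshold is an assumption based on the rule of thumb.
choose_bwa_algorithm() {
    if [ "$1" -ge 70 ]; then echo mem; else echo aln; fi
}

# Two made-up 8 bp reads for illustration
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n' > demo.fq
choose_bwa_algorithm "$(mean_read_length demo.fq)"   # prints: aln
```

For reads in the 70-100 bp range either algorithm applies, so the cut-off here is a judgment call rather than a fixed rule.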
Alignment comes down to two basic steps:
- Build an index of the reference sequence
- Align the FASTA or FASTQ file against the index
### Get the EBOLA reference genome
efetch -db nuccore -id AF086833 -format fasta > ebola.fa
### Build the indexes
bwa index ebola.fa
bowtie2-build ebola.fa ebola.fa
### Download the experimental SRA reads
esearch -db sra -query PRJNA257197 | efetch -format runinfo > runinfo.csv
fastq-dump -X 10000 --split-files SRR1972739
### Align with default parameters
REF=ebola.fa
# fastq-dump --split-files writes <run>_1.fastq and <run>_2.fastq
R1=SRR1972739_1.fastq
R2=SRR1972739_2.fastq
bwa mem $REF $R1 $R2 > output.sam
bowtie2 -x $REF -1 $R1 -2 $R2 > bowtie_out.sam
bowtie2 --very-sensitive-local -x $REF -1 $R1 -2 $R2 > bowtie_out2.sam
bowtie2 -D 20 -R 3 -N 1 -L 20 -x $REF -1 $R1 -2 $R2 > bowtie_out3.sam
## Pipe into samtools to sort directly; add -@ for multiple threads
bowtie2 -x $REF -1 $R1 -2 $R2 | samtools sort > bowtie_out.sorted.bam
samtools index bowtie_out.sorted.bam
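To compare how the different parameter sets performed, it helps to count how many records in each SAM file were actually placed. The sketch below does this with plain awk by testing the 0x4 (unmapped) bit of the FLAG column. `count_mapped` is a hypothetical helper, and `demo.sam` is a tiny made-up file with an invented reference name, not real output.

```shell
# Hypothetical helper: count SAM records whose FLAG lacks the 0x4
# (unmapped) bit. Header lines starting with @ are skipped.
count_mapped() {
    awk '!/^@/ && int($2 / 4) % 2 == 0 { n++ } END { print n + 0 }' "$1"
}

# Tiny made-up SAM file: two placed records and one unmapped record
printf '@HD\tVN:1.6\n' > demo.sam
printf 'r1\t0\tebola\t10\t60\t5M\t*\t0\t0\tACGTA\tIIIII\n' >> demo.sam
printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII\n' >> demo.sam
printf 'r3\t16\tebola\t50\t60\t5M\t*\t0\t0\tACGTA\tIIIII\n' >> demo.sam
count_mapped demo.sam   # prints: 2
```

On the real files this would be `count_mapped output.sam`, `count_mapped bowtie_out.sam` and so on. Note that it counts alignment records, not reads, so secondary or supplementary records are included; `samtools flagstat` gives a fuller breakdown.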
- In bwa mem, note the scoring parameters, which configure the scoring matrix used for alignment. For third-generation data, -x ont2d or -x pacbio can be used.
- In bowtie2, note the --very-sensitive-local parameter.
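To make the scoring parameters concrete: bwa mem exposes them as -A (match score), -B (mismatch penalty), -O (gap open penalty) and -E (gap extension penalty), with documented defaults of 1, 4, 6 and 1, and a gap of length k costs O + k*E. The sketch below computes a score by hand under those defaults. `alignment_score` is a hypothetical helper, and the calculation is a simplification that ignores clipping and pairing penalties.

```shell
# Documented bwa mem defaults: match, mismatch, gap open, gap extension
A=1; B=4; O=6; E=1

# Hypothetical helper: score for $1 matches, $2 mismatches and a single
# gap of $3 bases (0 means no gap). Clipping penalties are ignored.
alignment_score() {
    matches=$1; mismatches=$2; gap_len=$3
    score=$(( A * matches - B * mismatches ))
    if [ "$gap_len" -gt 0 ]; then
        score=$(( score - (O + E * gap_len) ))
    fi
    echo "$score"
}

alignment_score 95 3 2   # 95 - 12 - 8 = 75
```

Raising B or O makes the aligner less tolerant of mismatches or gaps, which is the kind of trade-off these parameters control.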
Finally, the choice of aligner depends on the specific use case. When in doubt, go with the one most people use.