Dragon star Day 2 Pt.1 關(guān)于序列比對、計分方法、BLAST原理及判斷值、BWT原理

Dragon star Day 2 20190730

關(guān)于序列比對、計分方法、BLAST原理及判斷值、BWT原理

補丁筆記之終于上手畫了
-為什么這么難?
-因為你是指南針呀
-。


Dragonstar2019 by Kai Wang

  1. Alignment of short/long-read sequencing data
  2. Genome assembly by short/long-read sequencing

Part Ⅰ Alignment of short/long-read sequencing data

1 Sequence similarity

Sequence similarity is a measure of an empirical relationship between sequences.

A similarity score is therefore aimed to approximate the evolutionary distance between a pair of nucleotide or protein sequences.

Heringa, Jaap(Jul 2008) Sequence Similarity. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1002/9780470015902.a0005317.pub2]

2 Sequence alignment

  • Consider matches, mismatches, insertions and deletions
    (indels are gaps)
  • Find an optimal alignment between sequences
  • Applications
    • Reference-based read mapping
    • Genome assembly
    • Gene finding

2.1 Pairwise alignment

Sequence alignment between two sequences

  • Find an optimal alignment between two sequences

3 Dynamic programming

動態(tài)規(guī)劃(英語:Dynamic programming,簡稱DP)是一種在數(shù)學(xué)管理科學(xué)、計算機科學(xué)、經(jīng)濟學(xué)生物信息學(xué)中使用的,通過把原問題分解為相對簡單的子問題的方式求解復(fù)雜問題的方法。

動態(tài)規(guī)劃常常適用于有重疊子問題和最優(yōu)子結(jié)構(gòu)性質(zhì)的問題,動態(tài)規(guī)劃方法所耗時間往往遠少于樸素解法。

動態(tài)規(guī)劃背后的基本思想非常簡單。大致上,若要解一個給定問題,我們需要解其不同部分(即子問題),再根據(jù)子問題的解以得出原問題的解。

https://zh.wikipedia.org/wiki/動態(tài)規(guī)劃

  • Purpose
    • To find an alignment between two sequences with best matching scores
  • Three components
    • Recursive calculation - 遞歸計算
    • Tabular arrangement - 填充得分矩陣
    • Traceback - 追溯本源
  • Three common types of pairwise alignments
    • Global alignment: Needleman-Wunsch
    • Local alignment: Smith-Waterman
    • Semi-global alignment

3.1 Global alignment: Needleman-Wunsch

  • Best global alignment

    Have maximal alignment score with all bases in two sequences

  • Score function

3.1.1 Working of Needleman -Wunsch Algorithm

The dynamic programming matrix is defined with three different steps.

  1. Initialization of the matrix with the scores possible.
  2. Matrix filling with maximum scores.
  3. Trace back the residues for appropriate alignment.

vlab.amrita.edu,. (2012). Global alignment of two sequences - Needleman-Wunsch Algorithm. Retrieved 9 August 2019, from vlab.amrita.edu/?sub=3&brch=274&sim=1431&cnt=1

  • Assume we have two sequences: P and Q

    • P = TCATGGC
    • Q = TCATC
  • Initial matrix

  • Initialization Step

    Gap score is defined as penalty given to alignment, when we have insertion or deletion.

    This example assumes that there is gap penalty.

    從 Score function 可知空位罰分為1

    vlab.amrita.edu,. (2012). Global alignment of two sequences - Needleman-Wunsch Algorithm. Retrieved 9 August 2019, from vlab.amrita.edu/?sub=3&brch=274&sim=1431&cnt=1

  • Matrix Fill Step

    全部寫出來太亂了,就先杰樣吧

  • Trace back Step

    The final step in the algorithm is the trace back for the best alignment.

    By continuing the trace back step by the above defined method, one would reach to the 0th row, 0th column.

    The best alignment among the alignments can be identified by using the maximum alignment score which may be user defined.

    vlab.amrita.edu,. (2012). Global alignment of two sequences - Needleman-Wunsch Algorithm. Retrieved 9 August 2019, from vlab.amrita.edu/?sub=3&brch=274&sim=1431&cnt=1

    多種比對結(jié)果:

    purple line:
    TCATGGC
    ||||--|
    TCAT--C
    
    blue line:
    TCATGGC
    ||||-|-
    TCAT-C-
    
    yellow line:
    TCATGGC
    |||||--
    TCATC--
    

3.2 Local alignment: Smith-Waterman

史密斯-沃特曼算法(Smith-Waterman algorithm)是一種進行局部序列比對(相對于全局比對)的算法,用于找出兩個核苷酸序列或蛋白質(zhì)序列之間的相似區(qū)域。該算法的目的不是進行全序列的比對,而是找出兩個序列中具有高相似度的片段。

該算法由坦普爾·史密斯(Temple F. Smith)和邁克爾·沃特曼(Michael S. Waterman)于1981年提出。史密斯-沃特曼算法是尼德曼-翁施算法的一個變體,二者都是動態(tài)規(guī)劃算法。這一算法的優(yōu)勢在于可以在給定的打分方法下找出兩個序列的最優(yōu)的局部比對(打分方法使用了置換矩陣和空位罰分)。該算法和尼德曼-翁施算法的主要區(qū)別在于該算法不存在負分(負分被替換為零),因此局部比對成為可能。回溯從分數(shù)最高的矩陣元素開始,直到遇到分數(shù)為零的元素停止。分數(shù)最高的局部比對結(jié)果在此過程中產(chǎn)生。

https://zh.wikipedia.org/史密斯-沃特曼算法

  • Best local alignment

    Have maximal alignment score with a subset of bases in two sequences

  • Score function

3.2.1 Working of Smith-Waterman Algorithm

The basic steps for the algorithm are similar to that of Needleman-Wunsch algorithm. The steps are:

  1. Initialization of a matrix.
  2. Matrix Filling with the appropriate scores.
  3. Trace back the sequences for a suitable alignment.

vlab.amrita.edu,. (2012). Smith-Waterman Algorithm - Local Alignment of Sequences. Retrieved 9 August 2019, from vlab.amrita.edu/?sub=3&brch=274&sim=1433&cnt=1

可見步驟都是一致的,用同樣的序列和 score function 進行比對

  • Assume we have two sequences: P and Q

    • P = TCATGGC
    • Q = TCATC
  • Initial matrix

  • Initialization Step

  • Matrix Fill Step

  • Trace back the sequences for a suitable alignment

    1. Find a best score recursively
    2. Check which operation is used to obtain the current alignment score
    3. Stop when an alignment is 0

    所以最后應(yīng)該是這個亞子:

    TCAT
    ||||
    TCAT
    

4 BLAST: The seed-index-map-extend-merge strategy

The BLAST = Basic Local Alignment Search Tool algorithm is a heuristic for computing optimal "local alignments" between a query sequence and a database containing one or more subject sequences.

https://ab.inf.uni-tuebingen.de/teaching/ws2012/bioinf1/03-BLAST-BLAT.pdf

4.1 BLAST Basics

  • BLAST is used for sequence similarity search, and much faster than Smith-Waterman method.
  • Compare a query sequence against a database of sequences.
  • BLAST is a collection of algorithms
    • BLASTN: nucleotide sequence against nucleotide database
    • BLASTP: protein sequence against protein database
    • BLASTX: translated nucleotide sequence against protein database
    • TBLASTX: six-frame translations of a nucleotide query sequence
      against the six-frame translations of a nucleotide sequence database
    • TBLASTN: protein sequence against translated nucleotide database

4.2 Process: The seed-index-map-extend-merge strategy

Most popular algorithms use a seed-and-extend approach that operates in two steps:

  1. Find a set of short exact matches called seeds.
  2. Try to extend each seed match to obtain a long inexact match

These matching words are called seeds and BLAST then attempts to extend each seed to a HSP (High-Scoring Segment Pair).

https://ab.inf.uni-tuebingen.de/teaching/ws2012/bioinf1/03-BLAST-BLAT.pdf>

4.2.1 The Process
  • Seed and index

    • Construct common words (k-mers) for sequences in a database

    • Assume the database has two sequences and k=3

      A “k-mer” is a word of DNA that is k long.

      Typically we extract k-mers from genomic assemblies or read data sets by running a k-length window across all of the reads and sequences.

      k-mers are most useful when they’re long, because then they’re specific.

      The important concept here is that long k-mers are species specific.

      https://angus.readthedocs.io/en/2017/kmers-and-sourmash.html

  • Assume all 3-mers are high-score words

    BLAST needs a threshold to determine high-score words.

  • Seed-map with high-score words

    Obtain words from a query sequence

  • Extend-merge

    • Extend: high-scoring segment

    • BLAST output: 2 alignment containing match information and E-value for each alignment

      E-value 的解釋見 4.2.2

4.2.2 Statistical significance of alignment
  • How to assess the significance of a high-scoring hit to the database?

    One crucial feature of the BLAST algorithm is that it gives an estimation of the statistical signficance of a determined HSP. This is based on so-called Karlin-Altschul statistics for local alignments (without gaps), which involves the Poisson distribution and other concepts.

    即存在 E-value, P-value

    https://ab.inf.uni-tuebingen.de/teaching/ws2012/bioinf1/03-BLAST-BLAT.pdf>

    E is the number of alignments expected by chance during a database search.

  • Karlin-Altschul equation: E(S) = Kmne^{-λS}

    • Score (S) from scoring matrix
    • ?? and λ are two constants
    • ??: the length of a query sequence
    • ??: the sum of bases of all sequences in database
    • The lower E value indicates more significant alignment.

    E value represents the number of matches with score ≥ S that one would expect to see between two random sequences of lengths n and m.

    https://ab.inf.uni-tuebingen.de/teaching/ws2012/bioinf1/03-BLAST-BLAT.pdf>

  • How to assess the P-value of finding a high-scoring pair (HSP)?

    • The number of random HSPs with score >= S is described by a Poisson distribution

    • The probability of finding exactly k HSPs with score >=S is given by P(X = k) = e^{?E} *E^k/k!

    • The probability of finding at least one HSP by chance is

      P(S) = 1 ? P(X = 0) = 1 ? e^{?E}

  • E-value vs P-value

    • E = ? ln(1 ? P)
    • When E and P are very small, they are almost identical

5 BWT

Technically, it is a lexicographical reversible permutation of the characters of a string.

The most important application of BWT is found in biological sciences where genomes(long strings written in A, C, T, G alphabets) don’t have many runs but they do have many repeats.

https://www.geeksforgeeks.org/burrows-wheeler-data-transform-algorithm/

5.1 Bowtie/BWA

Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).

http://bowtie-bio.sourceforge.net/index.shtml

BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM.

http://bio-bwa.sourceforge.net/

BWA and Bowtie are the most famous implementations of BWT in sequence alignment.

5.2 BWT procedures

  • Given a string S = “TCATC”, BWT

    • Adds $ and assume $<any alphabet 在最后加上$
    • Obtains a suffix array of all cyclic rotations 輪轉(zhuǎn)后綴矩陣
  • Cyclic rotation

    • Keeps an index of the rotated strings in the array 前面的數(shù)字
    • Creates Circular Permutation Table (CPT) 表格本身
  • Sorting

    • Sorts the Circular Permutation Table (CPT) alphabetically
    • Keep the index with the strings 數(shù)字不變,橫向為單位
    • Taking the last column from the sorted array
  • BWT output

    • The last column represents the transformed string

5.3 BWT reverse transformation

With a suffix array, it is easy to recover the original string S


最后,向大家隆重推薦生信技能樹的一系列干貨!

  1. 生信技能樹全球公益巡講:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
  2. B站公益74小時生信工程師教學(xué)視頻合輯:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
  3. 招學(xué)徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容