The FASTQ format stores short-read sequences and their Phred quality scores from an NGS platform in a single file.
Every four lines represent one short read.

[Figure: example FASTQ record]
Four lines per FASTQ record:
1. @ marks the sequence id, i.e. the description line (in the example above, the id is longer than the sequence itself).

[Figure: eg2.png]
Typically, an instrument usage count between 200 and 9999 is considered suitable.
2. The sequence content of the read: the called bases A/G/T/C/N, where N denotes a base that could not be determined.
3. + optionally repeats the sequence id (often left empty).
4. The quality string (quality assessment).
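The four-line layout can be sketched as a small reader. This is a minimal illustration, not a replacement for an established parser: the names `FastqRecord` and `read_fastq` are hypothetical, and the sketch assumes each sequence and quality string sits on exactly one line.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    seq_id: str    # line 1, without the leading '@'
    sequence: str  # line 2, the called bases
    plus: str      # line 3, without the leading '+' (often empty)
    quality: str   # line 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every four lines of a FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip blank lines between records
        header = header.rstrip("\r\n")
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, plus[1:], quality)


# Two made-up records, used only to exercise the reader.
example = io.StringIO("@read1\nGATTN\n+\nII5!#\n@read2\nACGT\n+read2\nIIII\n")
records = list(read_fastq(example))
for record in records:
    print(record.seq_id, record.sequence, record.quality)
```

The length check reflects the fact that the quality string carries one character per base.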
A quality score is a number.
One character encodes one number via the ASCII table.
A quality score represents an error probability.
Quality scores are used to represent base calling accuracy, alignment accuracy and other probabilities.
If quality were written as plain numbers, any value with two or more digits would break the one-position-per-base correspondence. The quality numbers are therefore converted to single characters, and the FASTQ quality string uses the ASCII table for this.

[Figure: ASCII table]
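The single-character idea can be sketched with Python's built-in `chr` and `ord`. The helper names `encode_scores` and `decode_scores` are hypothetical, the scores are made-up values, and the sketch assumes the scale starts at ASCII character 33, as this section does.

```python
OFFSET = 33  # assumption: the scale starts at ASCII character 33 ('!')


def encode_scores(scores):
    """Turn per-base scores into a string with one character per base."""
    return "".join(chr(score + OFFSET) for score in scores)


def decode_scores(quality):
    """Recover the per-base scores from a quality string."""
    return [ord(char) - OFFSET for char in quality]


scores = [40, 40, 20, 0, 2]       # written as digits these need 1 or 2 positions each
quality = encode_scores(scores)
print(quality)                    # II5!#
print(decode_scores(quality))     # [40, 40, 20, 0, 2]
```

Five scores become exactly five characters, so position i of the quality string always belongs to base i of the read.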
The number can be converted to a probability with the following formula:
P = 10^(-(Q - 33)/10)
The scale starts at character 33, so 33 is subtracted from Q, the ASCII code of the quality character.
The quality value (Q) ranges from 33 to 126.
The corresponding characters range from ‘!’ to ‘~’.
Currently, most NGS platforms only produce quality values (Q) in the range from 33 to 73 (from ‘!’ to ‘I’).
The corresponding P values range from 10^0 to 10^-4 (from 1 to 0.0001).
For example:
Suppose the quality string contains a ‘!’:
In the ASCII table, ‘!’ corresponds to the value 33. Substituting it into the formula for P gives P = 10^(-(33 - 33)/10) = 10^0 = 1.
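The same calculation can be sketched for any quality character. The function name `error_probability` is hypothetical. Following this section's convention, Q is the ASCII code of the character, and the offset of 33 is assumed.

```python
def error_probability(char: str) -> float:
    """Apply P = 10^(-(Q - 33)/10), where Q is the ASCII code of char."""
    q = ord(char)
    return 10 ** (-(q - 33) / 10)


# Characters spaced 10 apart in the ASCII table, from '!' (33) to 'I' (73).
for char in "!+5?I":
    print(char, ord(char), error_probability(char))
```

Each step of 10 in Q lowers the error probability by a factor of 10, so ‘!’ gives P = 1 and ‘I’ gives P of about 0.0001.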
Various formats for NGS data:
Input data (raw data): .fasta, .fastq (.SRA)
Annotation data: .gff, .gtf, .bed
Alignment result: .sam, .bam, .wig, .bed
Variant call result: .vcf
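The grouping above can be sketched as a lookup table. The names `FORMAT_CATEGORIES` and `categories_for` are hypothetical. The sketch only inspects the final file extension, so a compressed file such as `reads.fastq.gz` is not recognised.

```python
from pathlib import Path
from typing import List

# Extension groups copied from the list above; .bed appears in two groups.
FORMAT_CATEGORIES = {
    "input data (raw data)": {".fasta", ".fastq", ".sra"},
    "annotation data": {".gff", ".gtf", ".bed"},
    "alignment result": {".sam", ".bam", ".wig", ".bed"},
    "variant call result": {".vcf"},
}


def categories_for(filename: str) -> List[str]:
    """Return every category whose extension set contains the file's suffix."""
    suffix = Path(filename).suffix.lower()
    return [name for name, suffixes in FORMAT_CATEGORIES.items() if suffix in suffixes]


print(categories_for("sample.vcf"))   # ['variant call result']
print(categories_for("regions.bed"))  # ['annotation data', 'alignment result']
```

Returning a list rather than a single label keeps the `.bed` overlap visible.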