bam 文件介紹
https://samtools.github.io/hts-specs/SAMv1.pdf
http://bio-bwa.sourceforge.net/bwa.shtml
首先看看bam
| Col | Field | Description |
|---|---|---|
| 1 | QNAME | Query (pair) NAME |
| 2 | FLAG | bitwise FLAG |
| 3 | RNAME | Reference sequence NAME |
| 4 | POS | 1-based leftmost POSition/coordinate of clipped sequence |
| 5 | MAPQ | MAPping Quality (Phred-scaled) |
| 6 | CIAGR | extended CIGAR string |
| 7 | MRNM | Mate Reference sequence NaMe (‘=’ if same as RNAME) |
| 8 | MPOS | 1-based Mate POSistion |
| 9 | ISIZE | Inferred insert SIZE |
| 10 | SEQ | query SEQuence on the same strand as the reference |
| 11 | QUAL | query QUALity (ASCII-33 gives the Phred base quality) |
| 12 | OPT | variable OPTional fields in the format TAG:VTYPE:VALUE |
介紹如下:
QNAME 序列name,來源于fastq文件
-
flag 見:samtools flags 有解釋
SE測序,flag為0好像也表示匹配
https://www.biostars.org/p/7374/
When the flag field is 0, it means none of the bitwise flags specified in the SAM spec (on page 4) are set. That means that your reads with flag 0 are unpaired (because the first flag, 0x1, is not set), successfully mapped to the reference (because 0x4 is not set) and mapped to the forward strand (because 0x10 is not set). Summarizing your data, the reads with flag 4 are unmapped, the reads with flag 0 are mapped to the forward strand and the reads with flag 16 are mapped to the reverse strand.
SE 的flag:
samtools view HG2492W/HG2492W.bam | awk '{print $2}' | sort | uniq -c
1597504 0
1599538 16
5794 256
5756 272
219660 4
samtools view HG2492H/HG2492H.bam | awk '{print $2}' | sort | uniq -c
424935 0
429939 16
1278 256
1114 272
3316 4
PE的flag 比較復(fù)雜,一般是組合的情況:
cat flag | sort | uniq -c
2912 113
415 117
679 121
2836 129
700 133
373 137
2357 141
5073 145
1188332 147
5035 161
1172704 163
2912 177
679 181
415 185
132 321
29 323
38 329
294 337
99 339
278 353
62 355
184 369
39 371
42 377
146 385
47 387
45 393
292 401
69 403
259 417
70 419
193 433
50 435
37 441
2836 65
373 69
700 73
2357 77
5035 81
1172704 83
5073 97
1188332 99
查flags網(wǎng)站:https://broadinstitute.github.io/picard/explain-flags.html
參考解釋:https://www.cnblogs.com/xudongliang/p/5437850.html
3.RNAME 表示參考序列的名稱,比如基因組的染色體編號等,如果沒有比對上則顯示為*
4.POS 表示比對的起始位置,以1開始計數(shù),如果沒有比對上則顯示為0;從5’開始
5.MAPQ
該值的計算方法是mapping的錯誤率的-10log10值,之后四舍五入得到的整數(shù),如果值為255表示mapping值是不可用的,如果是unmapped read則MAPQ為0,一般在使用bwa mem或bwa aln(bwa 0.7.12-r1039版本)生成的sam文件,第五列為60表示mapping率最高,一般結(jié)果是這一列的數(shù)值是從0到60,且0和60這兩個數(shù)字出現(xiàn)次數(shù)最多。該值越大說明越可靠。
R 計算MAPQ
已知P計算MAPQ
-10log10(0.01)
[1] 20
?
-10log10(0.1)
[1] 10
?
已知MAPQ,計算 Pr
?
10(0/-10)
[1] 1
10(1/-10)
[1] 0.7943282
10(2/-10)
[1] 0.6309573
10(3/-10)
[1] 0.5011872
10(4/-10)
[1] 0.3981072
10(5/-10)
[1] 0.3162278
10(6/-10)
[1] 0.2511886
10(7/-10)
[1] 0.1995262
10(8/-10)
[1] 0.1584893
10(9/-10)
[1] 0.1258925
10**(10/-10)
[1] 0.1</pre>
| Pr(mapping position is wrong) | MAPQ |
|---|---|
| 1 | 0 |
| 0.7943282 | 1 |
| 0.3162278 | 5 |
| 0.1 | 10 |
| 0.01 | 20 |
| 0.001 | 30 |
| 0.0001 | 40 |
6.CIGAR 即比對的詳細(xì)情況,簡要比對信息表達式(Compact Idiosyncratic Gapped Alignment Report),其以參考序列為基礎(chǔ),使用數(shù)字加字母表示比對結(jié)果,比如3S6M1P1I4M,前三個堿基被剪切去除了,然后6個比對上了,然后打開了一 個缺口,有一個堿基插入,最后是4個比對上了,是按照順序的;
7.RNEXT PE測序中下一個reads比對的參考系列的名稱,如果沒有則用 " * " 表示,如果和前一個reads比對到同一個參考序列則用" = "表示,SE 是" * "
8.PNEXT PE測序下一個reads比對到參考序列上的位置,如果沒有則用0表示;
9.TLEM query序列的模板長度或者插入長度,Template的長度,最左邊得為正+,最右邊的為負(fù) -,中間的不用 定義正負(fù),不分區(qū)段(single-segment)的比對上,或者不可用時,表示為0;
SEQ reads的序列信息;
QUAL reads的序列質(zhì)量信息,同F(xiàn)ASTQ。
可選字段(optional fields),格式如:TAG:TYPE:VALUE,其中TAG有兩個大寫字母組成,每個TAG代表一類信息,每一行一個TAG只能出現(xiàn)一次,TYPE表示TAG對應(yīng)值的類型,可以是字符串、整數(shù)、字節(jié)、數(shù)組等。
TPNB500301:64:HM3T7AFXY:3:21404:5781:1306 16 chr1 9996 1 4S37M * 0 0 CTCTTCCGATAACCCTAACCCTAACCCTAACCCTAGCCCTA E42DC3=3=@@2DD26@>DD6>@>DD3@@>CD6>D9@><<< BD:Z:KKJJJJJKJJJKIKJJJKIKJJJKIKJJJKIKJKMLJKJJJ MD:Z:31A5 PG:Z:MarkDuplicates RG:Z:FAMILY_HG2493W BI:Z:OPOPPOOPPOOPNOOOOPNOOOOPNOOOOPNOOPPPNOOOO NM:i:1 AS:i:32 XS:i:30
TPNB500301:64:HM3T7AFXY:3:21401:22875:18010 16 chr1 10547 0 11S64M * 0 0 TCCCCCCGACCTCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACG ?44D4D=3<41CE=D3EC@DDC3D2@>=EEDEECD=EDDEC=E=CD=EECECD=CCDCE=D=EE<ECCECB<77< BD:Z:JIIIIJJKKKKKLJJLLKJJJKJJKJKJLKLMKJJJKIKKKJJJKJLMKKKJJJIKLKLJJLMLJJJKKKLJKJJ MD:Z:13C50 PG:Z:MarkDuplicates RG:Z:FAMILY_HG2493W BI:Z:PNNNNOOPPOOPQNOPPPONPONOPOOOPPPPOPOOPNOOOOOOPOPPOPOPOOOPPPQNOPPQPNPONOPOOOO NM:i:1 AS:i:59 XS:i:59