簡(jiǎn)介:
IgBlast是NCBI設(shè)計(jì)開發(fā)的一種專一的blast工具,特定用于比對(duì)抗體( immunoglobulin ,IG)或T細(xì)胞受體( T cell receptor,TR)序列。
為什么要專門開發(fā)一個(gè)比對(duì)軟件呢?他跟blastn等有什么區(qū)別,這得從IG和TR的結(jié)構(gòu)說起,
IG和TR的結(jié)構(gòu)類似,都是由2條輕鏈和2條重鏈構(gòu)成,每條鏈可以分為可變區(qū)(variable domain)和恒定區(qū)(constant domain)??勺儏^(qū)還可以進(jìn)一步分為骨架區(qū)(FR)和互補(bǔ)作用區(qū)(CDR)。
IG或TR識(shí)別抗原的關(guān)鍵在于可變區(qū)的高度可變性,這種可變性是由「基因重排」機(jī)制產(chǎn)生。
1,下載igblast程序和其他必要文件:
下載地址:https://ftp.ncbi.nih.gov/blast/executables/igblast/release/LATEST,
從下面地址下載 internal_data 和 optional_file 整個(gè)目錄,這兩個(gè)目錄也是必需的: https://ftp.ncbi.nih.gov/blast/executables/igblast/release/
2,下載IMGT數(shù)據(jù)庫(kù),或者TCR數(shù)據(jù)庫(kù)
/database/IMGT_V-QUEST_reference_directory/Homo_sapiens/TR
3,igblastn具體使用例子:
igblast數(shù)據(jù)比對(duì)的input文件必須是fasta格式。所以如果你的數(shù)據(jù)是fastq格式,需要先轉(zhuǎn)一下格式(seqkit fq2fa)
igblastn -query test.fa \
-out test.out \
-outfmt 7 \
-germline_db_V ./database/human_TR_V \
-germline_db_J ./database/human_TR_J \
-germline_db_D ./database/human_TR_D \
-auxiliary_data ./optional_file/human_gl.aux \
-organism human \
-domain_system imgt \
-ig_seqtype TCR \
-num_threads 4 \
-min_D_match 5 \
-num_alignments_V 3 \
-num_alignments_D 3 \
-num_alignments_J 3
發(fā)現(xiàn)了一個(gè)新的軟件:igblastwrapper
https://github.com/mikessh/migmap
Motivation
IgBlast is an excellent of V-(D)-J mapping tool able to correctly map even severely hypermutated antibody variants. While being a gold standard, the following limitations of IgBlast v1.4.0 have driven MIGMAP development:
It doesn't extract sequence of CDR3 region directly, neither provide coordinates for CDR3 region in reads. It reports reference Cys residue of Variable segment and Variable segment end in CDR3, but not Phe/Trp residue of J segment that marks the end of CDR3
Output is not straightforward to parse and summarize to a readable clonotype abundance table containing CDR3 sequences, segment assignments and list of somatic hypermutations
It doesn't account for sequence quality
It is somewhat hard to make it running with a custom segment reference and species other than human and mouse
# 查看幫助文檔
java -jar /software/biosoftware/migmap-1.0.3/migmap-1.0.3.jar -h
java -jar /software/biosoftware/migmap-1.0.3/migmap-1.0.3.jar -R TRA,TRB -S human --use-kabat -q 70 ../02.align/RP01G9E1L3_R1.fq.gz migmap.out.txt
java -jar /software/biosoftware/migmap-1.0.3/migmap-1.0.3.jar -R TRA,TRB -S human --use-kabat -q 70 --illegal-access ../02.align/RP01G9E1L3_R1.fq.gz migmap.out.txt
java -jar /software/biosoftware/migmap-1.0.3/migmap-1.0.3.jar -R TRA,TRB -S human --by-read --data-dir ./ --use-kabat --allow-incomplete --details ../02.align/RP01G9E1L3_R1.fq.gz migmap.out.txt
都沒有成功,先試一下測(cè)試文件吧
cd /software/biosoftware/migmap-1.0.3
./migmap -R TRA,TRB -S human ./test/sample.fasta.gz migmap.out.txt
成功
回到工作目錄:
不再使用,忽略掉。
發(fā)現(xiàn) igblastn網(wǎng)頁(yè)版和本地版不一致情況
igblastn -query test.igblast.fa \
-out test.fmt.out -outfmt 7 \
-germline_db_V ./database/human_TR_V \
-germline_db_J ./database/human_TR_J \
-germline_db_D ./database/human_TR_D \
-auxiliary_data ./optional_file/human_gl.aux \
-organism human \
-domain_system imgt \
-ig_seqtype TCR \
-min_D_match 5 \
-num_alignments_V 3 \
-num_alignments_D 3 \
-num_alignments_J 3
本地版參數(shù):
igblastn -query test.fa \
-out test.out \
-outfmt 7 \
-germline_db_V ./database/human_TR_V \
-germline_db_J ./database/human_TR_J \
-germline_db_D ./database/human_TR_D \
-auxiliary_data ./optional_file/human_gl.aux \
-organism human \
-domain_system imgt \
-ig_seqtype TCR \
-num_threads 4 \
-min_D_match 5 \
-num_alignments_V 3 \
-num_alignments_D 3 \
-num_alignments_J 3
本地版結(jié)果:
# IGBLASTN
# Query: E00509:314:HNTL2CCXY:4:1101:8024:1924 1:N:0:CTCTCTAC
# Database: ./database/human_TR_V ./database/human_TR_D ./database/human_TR_J
# Domain classification requested: imgt
# Note that your query represents the minus strand of a V gene and has been converted to the plus strand. The sequence positions refer to the converted sequence.
# V-(D)-J rearrangement summary for query sequence (Top V gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand). Multiple equivalent top matches, if present, are separated by a comma.
TRAV26-2*01 TRAJ33*01,TRAJ53*01 VA No Out-of-frame No -
# V-(D)-J junction details based on top germline gene matches (V end, V-J junction, J start). Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either rearranging gene) are indicated
GGCAA N/A AGCAA
# Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
FR3-IMGT 84 106 23 19 4 0 82.6
Total N/A N/A 23 19 4 0 82.6
# Hit table (the first field indicates the chain type of the hit)
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, gaps, q. start, q. end, s. start, s. end, evalue, bit score
# 6 hits found
V reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRAV26-2*01 82.609 23 4 0 0 84 106 168 190 0.20 25.2
V reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRBV23/OR9-2*01 80.000 20 4 0 0 128 147 35 16 5.2 20.5
V reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRBV23/OR9-2*02 80.000 20 4 0 0 128 147 35 16 5.2 20.5
J reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRAJ33*01 100.000 8 0 0 0 107 114 6 13 8.1 16.1
J reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRAJ53*01 100.000 8 0 0 0 107 114 15 22 8.1 16.1
J reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRAJ18*01 100.000 7 0 0 0 141 147 14 20 31 14.1
網(wǎng)頁(yè)版結(jié)果:使用網(wǎng)頁(yè)上默認(rèn)的參數(shù)。
# IGBLASTN
# Query: E00509:314:HNTL2CCXY:4:1101:8024:1924 1:N:0:CTCTCTAC
# Database: IG_DB/imgt.TR.Homo_sapiens.V.f.orf.p IG_DB/imgt.TR.Homo_sapiens.D.f.orf IG_DB/imgt.TR.Homo_sapiens.J.f.orf.p
# Domain classification requested: imgt
# Note that your query represents the minus strand of a V gene and has been converted to the plus strand. The sequence positions refer to the converted sequence.
# V-(D)-J rearrangement summary for query sequence (Top V gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand). Multiple equivalent top matches, if present, are separated by a comma.
TRAV26-2*01 TRAJ33*01,TRAJ53*01 VA No Out-of-frame No -
# V-(D)-J junction details based on top germline gene matches (V end, V-J junction, J start). Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either rearranging gene) are indicated in parentheses (i.e., (TACT)) but are not included under the V, D, or J gene itself
GGCAA N/A AGCAA
# Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
FR3-IMGT 84 106 23 19 4 0 82.6
Total N/A N/A 23 19 4 0 82.6
# Hit table (the first field indicates the chain type of the hit)
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, gaps, q. start, q. end, s. start, s. end, evalue, bit score
# 6 hits found
V reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRAV26-2*01 82.609 23 4 0 0 84 106 168 190 0.22 25.2
V reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRBV7-9*04 92.308 13 1 0 0 138 150 109 121 16 19.0
V reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRBV15*01 100.000 11 0 0 0 135 145 13 23 16 19.0
J reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRAJ33*01 100.000 8 0 0 0 107 114 6 13 9.0 16.1
J reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRAJ53*01 100.000 8 0 0 0 107 114 15 22 9.0 16.1
J reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRAJ18*01 100.000 7 0 0 0 141 147 14 20 34 14.1
不一致的地方在結(jié)果倒數(shù)的第四,五行,比對(duì)結(jié)果不一致。
開始懷疑是設(shè)置evalue值的問題,于是本地版設(shè)置了evalue 20
igblastn -query one.fa -out one.out \
-germline_db_V ./database/human_TR_V \
-germline_db_J ./database/human_TR_J \
-germline_db_D ./database/human_TR_D \
-auxiliary_data ./optional_file/human_gl.aux \
-organism human -domain_system imgt \
-ig_seqtype TCR -min_D_match 5 \
-evalue 20 -show_translation \
-num_clonotype 100 \
-D_penalty -2 \
-J_penalty -2 \
-min_V_length 9 \
-min_J_length 0 \
-num_alignments_V 5 \
-num_alignments_D 5 \
-num_alignments_J 5
結(jié)果一致了,
最終版參數(shù):
igblastn -query xxx.fa \
-out outfile.igblast.fmt7.out \
-outfmt 7 \
-germline_db_V ./database/human_TR_V \
-germline_db_J ./database/human_TR_J \
-germline_db_D ./database/human_TR_D \
-auxiliary_data ./optional_file/human_gl.aux \
-organism human -domain_system imgt \
-ig_seqtype TCR \
-num_threads 70 \
-min_D_match 5 \
-evalue 20 \
-show_translation \
-num_clonotype 100 \
-D_penalty -2 \
-J_penalty -2 \
-min_V_length 9 \
-min_J_length 0 \
-num_alignments_V 3 \
-num_alignments_D 3 \
-num_alignments_J 3
總結(jié),網(wǎng)頁(yè)版和本地版的主要不一樣是evalue不一樣,造成的原因
首先是參數(shù),另一個(gè)是database導(dǎo)致的search space不一樣.
目前igblastn分析方法:
第一步比對(duì):
igblastn -query b0005p1_S1_L001_R1_001.fa \
-out b0005p1_S1_L001_R1_001.fa.fmt7.out \
-outfmt 7 \
-germline_db_V ./database/human_TR_V \
-germline_db_J ./database/human_TR_J \
-germline_db_D ./database/human_TR_D \
-auxiliary_data ./optional_file/human_gl.aux \
-organism human -domain_system imgt \
-ig_seqtype TCR -num_threads 10 \
-min_D_match 5 \
-evalue 20 \
-show_translation \
-num_clonotype 100 \
-D_penalty -2 \
-J_penalty -2 \
-min_V_length 9 \
-min_J_length 0 \
-num_alignments_V 3 \
-num_alignments_D 3 \
-num_alignments_J 3
第二步,將比對(duì)結(jié)果過濾,提取VDJ信息:過濾evalue大于1e-10,
cat *_S1_L001_R1*.fmt7.out |grep -E "^V|^J|^D" |awk '$(NF-1)<1e-10' >S1.R1.igblast.fmt7.out
第三步,統(tǒng)計(jì)每條read上比對(duì)的VDJ個(gè)數(shù),自己寫的腳本。
python deal_igblast.py S1.R1.igblast.fmt7.out S1.R1.igblast.4to1.total.out
第四步, 根據(jù)上一步結(jié)果統(tǒng)計(jì)VDJ組合的結(jié)果,這一步根據(jù)需要。
python filter_igblast_result.py S1.R1.igblast.4to1.total.out S1.R1.out.all.result.txt S1.R1.out.4gene.result.txt S1.R1.out.3gene.result.txt S1.R1.out.2gene.result.txt S1.R1.out.1gene.result.txt
可以統(tǒng)計(jì)出各種組合的信息:
gene_in_reads_4: 0
gene_in_reads_3: 61612
AVAJBV AVAJBJ AVBVBJ AJBVBJ
61612 0 0 0
gene_in_reads_2: 282005
AVAJ AVBV AVBJ AJBV AJBJ BVBJ
99723 26009 0 0 0 156273
gene_in_reads_1: 2234397
AV AJ BV BJ
3605 150 2230039 603
total calculate reads: 2578014
如果是PE read,比對(duì)是需要分開做的,在統(tǒng)計(jì)的時(shí)候可以一起統(tǒng)計(jì)。