igblastn使用

簡(jiǎn)介:
IgBlast是NCBI設(shè)計(jì)開發(fā)的一種專一的blast工具,特定用于比對(duì)抗體( immunoglobulin ,IG)或T細(xì)胞受體( T cell receptor,TR)序列。
為什么要專門開發(fā)一個(gè)比對(duì)軟件呢?他跟blastn等有什么區(qū)別,這得從IG和TR的結(jié)構(gòu)說起,

IG和TR的結(jié)構(gòu)類似,都是由2條輕鏈和2條重鏈構(gòu)成,每條鏈可以分為可變區(qū)(variable domain)和恒定區(qū)(constant domain)??勺儏^(qū)還可以進(jìn)一步分為骨架區(qū)(FR)和互補(bǔ)作用區(qū)(CDR)。
IG或TR識(shí)別抗原的關(guān)鍵在于可變區(qū)的高度可變性,這種可變性是由「基因重排」機(jī)制產(chǎn)生。

1,下載igblast程序和其他必要文件:
下載地址:https://ftp.ncbi.nih.gov/blast/executables/igblast/release/LATEST,
從下面地址下載 internal_data 和 optional_file 整個(gè)目錄,這兩個(gè)目錄也是必需的: https://ftp.ncbi.nih.gov/blast/executables/igblast/release/

2,下載IMGT數(shù)據(jù)庫(kù),或者TCR數(shù)據(jù)庫(kù)
/database/IMGT_V-QUEST_reference_directory/Homo_sapiens/TR

3,igblastn具體使用例子:
igblast數(shù)據(jù)比對(duì)的input文件必須是fasta格式。所以如果你的數(shù)據(jù)是fastq格式,需要先轉(zhuǎn)一下格式(seqkit fq2fa)

igblastn -query test.fa \
  -out test.out \
  -outfmt 7 \
  -germline_db_V ./database/human_TR_V \
  -germline_db_J ./database/human_TR_J \
  -germline_db_D ./database/human_TR_D \
  -auxiliary_data ./optional_file/human_gl.aux  \
  -organism human \
  -domain_system imgt  \
  -ig_seqtype TCR \
  -num_threads 4 \
  -min_D_match 5  \
  -num_alignments_V 3 \ 
  -num_alignments_D 3 \
  -num_alignments_J 3

發(fā)現(xiàn)了一個(gè)新的軟件:igblastwrapper
https://github.com/mikessh/migmap
Motivation
IgBlast is an excellent of V-(D)-J mapping tool able to correctly map even severely hypermutated antibody variants. While being a gold standard, the following limitations of IgBlast v1.4.0 have driven MIGMAP development:

  • It doesn't extract sequence of CDR3 region directly, neither provide coordinates for CDR3 region in reads. It reports reference Cys residue of Variable segment and Variable segment end in CDR3, but not Phe/Trp residue of J segment that marks the end of CDR3

  • Output is not straightforward to parse and summarize to a readable clonotype abundance table containing CDR3 sequences, segment assignments and list of somatic hypermutations

  • It doesn't account for sequence quality

  • It is somewhat hard to make it running with a custom segment reference and species other than human and mouse

# 查看幫助文檔

java -jar /software/biosoftware/migmap-1.0.3/migmap-1.0.3.jar -h
java -jar /software/biosoftware/migmap-1.0.3/migmap-1.0.3.jar -R TRA,TRB -S human --use-kabat -q 70 ../02.align/RP01G9E1L3_R1.fq.gz migmap.out.txt
java -jar /software/biosoftware/migmap-1.0.3/migmap-1.0.3.jar -R TRA,TRB -S human --use-kabat -q 70 --illegal-access ../02.align/RP01G9E1L3_R1.fq.gz migmap.out.txt
java -jar /software/biosoftware/migmap-1.0.3/migmap-1.0.3.jar -R TRA,TRB -S human --by-read --data-dir ./ --use-kabat --allow-incomplete --details  ../02.align/RP01G9E1L3_R1.fq.gz migmap.out.txt
都沒有成功,先試一下測(cè)試文件吧
cd /software/biosoftware/migmap-1.0.3
./migmap -R TRA,TRB -S human ./test/sample.fasta.gz migmap.out.txt
成功
回到工作目錄:
不再使用,忽略掉。

發(fā)現(xiàn) igblastn網(wǎng)頁(yè)版和本地版不一致情況

igblastn -query test.igblast.fa \
-out test.fmt.out -outfmt 7 \
-germline_db_V ./database/human_TR_V \
-germline_db_J ./database/human_TR_J \
-germline_db_D ./database/human_TR_D \
-auxiliary_data ./optional_file/human_gl.aux \
-organism human \
-domain_system imgt \
-ig_seqtype TCR  \
-min_D_match 5  \
-num_alignments_V 3 \
-num_alignments_D 3 \
-num_alignments_J 3

本地版參數(shù):

igblastn -query test.fa \
  -out test.out \
  -outfmt 7 \
  -germline_db_V ./database/human_TR_V \
  -germline_db_J ./database/human_TR_J \
  -germline_db_D ./database/human_TR_D \
  -auxiliary_data ./optional_file/human_gl.aux  \
  -organism human \
  -domain_system imgt  \
  -ig_seqtype TCR \
  -num_threads 4 \
  -min_D_match 5  \
  -num_alignments_V 3 \
  -num_alignments_D 3 \
  -num_alignments_J 3

本地版結(jié)果:

# IGBLASTN
# Query: E00509:314:HNTL2CCXY:4:1101:8024:1924 1:N:0:CTCTCTAC
# Database: ./database/human_TR_V ./database/human_TR_D ./database/human_TR_J
# Domain classification requested: imgt

# Note that your query represents the minus strand of a V gene and has been converted to the plus strand. The sequence positions refer to the converted sequence.

# V-(D)-J rearrangement summary for query sequence (Top V gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand).  Multiple equivalent top matches, if present, are separated by a comma.
TRAV26-2*01     TRAJ33*01,TRAJ53*01     VA      No      Out-of-frame    No      -

# V-(D)-J junction details based on top germline gene matches (V end, V-J junction, J start).  Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either rearranging gene) are indicated
GGCAA   N/A     AGCAA

# Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
FR3-IMGT        84      106     23      19      4       0       82.6
Total   N/A     N/A     23      19      4       0       82.6

# Hit table (the first field indicates the chain type of the hit)
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, gaps, q. start, q. end, s. start, s. end, evalue, bit score
# 6 hits found
V       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRAV26-2*01     82.609  23      4       0       0       84      106     168     190     0.20    25.2
V       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRBV23/OR9-2*01 80.000  20      4       0       0       128     147     35      16      5.2     20.5
V       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRBV23/OR9-2*02 80.000  20      4       0       0       128     147     35      16      5.2     20.5
J       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRAJ33*01       100.000 8       0       0       0       107     114     6       13      8.1     16.1
J       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRAJ53*01       100.000 8       0       0       0       107     114     15      22      8.1     16.1
J       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRAJ18*01       100.000 7       0       0       0       141     147     14      20      31      14.1

網(wǎng)頁(yè)版結(jié)果:使用網(wǎng)頁(yè)上默認(rèn)的參數(shù)。

# IGBLASTN
# Query: E00509:314:HNTL2CCXY:4:1101:8024:1924 1:N:0:CTCTCTAC
# Database: IG_DB/imgt.TR.Homo_sapiens.V.f.orf.p IG_DB/imgt.TR.Homo_sapiens.D.f.orf IG_DB/imgt.TR.Homo_sapiens.J.f.orf.p
# Domain classification requested: imgt

# Note that your query represents the minus strand of a V gene and has been converted to the plus strand. The sequence positions refer to the converted sequence.

# V-(D)-J rearrangement summary for query sequence (Top V gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand).  Multiple equivalent top matches, if present, are separated by a comma.
TRAV26-2*01     TRAJ33*01,TRAJ53*01     VA      No      Out-of-frame    No      -

# V-(D)-J junction details based on top germline gene matches (V end, V-J junction, J start).  Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either rearranging gene) are indicated in parentheses (i.e., (TACT)) but are not included under the V, D, or J gene itself
GGCAA   N/A     AGCAA   

# Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
FR3-IMGT        84      106     23      19      4       0       82.6
Total   N/A     N/A     23      19      4       0       82.6

# Hit table (the first field indicates the chain type of the hit)
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, gaps, q. start, q. end, s. start, s. end, evalue, bit score
# 6 hits found
V       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRAV26-2*01     82.609  23      4       0       0       84      106     168     190     0.22    25.2
V       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRBV7-9*04      92.308  13      1       0       0       138     150     109     121     16      19.0
V       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRBV15*01       100.000 11      0       0       0       135     145     13      23      16      19.0
J       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRAJ33*01       100.000 8       0       0       0       107     114     6       13      9.0     16.1
J       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRAJ53*01       100.000 8       0       0       0       107     114     15      22      9.0     16.1
J       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRAJ18*01       100.000 7       0       0       0       141     147     14      20      34      14.1

不一致的地方在結(jié)果倒數(shù)的第四,五行,比對(duì)結(jié)果不一致。

開始懷疑是設(shè)置evalue值的問題,于是本地版設(shè)置了evalue 20

igblastn -query one.fa -out one.out  \
  -germline_db_V ./database/human_TR_V \
  -germline_db_J ./database/human_TR_J \
  -germline_db_D ./database/human_TR_D \
  -auxiliary_data ./optional_file/human_gl.aux  \
  -organism human -domain_system imgt  \
  -ig_seqtype TCR  -min_D_match 5  \
  -evalue 20 -show_translation \
  -num_clonotype 100 \
  -D_penalty -2 \
  -J_penalty -2 \
  -min_V_length 9 \
  -min_J_length 0 \
  -num_alignments_V 5 \
  -num_alignments_D 5 \
  -num_alignments_J 5

結(jié)果一致了,

最終版參數(shù):

igblastn -query xxx.fa \
  -out outfile.igblast.fmt7.out \
  -outfmt 7 \
  -germline_db_V ./database/human_TR_V \
  -germline_db_J ./database/human_TR_J \
  -germline_db_D ./database/human_TR_D \
  -auxiliary_data ./optional_file/human_gl.aux  \
  -organism human -domain_system imgt  \
  -ig_seqtype TCR \
  -num_threads 70 \
  -min_D_match 5 \
  -evalue 20 \
  -show_translation \
  -num_clonotype 100 \
  -D_penalty -2 \
  -J_penalty -2 \
  -min_V_length 9 \
  -min_J_length 0  \
  -num_alignments_V 3 \
  -num_alignments_D 3 \
  -num_alignments_J 3 

總結(jié),網(wǎng)頁(yè)版和本地版的主要不一樣是evalue不一樣,造成的原因
首先是參數(shù),另一個(gè)是database導(dǎo)致的search space不一樣.

目前igblastn分析方法:

第一步比對(duì):

igblastn -query b0005p1_S1_L001_R1_001.fa  \
  -out b0005p1_S1_L001_R1_001.fa.fmt7.out \
  -outfmt 7 \
  -germline_db_V ./database/human_TR_V \
  -germline_db_J ./database/human_TR_J \
  -germline_db_D ./database/human_TR_D \
  -auxiliary_data ./optional_file/human_gl.aux  \
  -organism human -domain_system imgt \
  -ig_seqtype TCR -num_threads 10 \
  -min_D_match 5 \
  -evalue 20 \
  -show_translation \
  -num_clonotype 100 \
  -D_penalty -2 \
  -J_penalty -2 \
  -min_V_length 9 \
  -min_J_length 0 \
  -num_alignments_V 3 \
  -num_alignments_D 3 \
  -num_alignments_J 3

第二步,將比對(duì)結(jié)果過濾,提取VDJ信息:過濾evalue大于1e-10,
cat *_S1_L001_R1*.fmt7.out |grep -E "^V|^J|^D" |awk '$(NF-1)<1e-10' >S1.R1.igblast.fmt7.out

第三步,統(tǒng)計(jì)每條read上比對(duì)的VDJ個(gè)數(shù),自己寫的腳本。
python deal_igblast.py S1.R1.igblast.fmt7.out S1.R1.igblast.4to1.total.out

第四步, 根據(jù)上一步結(jié)果統(tǒng)計(jì)VDJ組合的結(jié)果,這一步根據(jù)需要。

python  filter_igblast_result.py S1.R1.igblast.4to1.total.out S1.R1.out.all.result.txt S1.R1.out.4gene.result.txt S1.R1.out.3gene.result.txt S1.R1.out.2gene.result.txt S1.R1.out.1gene.result.txt
可以統(tǒng)計(jì)出各種組合的信息:
gene_in_reads_4: 0
gene_in_reads_3: 61612
AVAJBV  AVAJBJ  AVBVBJ  AJBVBJ
61612   0       0       0
gene_in_reads_2: 282005
AVAJ    AVBV    AVBJ    AJBV    AJBJ    BVBJ
99723   26009   0       0       0       156273
gene_in_reads_1: 2234397
AV      AJ      BV      BJ
3605    150     2230039 603
total calculate reads: 2578014

如果是PE read,比對(duì)是需要分開做的,在統(tǒng)計(jì)的時(shí)候可以一起統(tǒng)計(jì)。

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容