生成netMHCpan輸入文件

不能直接輸入vcf真的很煩啊,氨基酸序列fasta太長(zhǎng)了還會(huì)直接Segmentation Fault

過濾獲得PASS的vcf

我只用了gatk里的af-onl-gnomAD文件過濾應(yīng)該還可以annovar基于篩選的注釋,然后用vcftools之類的東西篩選一下

vcftools --gzvcf /path/to/sample.filter.vcf.gz --remove-filtered-all --recode --stdout | gzip -c > /path/to/sample_PASS_only.vcf.gz

(應(yīng)該還可以annovar基于篩選的注釋再篩選一下,但是那樣就需要把a(bǔ)nnovar生成的vcf用convert2annovar.pl轉(zhuǎn)換成avinput格式再用annotate_variation.pl進(jìn)行refGene注釋再coding_change.pl轉(zhuǎn)換成氨基酸序列,配對(duì)vcf包含正常組織和腫瘤組織至少2個(gè)樣本,convert2annovar.pl會(huì)讓選擇全都轉(zhuǎn)換成分開的文件還是只轉(zhuǎn)換第1個(gè),腫瘤是第2個(gè)及以后的,所以如果這樣的話可能需要?jiǎng)h掉正常樣本那一列?)

cat input.vcf |awk -v OFS='\t' '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$11}' > output.vcf

annovar轉(zhuǎn)換成氨基酸fasta

annovartable_annovar.pl注釋的時(shí)候不要選--remove,就可以保留中間文件sample.exonic_variant_function

coding_change.pl --includesnp --onlyAltering path/to/sample.exonic_variant_function /path/to/annovar/humandb/hg38_refGene.txt /path/to/annovar/humandb/hg38_refGeneMrna.fa --outfile /path/to/outdir/sample.fasta

轉(zhuǎn)換fasta序列

修改序列ID格式

sed -i 's/>line.*NM_/>NM_/g' /path/to/outdir/*fasta
sed -i 's/ /;/g' /path/to/outdir/*fasta

去掉不需要的序列,此處參考NeoPredPipe腳本vcf_manipulate.py,不直接用這個(gè)軟件是因?yàn)閳?bào)錯(cuò)太多不想處理了-_- |||

#!/usr/bin/env python
import os
from Bio import SeqIO
# 指定需要處理的文件夾路徑
folder_path = '/path/to/outdir/fastaFiles_1/'
id_list = ['WILDTYPE', 'immediate-stopgain', 'stop-loss']
# 遍歷文件夾中的所有 fasta 文件
for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)
        # 打開文件并讀取所有序列
        records = list(SeqIO.parse(file_path, 'fasta'))
        # 遍歷每個(gè)序列,如果 ID 不包含上述關(guān)鍵詞,將其保留到新列表中
        filtered_records = []
        for record in records:
                if not any(keyword in record.id for keyword in id_list):
                        filtered_records.append(record)
        # 將篩選后的序列寫回文件
        if len(filtered_records) > 0:  # 如果有篩選結(jié)果才更新文件
                with open(file_path, 'w') as f:
                        SeqIO.write(filtered_records, f, 'fasta')

節(jié)選突變附近的序列,此處參考NeoPredPipe腳本NeoPredPipe.py

#!/usr/bin/env python
import sys
import os
import subprocess
from Bio import SeqIO

folder_path = '/public/home/xieruoqi/data/1.CR/neoantigen/fastaFiles_1/'

def ExtractSeq(record, pos, n, frameshift=False):
    '''
    Extracts the proper range of amino acids from the sequence and the epitope length
    :param seq: Sequence of mutated amino acid sequence.
    :param pos: Location of altered amino acid.
    :param n: Length of epitope for predictions.
    :return: Returns an amino acid sequence of appropriate lengths.
    '''
seq = str(record.seq)
    if pos<n: # Cases where start is less than n
        miniseq = seq[0:pos+(n)]
    elif len(seq)-pos < n: # Cases where end is less than n
        miniseq = seq[pos - (n - 1):len(seq)]
    else:
        miniseq = seq[pos-(n-1):pos+(n)]

    if frameshift:
        if pos<n:
            miniseq = seq[0:len(seq)] # When start is not n away from pos, we go until the end of the sequence
        else:
            miniseq = seq[pos-(n-1):len(seq)] # All other cases, we still go until the end of the sequence

    return(miniseq)

epitopeLens = [8,9,10,11]
for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)
        eps = {n: 0 for n in epitopeLens}
        epsIndels = {n: 0 for n in epitopeLens}
        for n in epitopeLens:
                mySeqs = []
                mySeqsIndels = []
                for record in SeqIO.parse(file_path, 'fasta'):
                        if 'silent' not in record.id.lower() and 'startloss' not in record.id.lower():
                                try:
                                        pos = int(record.id.replace(";;",";").split(";")[5].split('-')[0])-1
                                except ValueError:
                                        try:
                                                pos = int(record.id.replace(";;",";").split(";")[6].split('-')[0])-1
                                        except IndexError:
                                                sys.exit("ERROR: Could not process fasta line in reformatted fasta: %s" % (record.id))
                                if 'dup' in record.id.lower() or 'del' in record.id.lower() or 'ins' in record.id.lower() or 'fs' in record.id:
                                        miniseq = ExtractSeq(record, pos, n, True)
                                        mySeqsIndels.append(">"+record.id[0:100]+"\n"+miniseq+"\n")
                                else:
                                        miniseq = ExtractSeq(record, pos, n)
                                        mySeqs.append(">"+record.id+"\n"+miniseq+"\n")
                eps[n] = mySeqs
                epsIndels[n] = mySeqsIndels
        tmpFiles = {}
        for n in epitopeLens:
                tmpFasta = file_path.replace(".fasta",".tmp.%s.fasta"%(n))
                tmpFastaIndels = file_path.replace(".fasta",".tmp.%s.Indels.fasta"%(n))
                tmpFiles.update({n:tmpFasta})
                tmpFiles.update({str(n)+'.Indels':tmpFastaIndels})
                with open(tmpFasta, 'w') as outFile:
                        for line in eps[n]:
                                outFile.write(line)
                with open(tmpFastaIndels, 'w') as outFile:
                        for line in epsIndels[n]:
                                outFile.write(line)

HLA分型提取及格式轉(zhuǎn)換

先提取可能包含6號(hào)染色體HLA區(qū)域reads,不要用腫瘤樣本,用正常組織樣本
然后選一個(gè)HLA分型預(yù)測(cè)軟件,這里用的是OptiType
大多數(shù)軟件都要求輸入fq

samtools view -hb /path/to/sample_blood.final.bam \
                chr6:28510120-33480577 \
                chr6_GL000250v2_alt:1066038-4433734 \
                chr6_GL000251v2_alt:1283988-4540572 \
                chr6_GL000252v2_alt:1063230-4372611 \
                chr6_GL000253v2_alt:1062914-4548533 \
                chr6_GL000254v2_alt:1062887-4416229 \
                chr6_GL000255v2_alt:1063190-4323464 \
                chr6_GL000256v2_alt:1106450-4577757 \
                > /path/to/sample_blood.chr6.bam && \
        samtools view -hb -@ 8 -f 12 /path/to/sample_blood.final.bam \
                > /path/to/sample_blood.unmapped.bam && \
        samtools merge /path/to/sample_blood.merge.bam /path/to/sample_blood.chr6.bam /path/to/sample_blood.unmapped.bam && \
        samtools sort -n /path/to/sample_blood.merge.bam -@ 8 -o /path/to/sample_blood.hla.bam && \
        samtools fastq -@ 8 /path/to/sample_blood.hla.bam \
                -1 /path/to/sample_blood.hla.1.fq \
                -2 /path/to/sample_blood.hla.2.fq \
                -s /dev/null && \
        rm /path/to/sample_blood.chr6.bam /path/to/sample_blood.unmapped.bam /path/to/sample_blood.merge.bam && \
        OptiTypePipeline.py -i /path/to/sample_blood.hla.1.fq /path/to/sample_blood.hla.2.fq --dna -v -o /path/to/sample

然后把輸出的tsv文件轉(zhuǎn)換為netMHCpan需要的格式

cat /path/to/sample/*/*.tsv |awk -v OFS=',' 'NR==2 {print $2,$3,$4,$5,$6,$7}' > /path/to/sample/sample.HLA.txt && \
        sed -i "s/,/,HLA-/g" /path/to/sample/sample.HLA.txt && \
        sed -i "s/*//g" /path/to/sample/sample.HLA.txt && \
        sed -i "s/^/HLA-/g" /path/to/sample/sample.HLA.txt

然后就可以順利運(yùn)行netMHCpan啦,不過我還是不太理解為什么netMHCpan預(yù)測(cè)的Affinity對(duì)不上BindLevel

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

  • 一、簡(jiǎn)介 會(huì)得到一系列變異數(shù)據(jù),這些變異數(shù)據(jù)只是告訴我們?cè)诨蚪M的某個(gè)位置發(fā)生了一段序列的改變,至于這個(gè)改變會(huì)不會(huì)...
    生信師姐閱讀 20,278評(píng)論 1 41
  • 結(jié)果文件的解讀 輸出文件1:*.variant_function 第一個(gè)文件包含所有變異的注釋,方法是在每個(gè)輸入行...
    生信師姐閱讀 21,932評(píng)論 2 42
  • 一 結(jié)果文件說明 1 VCF (Variant Call Format)是儲(chǔ)存Variation結(jié)果的文件格式 ...
    九月_1012閱讀 30,572評(píng)論 4 35
  • annovar是一款常用的注釋軟件,可在其官網(wǎng)注冊(cè)后下載。annovar無需安裝,下載后解壓即可直接使用。anno...
    井底蛙蛙呱呱呱閱讀 15,835評(píng)論 1 16
  • 上次我們整理到bwa比對(duì)后得到bam文件,下一步我們要通過GATK流程從bam文件中call variant。 一...
    商乙農(nóng)林科技閱讀 2,202評(píng)論 0 4

友情鏈接更多精彩內(nèi)容