解決的問題
根據(jù)癌癥病人的腫瘤組織和正常組織的測(cè)序數(shù)據(jù),使用一系列軟件對(duì)其進(jìn)行數(shù)據(jù)處理和分析,找出病人的體細(xì)胞變異(主要是SNP和indel),然后對(duì)變異進(jìn)行注釋與解讀,并根據(jù)變異信息精準(zhǔn)地推薦可能對(duì)該名患者療效較好的藥物。
前期準(zhǔn)備工作
首先是搭建 somatic SNP+indel calling pipeline
使用Trimmomatic進(jìn)行reads過濾,BWA進(jìn)行reads比對(duì),Samtools進(jìn)行bam文件排序與建索引等操作,GATK4.0.11.0進(jìn)行去重復(fù)和堿基質(zhì)量分?jǐn)?shù)校正。由于后續(xù)用GATK Mutect2進(jìn)行變異分析,而Mutect2類似HaplotypeCaller,會(huì)進(jìn)行局部重比對(duì),故在此處的數(shù)據(jù)預(yù)處理部分省略了Indel局部重比對(duì)這步。
注:參考基因組數(shù)據(jù)(hg38)和一些輔助數(shù)據(jù)來自GATK Resource Bundle.
師兄參考堿基礦工寫的測(cè)序數(shù)據(jù)預(yù)處理shell腳本(preprocess.sh)如下:
(Markdown語法顯示shell腳本時(shí)語法高亮似乎有些小問題)
#!/bin/bash
# set some paths
tool_path="$HOME/diploma_project/Tools"
data_path="$HOME/diploma_project/Data"
trimmomatic="${tool_path}/Trimmomatic-0.38/trimmomatic-0.38.jar"
trimmomatic_path=${trimmomatic%/*}
bwa="${tool_path}/bwa-0.7.17/bwa"
samtools="${tool_path}/samtools/bin/samtools"
gatk="${tool_path}/gatk-4.0.11.0/gatk"
#reference download from "gatk bundle" website
reference="${data_path}/reference/hg38"
GATK_bundle="${data_path}/GATK_bundle/hg38_from_ck"
# $gatk IndexFeatureFile --feature-file $GATK_bundle/hapmap_3.3.hg38.vcf
# $gatk IndexFeatureFile --feature-file $GATK_bundle/1000G_omni2.5.hg38.vcf
# $gatk IndexFeatureFile --feature-file $GATK_bundle/1000G_phase1.snps.high_confidence.hg38.vcf
# $gatk IndexFeatureFile --feature-file $GATK_bundle/Mills_and_1000G_gold_standard.indels.hg38.vcf
# $gatk IndexFeatureFile --feature-file $GATK_bundle/dbsnp_146.hg38.vcf
# pay attention to "read group header"(GATK website)
if [ $# -lt 6 ]; then #num_of_parameter < 6
echo "usage: $0 fq1 fq2 Read_Group_ID(Lane ID) library sample_ID outdir [number_threads]"
exit 1
fi
fq1=$1 #normal / cancer
fq2=$2 #normal / cancer
RGID=$3 ## Read Group ID,generally can be replaced by Lane ID
library=$4 ## Sequencing library ID
sample=$5 ## Sample ID
outdir=$6 ## Output directory path
RGPL="ILLUMINA"
RGPU=$RGID
#set number of threads
if [ -n "$7" ]; then #if [ $# -lt 7 ]
nt=$7
else
nt=4
fi
## create folder named by the sample
outdir=${outdir}/${sample}
## Acquire the filename of the fastq file, assuming that fq1 and fq2 have the same prefix name
## Get rid of the path prefix
fq_file_name=`basename $fq1`
## Get rid of the suffix,only keep the filename. Consider two possibilities(match and delete the char after %%)
fq_file_name=${fq_file_name%%.R1.fq.gz}
fq_file_name=${fq_file_name%%.R1.fastq.gz}
# output diretory
if [ ! -d $outdir/cleanfq ]; then
mkdir -p $outdir/cleanfq
fi
if [ ! -d $outdir/bwa ]; then
mkdir -p $outdir/bwa
fi
if [ ! -d $outdir/gatk ]; then
mkdir -p $outdir/gatk
fi
echo -e "\n"
echo "RUN info"
echo "fastq1 : $(basename $fq1)"
echo "fastq2 : $(basename $fq2)"
echo "sample ID : $sample"
echo "output dir : $outdir"
echo "threads : $nt"
echo -e "\n*** Started at $(date +'%T %F') ***\n"
## Perform QC to the raw reads using Trimmomatic, where an important parameter, keepBothReads is set True in the ILLUMINACLIP step.
if [ ! -e $outdir/cleanfq/${fq_file_name}.unpaired.2.fq.gz ]; then
java -jar ${trimmomatic} PE \
-threads $nt \
$fq1 $fq2 \
$outdir/cleanfq/${fq_file_name}.paired.1.fq.gz \
$outdir/cleanfq/${fq_file_name}.unpaired.1.fq.gz \
$outdir/cleanfq/${fq_file_name}.paired.2.fq.gz \
$outdir/cleanfq/${fq_file_name}.unpaired.2.fq.gz \
ILLUMINACLIP:$trimmomatic_path/adapters/TruSeq3-PE-2.fa:2:30:10:8:True \
SLIDINGWINDOW:5:15 LEADING:5 TRAILING:5 MINLEN:50 && echo -e "\n*** fq QC done at $(date +'%T %F') ***\n"
fi
## Use bwa mem to align reads
## -M : Mark shorter split hits as secondary (for Picard compatibility) -Y : soft clipping
## Use samtools to convert sam file to bam file (Samtools is designed to work on a stream. It regards an input file '-' as the standard input)
$bwa mem -t $nt -M -Y \
-R "@RG\tID:$RGID\tPL:$RGPL\tPU:$RGPU\tLB:$library\tSM:$sample" \
$reference/Homo_sapiens_assembly38.fasta \
$outdir/cleanfq/${fq_file_name}.paired.1.fq.gz \
$outdir/cleanfq/${fq_file_name}.paired.2.fq.gz | \
$samtools view -Sb - > $outdir/bwa/${sample}.bam && \
echo -e "\n*** BWA MEM done at $(date +'%T %F') ***\n" && \
# $samtools sort -@ 4 -m 4G -O bam -o $outdir/bwa/${sample}.sorted.bam \
$samtools sort -@ 2 -m 2G -O bam -o $outdir/bwa/${sample}.sorted.bam \
$outdir/bwa/${sample}.bam && echo -e "\n*** sorted raw bamfile done at $(date +'%T %F') ***\n"
## Identifies duplicate reads -M File to write duplication metrics to
$gatk MarkDuplicates \
-I $outdir/bwa/${sample}.sorted.bam \
-M $outdir/bwa/${sample}.markdup_metrics.txt \
-O $outdir/bwa/${sample}.sorted.markdup.bam && echo -e "\n*** ${sample}.sorted.bam MarkDuplicates done at $(date +'%T %F') ***\n" || exit 1
## index a coordinate-sorted BAM file for fast random access. Output .bam.bai file
$samtools index $outdir/bwa/${sample}.sorted.markdup.bam && \
echo -e "\n*** ${sample}.sorted.markdup.bam index done at $(date +'%T %F') ***\n" || exit 1
# Maybe need local realignment?
# Base quality score recalibration(BQSR)
# Does your vcf file have an index? GATK4 does not support on the fly indexing of VCFs anymore.
# Detect systematic errors in base quality scores
$gatk BaseRecalibrator \
-R $reference/Homo_sapiens_assembly38.fasta\
-I $outdir/bwa/${sample}.sorted.markdup.bam \
--known-sites $GATK_bundle/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--known-sites $GATK_bundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--known-sites $GATK_bundle/dbsnp_146.hg38.vcf.gz \
-O $outdir/bwa/${sample}.sorted.markdup.recal_data.table && \
echo -e "\n*** ${sample}.sorted.markdup.recal_data.table creation done at $(date +'%T %F') ***\n" || exit 1
# Recalibrate the base qualities of the input reads based on the recalibration table produced by the BaseRecalibrator tool, and outputs a recalibrated BAM or CRAM file.
$gatk ApplyBQSR \
--bqsr-recal-file $outdir/bwa/${sample}.sorted.markdup.recal_data.table \
-R $reference/Homo_sapiens_assembly38.fasta \
-I $outdir/bwa/${sample}.sorted.markdup.bam \
-O $outdir/bwa/${sample}.sorted.markdup.BQSR.bam && \
echo -e "\n*** ApplyBQSR done at $(date +'%T %F') ***\n" || exit 1
$samtools index $outdir/bwa/${sample}.sorted.markdup.BQSR.bam && \
echo -e "\n*** ${sample}.sorted.markdup.BQSR.bam index done at $(date +'%T %F') ***\n"
# remove some useless file, only keep ${sample}.sorted.markdup.BQSR.bam
rm -f $outdir/bwa/${sample}.bam $outdir/bwa/${sample}.sorted.bam $outdir/bwa/${sample}.sorted.markdup.bam
預(yù)處理過后,使用Mutect2并行地查找變異,然后使用GetPileupSummaries,CalculateContamination和FilterMutectCalls過濾變異,得到最終結(jié)果。
師兄參考GATK官網(wǎng)和網(wǎng)上某博客寫的somatic variants calling的shell腳本(call_somatic.sh)如下:
#!/bin/bash
# set some paths
tool_path="$HOME/diploma_project/Tools"
data_path="$HOME/diploma_project/Data"
gatk="${tool_path}/gatk-4.0.11.0/gatk"
# buid_version="hg38"
reference="${data_path}/reference/hg38"
GATK_bundle="${data_path}/GATK_bundle/hg38_from_ck"
prog=$0 #script name including path
bn=$(basename $0)
bin_dir=${prog%\/$bn} #same as bin_dir=${prog%/*}
bam_SM="$bin_dir/bam_SM.py" #bam_SM.py is a script that recognize the sample name from a bam file
# echo "bam_SM: $bam_SM"
if [ $# -lt 4 ]; then
echo "usage: $0 sample_name normal_bam tumor_bam outdir"
exit 1
fi
sample=$1 #e.g. humanxianshi
normal_bam=$2
tumor_bam=$3
outdir=$4
outdir=${outdir%/} #get rid of the last "/" if exist
if [ ! -d $outdir ]; then
mkdir $outdir
fi
if [ ! -d $outdir/chromosomes ]; then
mkdir $outdir/chromosomes
fi
if [ ! -d $outdir/other ]; then
mkdir $outdir/other
fi
echo -e "\n***Output to $outdir***\n"
echo -e "***Started at $(date +"%T %F")***\n"
chroms="chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY"
for i in $chroms; do
$gatk --java-options "-Xmx1G" Mutect2 \
-R $reference/Homo_sapiens_assembly38.fasta \
-I $tumor_bam \
-tumor $($bam_SM $tumor_bam) \
-L $i \
-I $normal_bam \
-normal $($bam_SM $normal_bam) \
-pon $GATK_bundle/somatic-hg38-1000g_pon.hg38.vcf.gz \
--germline-resource $GATK_bundle/af-only-gnomad.hg38.vcf.gz \
--af-of-alleles-not-in-resource 0.00003125 \
-O $outdir/chromosomes/${sample}_somatic.${i}.vcf.gz && \
echo -e "\n*** Mutect2 $outdir/chromosomes/${sample}_somatic.${i}.vcf.gz done at $(date +"%T %F") ***\n" || exit 1 & #put the process background
done && wait #wait until all the process finish
merge_vcfs_cmd=""
for i in $chroms; do
merge_vcfs_cmd=${merge_vcfs_cmd}"-I $outdir/chromosomes/${sample}_somatic.${i}.vcf.gz "
done && $gatk MergeVcfs ${merge_vcfs_cmd} -O $outdir/${sample}_somatic_unfiltered.vcf.gz && \
echo -e "\n*** MergeVcfs ${outdir}/${sample}_somatic_unfiltered.vcf.gz done at $(date +"%T %F") ***\n" || exit 1
## Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination.This estimation informs downstream filtering by FilterMutectCalls.
$gatk --java-options "-Xmx4G" GetPileupSummaries \
-I $tumor_bam \
-L $GATK_bundle/small_exac_common_3.hg38.vcf.gz \
-V $GATK_bundle/small_exac_common_3.hg38.vcf.gz \
-O $outdir/other/${sample}_tumor_getpileupsummaries.table && \
echo -e "\n*** GetPileupSummaries done at $(date +"%T %F") ***\n" || exit 1
$gatk --java-options "-Xmx4G" CalculateContamination \
-I $outdir/other/${sample}_tumor_getpileupsummaries.table \
-O $outdir/other/${sample}_tumor_calculatecontamination.table && \
echo -e "\n*** CalculateContamination done at $(date +"%T %F") ***\n" || exit 1
## Filter for confident somatic calls using FilterMutectCalls
## This produces a VCF callset and index. Calls that are likely true positives get the PASS label in the FILTER field,
## and calls that are likely false positives are labeled with the reason(s) for filtering in the FILTER field of the VCF.
$gatk --java-options "-Xmx4G" FilterMutectCalls \
-V $outdir/${sample}_somatic_unfiltered.vcf.gz \
--contamination-table $outdir/other/${sample}_tumor_calculatecontamination.table \
-O $outdir/${sample}_somatic_oncefiltered.vcf.gz && \
echo -e "\n*** FilterMutectCalls done at $(date +"%T %F") ***\n" || exit 1
## Select a subset of variants from a VCF file
$gatk SelectVariants \
-V $outdir/${sample}_somatic_oncefiltered.vcf.gz \
-O $outdir/${sample}_somatic_oncefiltered.PASS.vcf.gz \
--exclude-filtered && \
echo -e "\n*** SelectVariants: $outdir/${sample}_somatic_oncefiltered.PASS.vcf.gz done at $(date +"%T %F") ***\n"
小插曲
當(dāng)我使用IMPACT上的測(cè)試數(shù)據(jù)跑通了這兩個(gè)腳本后,開始學(xué)習(xí)GATK關(guān)于CNV calling的相關(guān)軟件的文檔,準(zhǔn)備搭建CNV calling pipeline。然而老師說CNV對(duì)藥物推薦的作用相對(duì)較小,不太重要,可以不做。重點(diǎn)應(yīng)轉(zhuǎn)向評(píng)測(cè)這個(gè)pipeline的準(zhǔn)確率如何。
老師讓我模仿DeepVariant的benchmark方法進(jìn)行評(píng)測(cè)??赐暾撐陌l(fā)現(xiàn)DeepVariant研究的是從正常人的測(cè)序數(shù)據(jù)中尋找變異,用的變異數(shù)據(jù)金標(biāo)準(zhǔn)均為germline variants,而我研究的是somatic variants,故不能適用。后來發(fā)現(xiàn)可以在正常樣本中人工加入變異構(gòu)造模擬的腫瘤樣本,從而得到金標(biāo)準(zhǔn)。然而老師說不要用模擬數(shù)據(jù),要用真實(shí)的數(shù)據(jù)。雖然我發(fā)現(xiàn)好像現(xiàn)在很多的somatic variant calling algorithm的benchmark都是用模擬數(shù)據(jù)進(jìn)行的。
Benchmarking somatic mutation calling pipeline
Data source: TCGA GDC Data Portal
主要目的:TCGA官網(wǎng)上有預(yù)處理過的normal-tumor配對(duì)的bam文件以及對(duì)應(yīng)的用某個(gè)軟件(如Mutect2)得到的含體細(xì)胞變異信息的vcf文件,使用目前較流行的RTG-Tools的vcfeval模塊,可將我得到的vcf與官方提供的vcf進(jìn)行比較,從而初步判斷我的pipeline是否正確地搭建。
12.10
從三種疾病中分別挑選一個(gè)樣本進(jìn)行評(píng)測(cè):
- Lung Squamous Cell Carcinoma (LUSC): TCGA-58-8391
- Head and Neck Squamous Cell Carcinoma (HNSC): TCGA-IQ-A61H
- Sarcoma (SARC): TCGA-DX-A7ET
按照上述腳本,使用Mutect2 找出變異,經(jīng)過過濾后,與TCGA上的vcf比較,F(xiàn)值最高的是肺鱗狀細(xì)胞瘤(LUSC)樣本,約為0.9;頭頸鱗狀細(xì)胞瘤(HNSC)樣本F值為0.86;F值最低為肉瘤(SARC)樣本,約為0.72。
12.11
下載并測(cè)試了另一個(gè)肉瘤樣本TCGA-Z4-A9VC(記作SARC2),這次F值僅有約0.61。
為什么肉瘤樣本的準(zhǔn)確率會(huì)特別低呢?
LUSC樣本TCGA上的vcf中PASS變異有974個(gè);HNSC樣本TCGA上的vcf中PASS變異有367個(gè);SARC 1號(hào)樣本TCGA上的vcf中PASS變異僅有135個(gè),而SARC 2號(hào)樣本僅有77個(gè)。
由于SARC 1號(hào)和2號(hào)的樣本太小,測(cè)出的F值可能不太可靠,應(yīng)該下載一個(gè)具有較多變異的肉瘤樣本重新測(cè)試。
12.12
下載了第三個(gè)肉瘤樣本TCGA-3B-A9HT(記作SARC3),變異數(shù)為1050個(gè),重跑流程,結(jié)果測(cè)得F值為0.8。
測(cè)試結(jié)果匯總?cè)缦拢?/p>
| Sample | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
|---|---|---|---|---|---|---|
| LUSC | 898 | 129 | 65 | 0.8744 | 0.9333 | 0.9029 |
| HNSC | 346 | 90 | 22 | 0.7936 | 0.9402 | 0.8607 |
| SARC1 | 115 | 71 | 19 | 0.6183 | 0.8603 | 0.7195 |
| SARC2 | 58 | 57 | 19 | 0.5043 | 0.7564 | 0.6052 |
| SARC3 | 901 | 306 | 145 | 0.7465 | 0.8620 | 0.8001 |
數(shù)據(jù)和方法都一樣,為什么準(zhǔn)確率還是不夠高呢?
最重要的原因之一:
從TCGA的官方數(shù)據(jù)分析流程中可以看到,TCGA Mutect2使用的PoN(Panel of Normals)是使用TCGA上幾千個(gè)正常人的血液樣本構(gòu)建的:
“The MuTect2 pipeline employs a "Panel of Normals" to identify additional germline mutations. This panel is generated using TCGA blood normal genomes from thousands of individuals that were curated and confidently assessed to be cancer-free. This method allows for a higher level of confidence to be assigned to somatic variants that were called by the MuTect2 pipeline.”
而我的pipeline上Mutect2使用的PoN是GATK官網(wǎng)上給出的,這個(gè)PoN是用千人基因組上的正常人樣本構(gòu)建。
GATK Doc#11136上指出:
"Ideally, the PoN includes samples that are technically representative of the tumor case sample--i.e. samples sequenced on the same platform using the same chemistry, e.g. exome capture kit, and analyzed using the same toolchain. However, even an unmatched PoN will be remarkably effective in filtering a large proportion of sequencing artifacts. This is because mapping artifacts and polymerase slippage errors occur for pretty much the same genomic loci for short read sequencing approaches."
也就是說構(gòu)建PoN的樣本與研究變異的樣本各方面條件應(yīng)該相似。我測(cè)試時(shí)用的是TCGA上的tumor-normal樣本,構(gòu)建PoN用的卻是千人基因組中的正常樣本,故得到的結(jié)果會(huì)與TCGA上的vcf有一定差異。
另外,TCGA官方流程還設(shè)置了--contamination_fraction_to_filter為0.02 ,以及使用了cosmic.vcf和dbsnp.vcf:
#GATK 3
java -jar GenomeAnalysisTK.jar \
-T MuTect2 \
-R <reference> \
-L <region> \
-I:tumor <tumor.bam> \
-I:normal <normal.bam> \
--normal_panel <pon.vcf> \
--cosmic <cosmic.vcf> \
--dbsnp <dbsnp.vcf> \
--contamination_fraction_to_filter 0.02 \
-o <mutect_variants.vcf> \
--output_mode EMIT_VARIANTS_ONLY \
--disable_auto_index_creation_and_locking_when_reading_rods
然而,將--contamination_fraction_to_filter從默認(rèn)值(0)改為0.02后,在LUSC和HNSC兩個(gè)樣本上測(cè)試,結(jié)果和之前完全一樣。所以這個(gè)參數(shù)應(yīng)該對(duì)結(jié)果影響不大,可暫時(shí)不理會(huì)。
GATK Mutect2 官網(wǎng)文檔中有關(guān)于--cosmic和--dbsnp兩個(gè)參數(shù)的描述:
“MuTect2 has the ability to use COSMIC data in conjunction with dbSNP to adjust the threshold for evidence of a variant in the normal. If a variant is present in dbSNP, but not in COSMIC, then more evidence is required from the normal sample to prove the variant is not present in germline.”
dbSNP數(shù)據(jù)可在GATK Resource Bundle中找到,而COSMIC相關(guān)數(shù)據(jù)可在官網(wǎng)中下載(編碼和非編碼區(qū)的變異數(shù)據(jù)都要下載),經(jīng)過一定處理后,得到最終的COSMIC數(shù)據(jù)文件。(合并,改坐標(biāo),排序,重建索引)
然而,GATK4已經(jīng)不支持這兩個(gè)參數(shù)了!而是改用了--germline-resource這個(gè)參數(shù)。這個(gè)參數(shù)接受的數(shù)據(jù)文件也在GATK Resource Bundle中給出。若想使用dbSNP和COSMIC的數(shù)據(jù),可考慮將Mutect2這一部分流程更換回GATK3,但是根據(jù)GATK論壇上官方人員所說,似乎這樣做沒有必要。
"I think the team found in testing that using a germline resource, PoN, and matched normal is enough to filter out possible germline variants. The variants in the tumor sample that pass the thresholds are most likely somatic mutations, so there is no need for a "whitelist"."
根據(jù)GATK4官方教程,調(diào)整Mutect2軟件參數(shù)
調(diào)整--af-of-alleles-not-in-resource 的值
師兄根據(jù)網(wǎng)上教程,將--af-of-alleles-not-in-resource 的值設(shè)為0.00003125。根據(jù)GATK官網(wǎng)教程的推薦,當(dāng)處理的數(shù)據(jù)為外顯子測(cè)序數(shù)據(jù)且--germline-resource用GATK Resource Bundle中的af-only-gnomad_grch38.vcf.gz時(shí),--af-of-alleles-not-in-resource 的值應(yīng)設(shè)為0.0000025。
更改參數(shù)值前后的結(jié)果對(duì)比如下(粗體為更改參數(shù)值后的結(jié)果):
| Sample | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
|---|---|---|---|---|---|---|
| LUSC | 898 | 129 | 65 | 0.8744 | 0.9333 | 0.9029 |
| LUSC | 912 | 137 | 51 | 0.8694 | 0.9477 | 0.9069 |
| HNSC | 346 | 90 | 22 | 0.7936 | 0.9402 | 0.8607 |
| HNSC | 350 | 92 | 18 | 0.7919 | 0.9511 | 0.8642 |
| SARC3 | 901 | 306 | 145 | 0.7465 | 0.8620 | 0.8001 |
| SARC3 | 906 | 318 | 140 | 0.7402 | 0.8668 | 0.7985 |
可以看到,修改此參數(shù)后,可提高Sensitivity,但會(huì)降低Precision,即會(huì)引入更多的假陽性結(jié)果。最終的F值在LUSC和HNSC樣本中稍有提升,但在SARC3樣本中反而略有下降??傮w而言F值變化不大。綜合考慮,F值不變的情況下,為了使變異盡可能被找到,靈敏度提升帶來的好處應(yīng)該要大于假陽性結(jié)果增加帶來的壞處。故--af-of-alleles-not-in-resource 可設(shè)為0.0000025。
進(jìn)一步添加參數(shù) --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter
“This filter removes from analysis paired reads whose mate maps to a different contig.”
GATK官網(wǎng)教程中添加了此參數(shù),禁用此過濾器,目的是為了將配對(duì)的read map到不同染色體上的那些reads也納入考慮,從而使得可供分析的read更多。經(jīng)過在LUSC、HNSC和SARC3三個(gè)樣本中測(cè)試,F(xiàn)值等結(jié)果和之前的毫無變化。
根據(jù)GATK4官方教程,調(diào)整過濾步驟的軟件參數(shù)
1. 分別對(duì)normal和tumor樣本做GetPileupSummaries
"CalculateContamination can operate in two modes. The command above uses the mode that simply estimates contamination for a given sample. The alternate mode incorporates the metrics for the matched normal, to enable a potentially more accurate estimate. For the second mode, run GetPileupSummaries on the normal sample and then provide the normal pileup table to CalculateContamination with the -matched argument."
修改后的代碼為:
## Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination.
## This estimation informs downstream filtering by FilterMutectCalls.
$gatk --java-options "-Xmx4G" GetPileupSummaries \
-I $normal_bam \
-L $GATK_bundle/small_exac_common_3.hg38.vcf.gz \
-V $GATK_bundle/small_exac_common_3.hg38.vcf.gz \
-O $outdir/other/${sample}_normal_getpileupsummaries.table || exit 1
$gatk --java-options "-Xmx4G" GetPileupSummaries \
-I $tumor_bam \
-L $GATK_bundle/small_exac_common_3.hg38.vcf.gz \
-V $GATK_bundle/small_exac_common_3.hg38.vcf.gz \
-O $outdir/other/${sample}_tumor_getpileupsummaries.table && \
echo -e "\n*** GetPileupSummaries done at $(date +"%T %F") ***\n" || exit 1
$gatk --java-options "-Xmx4G" CalculateContamination \
-I $outdir/other/${sample}_tumor_getpileupsummaries.table \
-matched $outdir/other/${sample}_normal_getpileupsummaries.table \
-O $outdir/other/${sample}_tumor_calculatecontamination.table && \
echo -e "\n*** CalculateContamination done at $(date +"%T %F") ***\n" || exit 1
然而,經(jīng)過在LUSC,HNSC和SARC3這三個(gè)樣本中測(cè)試,結(jié)果毫無變化。
最新的代碼還是決定保留這個(gè)修改。
2. 更換GetPileupSummaries步驟中所用的common germline variant sites VCF
GetPileupSummaries會(huì)接受一個(gè)common germline variant sites VCF,統(tǒng)計(jì)樣本reads在這些變異位點(diǎn)的分布情況(Summarizes counts of reads that support reference, alternate and other alleles for given sites)。這個(gè)文件之前用的是GATK Resource Bundle上給出的small_exac_common_3.hg38.vcf.gz,現(xiàn)根據(jù)網(wǎng)上一篇教程,嘗試自己生成這樣的VCF。
在GATK GetPileupSummaries工具文檔中可以看到:
The tool requires a common germline variant sites VCF, e.g. derived from the gnomAD resource, with population allele frequencies (AF) in the INFO field. This resource must contain only biallelic SNPs and can be an eight-column sites-only VCF. The tool ignores the filter status of the variant calls in this germline resource.
而這個(gè)網(wǎng)上教程的操作和這段話是吻合的。
首先,使用GATK SelectVariants,挑出GATK Resource Bundle中的af-only-gnomad.hg38.vcf.gz文件中的Biallelic SNP:
/home/yuanqm/diploma_project/Tools/gatk-4.0.11.0/gatk SelectVariants \
-R /home/yuanqm/diploma_project/Data/reference/hg38/Homo_sapiens_assembly38.fasta \
-V /home/yuanqm/diploma_project/Data/GATK_bundle/hg38_from_ck/af-only-gnomad.hg38.vcf.gz \
--select-type-to-include SNP \
--restrict-alleles-to BIALLELIC \
-O /home/yuanqm/diploma_project/Data/GATK_bundle/hg38_from_ck/af-only-gnomad.hg38.SNP_biallelic.vcf.gz
此時(shí),如果直接用af-only-gnomad.hg38.SNP_biallelic.vcf.gz這個(gè)文件,運(yùn)行舊的腳本,會(huì)出現(xiàn)兩個(gè)問題:
- 內(nèi)存溢出。此壓縮包有3.5G,解壓后更是達(dá)到10G以上,加載此文件會(huì)占用大量內(nèi)存,而--java-options "-Xmx4G"限制了可用內(nèi)存最大為4G。然而,即使調(diào)高JVM的最大可用內(nèi)存,或者去掉這一限制,仍然會(huì)報(bào)錯(cuò),因?yàn)镚etPileupSummaries -L和-V讀取這個(gè)VCF文件兩次,超過了本計(jì)算機(jī)的最大可用內(nèi)存(16G),故應(yīng)將程序遷移至服務(wù)器上運(yùn)行。
- 坐標(biāo)報(bào)錯(cuò):
A USER ERROR has occurred: Badly formed genome unclippedLoc:
Contig chr1_KI270766v1_alt given as location,
but this contig isn't present in the Fasta sequence dictionary.
解決方案:遍歷一次af-only-gnomad.hg38.SNP_biallelic.vcf.gz,檢查其中每一個(gè)變異的坐標(biāo),若其坐標(biāo)沒有出現(xiàn)在Homo_sapiens_assembly38.dict中,則將此條變異刪除。
另外,查看GetPileupSummaries之前所用的文件small_exac_common_3.hg38.vcf.gz可發(fā)現(xiàn)里面的變異只有chr1-23、chrX和chrY。
所以,若直接將af-only-gnomad.hg38.SNP_biallelic.vcf.gz中chr1-23、chrX和chrY的變異挑出,舍棄其他坐標(biāo)的變異,可以簡(jiǎn)化處理文件的流程,且得到的文件也能比之前的small_exac_common_3.hg38.vcf.gz有所改進(jìn)。
處理af-only-gnomad.hg38.SNP_biallelic.vcf.gz的流程如下:
cd /home/yuanqm/diploma_project/Data/GATK_bundle/hg38_from_ck
## 由于grep不能處理二進(jìn)制文件(如壓縮包),故應(yīng)先將af-only-gnomad.hg38.SNP_biallelic.vcf.gz解壓
grep '^#' af-only-gnomad.hg38.SNP_biallelic.vcf > af-only-gnomad.hg38.SNP_biallelic.selected.vcf
grep '^chr[1-9][[:blank:]]' af-only-gnomad.hg38.SNP_biallelic.vcf >> af-only-gnomad.hg38.SNP_biallelic.selected.vcf
grep '^chr1[0-9][[:blank:]]' af-only-gnomad.hg38.SNP_biallelic.vcf >> af-only-gnomad.hg38.SNP_biallelic.selected.vcf
grep '^chr2[0-3][[:blank:]]' af-only-gnomad.hg38.SNP_biallelic.vcf >> af-only-gnomad.hg38.SNP_biallelic.selected.vcf
grep '^chr[XYM][[:blank:]]' af-only-gnomad.hg38.SNP_biallelic.vcf >> af-only-gnomad.hg38.SNP_biallelic.selected.vcf
## 再將得到的文件使用bcftools壓縮和建索引
/home/yuanqm/diploma_project/Tools/bcftools/bin/bcftools \
view /home/yuanqm/diploma_project/Data/GATK_bundle/hg38_from_ck/af-only-gnomad.hg38.SNP_biallelic.selected.vcf \
-Oz -o /home/yuanqm/diploma_project/Data/GATK_bundle/hg38_from_ck/af-only-gnomad.hg38.SNP_biallelic.selected.vcf.gz
/home/yuanqm/diploma_project/Tools/bcftools/bin/bcftools index -t \
/home/yuanqm/diploma_project/Data/GATK_bundle/hg38_from_ck/af-only-gnomad.hg38.SNP_biallelic.selected.vcf.gz
12.25
使用af-only-gnomad.hg38.SNP_biallelic.selected.vcf.gz在服務(wù)器上重跑程序,運(yùn)行了將近20小時(shí)后,依然出現(xiàn)內(nèi)存溢出錯(cuò)誤。
因?yàn)闃颖臼鞘褂猛怙@子測(cè)序,故可將af-only-gnomad.hg38.SNP_biallelic.selected.vcf.gz中的外顯子變異挑出作為新的輸入,從而進(jìn)一步減少內(nèi)存負(fù)荷。
然而,GATK tutorial中指出:
So far, we have 3,695 calls, of which 2,966 are filtered and 729 pass as confident somatic calls. Of the filtered, contamination filters eight calls, all of which would have been filtered for other reasons. For the statistically inclined, this may come as a surprise. However, remember that the great majority of contaminant variants would be common germline alleles, for which we have in place other safeguards.
由此可見,基于樣本間污染的過濾,對(duì)最后結(jié)果的影響不會(huì)特別大,故此部分的優(yōu)化工作可暫時(shí)放一放。
12.26
將af-only-gnomad.hg38.SNP_biallelic.selected.vcf.gz中的外顯子變異挑出作為新的輸入:
$gatk SelectVariants \
-R path/Homo_sapiens_assembly38.fasta \
-V path/af-only-gnomad.hg38.SNP_biallelic.selected.vcf.gz \
-L path/liftover_37To38_exome.targets.bed \
-O path/af-only-gnomad.hg38.SNP_biallelic.selected.exome.vcf.gz
-L參數(shù)接受的文件標(biāo)識(shí)了外顯子的區(qū)域,也就是要挑選的區(qū)域。
此文件的獲取方法如下:
首先在1000 Genome的網(wǎng)站上可獲得1KGP.exome.targets.bed,此文件的坐標(biāo)為hg19,然后使用liftOver工具進(jìn)行坐標(biāo)轉(zhuǎn)換,得到liftover_37To38_exome.targets.bed。需要注意的是,為了減少失敗的轉(zhuǎn)換,根據(jù)開發(fā)者的建議,可勾選“Allow multiple output regions”選項(xiàng),“Minimum ratio of bases that must remap”的值也可嘗試適當(dāng)調(diào)低(本人沒有更改這個(gè)值)。
af-only-gnomad.hg38.SNP_biallelic.selected.exome.vcf.gz約60M,不用擔(dān)心內(nèi)存溢出的問題,將此文件作為GetPileupSummaries -L和-V的參數(shù),得到的結(jié)果如下:
| Sample | True-pos | False-pos | False-neg | Precision | Sensitivity | F-measure | |
|---|---|---|---|---|---|---|---|
| LUSC | 912 | 137 | 51 | 0.8694 | 0.9477 | 0.9069 | |
| Previous | HNSC | 350 | 92 | 18 | 0.7919 | 0.9511 | 0.8642 |
| SARC | 906 | 318 | 140 | 0.7402 | 0.8668 | 0.7985 | |
| LUSC | 912 | 136 | 51 | 0.8702 | 0.9477 | 0.9073 | |
| Current | HNSC | 350 | 91 | 18 | 0.7937 | 0.9511 | 0.8653 |
| SARC | 906 | 320 | 140 | 0.7390 | 0.8668 | 0.7978 |
由此可見,最終結(jié)果與之前相比確實(shí)變化不大,這與GATK tutorial中“基于污染的過濾重要性相對(duì)較小”的觀點(diǎn)是吻合的。
最終的腳本還是選擇用回之前GATK Resource Bundle中給出的small_exac_common_3.hg38.vcf.gz。
總結(jié):
將三個(gè)樣本的數(shù)據(jù)綜合起來,可得到GATK4的最終結(jié)果(以TCGA上用GATK3得到的VCF作為基準(zhǔn)):
| True-pos | False-pos | False-neg | Precision | Sensitivity | F-measure |
|---|---|---|---|---|---|
| 2168 | 547 | 209 | 0.7985 | 0.9121 | 0.8515 |
由于TCGA上用GATK3得到的VCF實(shí)際上不是金標(biāo)準(zhǔn),下面這種表述方式會(huì)更好:
| Overlap | V1 | V2 | Overlap Rate |
|---|---|---|---|
| 2168 | 547 | 209 | 74.15% |
Overlap: Variants that are observed both in the VCF that GATK4 derived and in the VCF that TCGA derived using GATK3
V1: Variants that are observed in the VCF that GATK4 derived but not in the VCF that TCGA derived using GATK3
V2: Variants that are observed in the VCF that TCGA derived using GATK3 but not in the VCF that GATK4 derived
Overlap Rate = Overlap / (Overlap + V1 + V2)
根據(jù)以上的結(jié)果,可初步判斷我的GATK4變異檢測(cè)流程應(yīng)大體搭建正確,其與TCGA上GATK3檢測(cè)得到的VCF存在差異可能是由于軟件版本不同以及PON不同等原因。
使用GATK3進(jìn)行變異檢測(cè)
由于TCGA數(shù)據(jù)庫上用的Mutect2是舊版的(GATK3),故現(xiàn)在嘗試也使用GATK3,看是否能復(fù)現(xiàn)結(jié)果。
GATK3也可在GATK官網(wǎng)中下載,下載下來得到一個(gè)jar包(不同于GATK4有一堆文件)。
值得注意的是,在GATK3的Mutect2中自帶過濾功能,也就是說Mutect2得到的VCF的Filter域已經(jīng)被標(biāo)識(shí)了PASS或被過濾的原因。故使用GATK3時(shí),不需要也不能夠使用FilterMutectCalls,故GetPileupSummaries和CalculateContamination也是不需要的??梢栽谧詈笫褂肧electVariants挑出PASS的變異,也可以省略此步。
#!/bin/bash
# Test the somatic calling pipeline using data from TCGA database.
# Bam files from TCGA have done "Markduplicates", indel realignment and BQSR.
# set some paths
tool_path="$HOME/diploma_project/Tools"
data_path="$HOME/diploma_project/Data"
gatk4="${tool_path}/gatk-4.0.11.0/gatk"
gatk3="${tool_path}/GenomeAnalysisTK-3.8-1-0/GenomeAnalysisTK.jar"
# buid_version="hg38"
reference="${data_path}/reference/hg38"
GATK_bundle="${data_path}/GATK_bundle/hg38_from_ck"
if [ $# -lt 4 ]; then
echo "usage: $0 sample_name normal_bam tumor_bam outdir"
exit 1
fi
sample=$1 #e.g. TCGA-A6-6650 (patient ID)
normal_bam=$2
tumor_bam=$3
outdir=$4
outdir=${outdir%/} #get rid of the last "/" if exist
if [ ! -d $outdir ]; then
mkdir $outdir
fi
if [ ! -d $outdir/chromosomes ]; then
mkdir $outdir/chromosomes
fi
echo -e "\n***Output to $outdir***\n"
echo -e "***Started at $(date +"%T %F")***\n"
chroms="chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY"
for i in $chroms; do
java -Xmx1g -jar $gatk3 \
-T MuTect2 \
-R $reference/Homo_sapiens_assembly38.fasta \
-L $i \
-I:tumor $tumor_bam \
-I:normal $normal_bam \
--normal_panel $GATK_bundle/somatic-hg38-1000g_pon.hg38.vcf.gz \
--cosmic $data_path/GATK_bundle/other_data/hg38_cosmic_data/Cosmic_coding_and_noncoding_hg38_sorted.vcf.gz \
--dbsnp $GATK_bundle/dbsnp_146.hg38.vcf.gz \
--contamination_fraction_to_filter 0.02 \
-o $outdir/chromosomes/${sample}_somatic.${i}.vcf.gz \
--output_mode EMIT_VARIANTS_ONLY \
--disable_auto_index_creation_and_locking_when_reading_rods && \
echo -e "\n*** Mutect2 $outdir/chromosomes/${sample}_somatic.${i}.vcf.gz done at $(date +"%T %F") ***\n" || exit 1 & #put the process background
done && wait #wait until all the process finish
merge_vcfs_cmd=""
for i in $chroms; do
merge_vcfs_cmd=${merge_vcfs_cmd}"-I $outdir/chromosomes/${sample}_somatic.${i}.vcf.gz "
done && $gatk4 MergeVcfs ${merge_vcfs_cmd} -O $outdir/${sample}_somatic_filtered.vcf.gz && \
echo -e "\n*** MergeVcfs ${outdir}/${sample}_somatic_filtered.vcf.gz done at $(date +"%T %F") ***\n" || exit 1
## Select a subset of variants from a VCF file
$gatk4 SelectVariants \
-V $outdir/${sample}_somatic_filtered.vcf.gz \
-O $outdir/${sample}_somatic_filtered.PASS.vcf.gz \
--exclude-filtered && \
echo -e "\n*** SelectVariants: $outdir/${sample}_somatic_filtered.PASS.vcf.gz done at $(date +"%T %F") ***\n"
## Remove some useless files.
rm -r -f $outdir/chromosomes
若把TCGA數(shù)據(jù)庫上用Mutect2得到的VCF當(dāng)作金標(biāo)準(zhǔn),將我寫的GATK3腳本找到的變異與之比較,可得如下結(jié)果:
My GATK3 vs GATK3 in TCGA:
| Sample | True-pos | False-pos | False-neg | Precision | Sensitivity | F-measure |
|---|---|---|---|---|---|---|
| LUSC | 974 | 55 | 1 | 0.9466 | 0.9990 | 0.9721 |
| HNSC | 366 | 34 | 2 | 0.9150 | 0.9946 | 0.9531 |
| SARC | 988 | 112 | 63 | 0.8982 | 0.9401 | 0.9186 |
將三個(gè)樣本的數(shù)據(jù)綜合起來,可得到我的GATK3的最終結(jié)果(以TCGA上用GATK3得到的VCF作為基準(zhǔn)):
| True-pos | False-pos | False-neg | Precision | Sensitivity | F-measure |
|---|---|---|---|---|---|
| 2328 | 201 | 66 | 0.9205 | 0.9724 | 0.9457 |
由于TCGA上用GATK3得到的VCF實(shí)際上不是金標(biāo)準(zhǔn),可使用下面這種表述方式:
| Overlap | V1 | V2 | Overlap Rate |
|---|---|---|---|
| 2328 | 201 | 66 | 89.71% |
Overlap: Variants that are observed both in the VCF that my GATK3 derived and in the VCF that TCGA derived using GATK3
V1: Variants that are observed in the VCF that my GATK3 derived but not in the VCF that TCGA derived using GATK3
V2: Variants that are observed in the VCF that TCGA derived using GATK3 but not in the VCF that my GATK3 derived
Overlap Rate = Overlap / (Overlap + V1 + V2)
可見,我的GATK3得到的結(jié)果與TCGA上的十分接近,但由于PON的不同,結(jié)果有差別是不可避免的。由此結(jié)果可以肯定我搭建的GATK3流程基本正確。
利用TCGA上其他三個(gè)工具得到的VCF構(gòu)建近似的金標(biāo)準(zhǔn),用于比較GATK3與GATK4的性能
由上述結(jié)果可知,我搭建的GATK3和GATK4的流程應(yīng)該基本正確。現(xiàn)綜合TCGA上其他三個(gè)工具得到的VCF,規(guī)定被大于等于兩個(gè)軟件找出的變異為高度可信的變異,從而構(gòu)建近似的金標(biāo)準(zhǔn)。用此金標(biāo)準(zhǔn),比較TCGA上的GATK3,我的GATK3以及我的GATK4的性能。
構(gòu)建金標(biāo)準(zhǔn)
VCF文件預(yù)處理:
VarScan2的VCF注釋中有這一行:
“##FORMAT=<ID=DP4,Number=1,Type=String,Description="Strand read counts: ref/fwd, ref/rev, var/fwd, var/rev">”
而SomaticSniper的VCF注釋中有類似的一行:
“##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">”
如果直接用CombineVariants合并兩個(gè)VCF會(huì)報(bào)沖突錯(cuò)誤?,F(xiàn)手動(dòng)將VarScan2的VCF的此行注釋改成SomaticSniper的格式。
另外,來自SomaticSniper和VarScan2的VCF中的變異均為過濾后的變異,而來自Muse的VCF中的變異有PASS的也有被過濾掉的變異。應(yīng)挑出PASS變異,使得后續(xù)合并后的VCF中只包含PASS的變異,以方便挑出出現(xiàn)在2個(gè)及以上VCF的變異。
## 代碼示例
gatk SelectVariants \
-V '/home/yuanqm/diploma_project/Data/TCGA/SARC/TCGA-3B-A9HT-vcf/MuSE/488e3499-debf-4d33-8d09-4ce4e8c501a4.vcf' \
--exclude-filtered \
-O '/home/yuanqm/diploma_project/Data/TCGA/SARC/TCGA-3B-A9HT-vcf/MuSE/488e3499-debf-4d33-8d09-4ce4e8c501a4.PASS.vcf'
使用CombineVariants合并三個(gè)VCF:
注意:CombineVariants是GATK3的軟件
## 代碼示例
java -jar .../diploma_project/Tools/GenomeAnalysisTK-3.8-1-0/GenomeAnalysisTK.jar \
-T CombineVariants \
-R .../diploma_project/Data/reference/hg38/Homo_sapiens_assembly38.fasta \
--variant:VarScan2 .../diploma_project/Data/TCGA/SARC/TCGA-3B-A9HT-vcf/VarScan2/b25b354f-75a8-4309-9e9a-95b0cf7149af.vcf \
--variant:MuSE .../diploma_project/Data/TCGA/SARC/TCGA-3B-A9HT-vcf/MuSE/488e3499-debf-4d33-8d09-4ce4e8c501a4.PASS.vcf \
--variant:SomaticSniper .../diploma_project/Data/TCGA/SARC/TCGA-3B-A9HT-vcf/SomaticSniper/14f0b05f-4b9f-4135-aa44-1f9bcc7fc59f.vcf \
-o .../diploma_project/Data/TCGA/SARC/TCGA-3B-A9HT-vcf/SARC3_union3_PASS.vcf \
-genotypeMergeOptions PRIORITIZE \
-priority VarScan2,MuSE,SomaticSniper
PRIORITIZE這個(gè)模式的效果是當(dāng)一個(gè)變異出現(xiàn)在多個(gè)文件中時(shí),合并的VCF只會(huì)保留其中一條記錄。具體保留哪條的優(yōu)先順序由-priority參數(shù)定義。這里將VarScan2的合并優(yōu)先級(jí)定為最高是因?yàn)镮MPACT數(shù)據(jù)庫是用VarScan2找的變異,感覺它應(yīng)該比較靠譜;SomaticSniper得到的VCF的Filtered域全為“.”,且變異記錄是最多的,感覺不是很靠譜,故優(yōu)先級(jí)定為最低。
挑出高度可信的變異:
gatk SelectVariants \
-V '/home/yuanqm/diploma_project/Data/TCGA/SARC/TCGA-3B-A9HT-vcf/SARC3_union3_PASS.vcf' \
-select "set == 'VarScan2-SomaticSniper' || set == 'VarScan2-MuSE' || set == 'MuSE-SomaticSniper' || set == 'Intersection'" \
-O '/home/yuanqm/diploma_project/Data/TCGA/SARC/TCGA-3B-A9HT-vcf/SARC3_union3_PASS_confident.vcf'
SelectVariants -select參數(shù)接受的是"JEXL expressions",教程見此。
GATK3 vs GATK4
| Sample | Tool | True-pos | False-pos | False-neg | Precision | Sensitivity | F-measure |
|---|---|---|---|---|---|---|---|
| GATK3_TCGA | 842 | 133 | 178 | 0.8636 | 0.8255 | 0.8441 | |
| LU | My_GATK3 | 880 | 149 | 140 | 0.8552 | 0.8627 | 0.8590 |
| SC | My_GATK4 | 862 | 187 | 151 | 0.8217 | 0.8520 | 0.8366 |
| My_GATK4.1 | 862 | 186 | 151 | 0.8225 | 0.8520 | 0.8370 | |
| GATK3_TCGA | 273 | 95 | 110 | 0.7418 | 0.7128 | 0.7270 | |
| HN | My_GATK3 | 283 | 117 | 100 | 0.7075 | 0.7389 | 0.7229 |
| SC | My_GATK4 | 290 | 152 | 93 | 0.6561 | 0.7572 | 0.703 |
| My_GATK4.1 | 290 | 151 | 93 | 0.6576 | 0.7572 | 0.7039 | |
| GATK3_TCGA | 554 | 497 | 238 | 0.5271 | 0.6995 | 0.6012 | |
| SA | My_GATK3 | 596 | 504 | 196 | 0.5418 | 0.7525 | 0.6300 |
| RC | My_GATK4 | 610 | 614 | 179 | 0.4984 | 0.7740 | 0.6063 |
| My_GATK4.1 | 610 | 616 | 179 | 0.4976 | 0.7740 | 0.6057 |
綜合三個(gè)樣本的結(jié)果如下:
| Tool | True-pos | False-pos | False-neg | Precision | Sensitivity | F-measure |
|---|---|---|---|---|---|---|
| GATK3_TCGA | 1669 | 725 | 526 | 0.6972 | 0.7604 | 0.7274 |
| My_GATK3 | 1759 | 770 | 436 | 0.6955 | 0.8014 | 0.7447 |
| My_GATK4 | 1762 | 953 | 423 | 0.6490 | 0.8064 | 0.7192 |
| My_GATK4.1 | 1762 | 953 | 423 | 0.6490 | 0.8064 | 0.7192 |
其中,My_GATK4的GetPileupSummaries步驟用的是原始的small_exac_common_3.hg38.vcf.gz,而My_GATK4.1用的是自己構(gòu)建的af-only-gnomad.hg38.SNP_biallelic.selected.exome.vcf.gz,結(jié)果和之前一樣,二者無顯著差異。
另外,可見我搭建的GATK4效果相對(duì)最差。這樣的結(jié)果也不算很奇怪,因?yàn)楣倬W(wǎng)說GATK4很多功能還在測(cè)試開發(fā)當(dāng)中,不適合用作生產(chǎn)。另外一種可能是我搭建的GATK4的參數(shù)還未調(diào)至最優(yōu),或者是這個(gè)近似金標(biāo)準(zhǔn)本身就不完全正確。
最后,可見我搭建的GATK3結(jié)果要比TCGA上的GATK3的結(jié)果要好。
最終的癌癥病人基因突變分析流程可以使用GATK3來搭建。