VCF文件全稱為Variant Call Format,表示基因組的變異信息,通常為GATK和Samtools軟件處理所得到。
VCF文件大致可以分為兩個部分:
1、以##開頭的頭文件信息
##fileformat=VCFv4.2
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.5-0-g36282e4,Date="Tue Apr 03 19:35:05 CST 2018",Epoch=1522755305379,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[/opt/NfsDir/UserDir/wujh/Project/PrecisionFDA/Data_AshkenazimTrio//son/son.recal.bam] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] disable_read_filter=[] intervals=[/opt/NfsDir/UserDir/wujh/Project/PrecisionFDA/Data_AshkenazimTrio/ccds.interval.list] excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/opt/NfsDir/PublicDir/reference/ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=500 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 logging_level=INFO log_to_file=null help=false version=false out=/opt/NfsDir/UserDir/wujh/Project/PrecisionFDA/Data_AshkenazimTrio/son/son.raw.vcf likelihoodCalculationEngine=PairHMM heterogeneousKmerSizeResolution=COMBO_MIN dbsnp=(RodBinding name= source=UNBOUND) dontTrimActiveRegions=false maxDiscARExtension=25 maxGGAARExtension=300 paddingAroundIndels=150 paddingAroundSNPs=20 comp=[] annotation=[RMSMappingQuality, BaseCounts] excludeAnnotation=[] group=[Standard, StandardHCAnnotation] debug=false useFilteredReadsForAnnotations=false emitRefConfidence=NONE bamOutput=null bamWriterType=CALLED_HAPLOTYPES disableOptimizations=false annotateNDA=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 standard_min_confidence_threshold_for_calling=50.0 standard_min_confidence_threshold_for_emitting=10.0 max_alternate_alleles=6 input_prior=[] sample_ploidy=2 genotyping_mode=DISCOVERY alleles=(RodBinding name= source=UNBOUND) contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=null exactcallslog=null output_mode=EMIT_VARIANTS_ONLY allSitePLs=false gcpHMM=10 pair_hmm_implementation=VECTOR_LOGLESS_CACHING pair_hmm_sub_implementation=ENABLE_ALL always_load_vector_logless_PairHMM_lib=false phredScaledGlobalReadMismappingRate=45 noFpga=false sample_name=null kmerSize=[10, 25] dontIncreaseKmerSizesForCycles=false allowNonUniqueKmersInRef=false numPruningSamples=1 recoverDanglingHeads=false doNotRecoverDanglingBranches=false minDanglingBranchLength=4 consensus=false maxNumHaplotypesInPopulation=128 errorCorrectKmers=false minPruning=2 debugGraphTransformations=false allowCyclesInKmerGraphToGeneratePaths=false graphOutput=null kmerLengthForReadErrorCorrection=25 minObservationsForKmerToBeSolid=20 GVCFGQBands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 99] indelSizeToEliminateInRefModel=10 min_base_quality_score=10 includeUmappedReads=false useAllelesTrigger=false doNotRunPhysicalPhasing=true keepRG=null justDetermineActiveRegions=false dontGenotype=false dontUseSoftClippedBases=false captureAssemblyFailureBAM=false errorCorrectReads=false pcr_indel_model=CONSERVATIVE maxReadsInRegionPerSample=10000 minReadsPerAlignmentStart=10 mergeVariantsViaLD=false activityProfileOut=null activeRegionOut=null activeRegionIn=null activeRegionExtension=null forceActive=false activeRegionMaxSize=null bandPassSigma=null maxProbPropagationDistance=50 activeProbabilityThreshold=0.002 min_mapping_quality_score=20 filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##GATKCommandLine.SelectVariants=<ID=SelectVariants,Version=3.5-0-g36282e4,Date="Wed Jun 06 09:33:03 CST 2018",Epoch=1528248783862,CommandLineOptions="analysis_type=SelectVariants input_file=[] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/opt/NfsDir/PublicDir/reference/ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 logging_level=INFO log_to_file=null help=false version=false variant=(RodBinding name=variant source=/opt/NfsDir/UserDir/wujh/Project/PrecisionFDA/Hardfilter_optimize/son.raw.vcf) discordance=(RodBinding name= source=UNBOUND) concordance=(RodBinding name= source=UNBOUND) out=/opt/NfsDir/UserDir/wujh/Project/PrecisionFDA/Hardfilter_optimize/SNP/HG002_SNP.vcf sample_name=[] sample_expressions=null sample_file=null exclude_sample_name=[] exclude_sample_file=[] exclude_sample_expressions=[] selectexpressions=[] invertselect=false excludeNonVariants=false excludeFiltered=false preserveAlleles=false removeUnusedAlternates=false restrictAllelesTo=ALL keepOriginalAC=false keepOriginalDP=false mendelianViolation=false invertMendelianViolation=false mendelianViolationQualThreshold=0.0 select_random_fraction=0.0 remove_fraction_genotypes=0.0 selectTypeToInclude=[SNP] selectTypeToExclude=[] keepIDs=null excludeIDs=null fullyDecode=false justRead=false maxIndelSize=2147483647 minIndelSize=0 maxFilteredGenotypes=2147483647 minFilteredGenotypes=0 maxFractionFilteredGenotypes=1.0 minFractionFilteredGenotypes=0.0 setFilteredGtToNocall=false ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES=false forceValidOutput=false filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
......
##contig=<ID=chrUn_gl000248,length=39786,assembly=hg19>
##contig=<ID=chrUn_gl000249,length=38502,assembly=hg19>
##reference=file:///opt/NfsDir/PublicDir/reference/ucsc.hg19.fasta
##source=SelectVariants
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT son
頭文件信息主要包括vcf文件版本、FORMAT、INFO、參考基因組以及執(zhí)行程序等信息。
表頭各列含義詳解:
1. CHROM(chromosome):染色體
2. POS 變異位點在參考基因組中的位置
3. ID - identifier: variant的ID。比如在dbSNP中有該SNP的id,則會在此行給出;若沒有,則用’.'表示其為一個novel variant。
4. REF - reference base(s):參考堿基,染色體上面的堿基,必須是ATCGN中的一個,N表示不確定堿基
5. ALT - alternate base(s):與參考序列比較發(fā)生突變的堿基
6. QUAL - quality: Phred格式(Phred_scaled)的質(zhì)量值,表 示在該位點存在variant的可能性;該值越高,則
variant的可能性越大;計算方法:Phred值 = -10 * log (1-p) p為variant存在的概率; 通過計算公式
可以看出值為10的表示錯誤概率為0.1,該位點為variant的概率為90%。
7. FILTER - _filter status: 使用上一個QUAL值來進行過濾的話,是不夠的。GATK能使用其它的方法來進行過濾,過濾結(jié)果中通過則該值為”PASS”;若variant不可靠,則該項不為”PASS”或”.”。
8. INFO - additional information: 這一行是variant的詳細(xì)信息,具體如下:
#DP-read depth:樣本在這個位置的reads覆蓋度。是一些reads被過濾掉后的覆蓋度。DP4:高質(zhì)量測序堿基,位于REF或者ALT前后
#QD:通過深度來評估一個變異的可信度。Variant call confidence normalized by depth of sample reads supporting a variant
#MQ:表示覆蓋序列質(zhì)量的均方值RMS Mapping Quality
#FQ:phred值關(guān)于所有樣本相似的可能性
#AC,AF 和 AN:AC(Allele Count) 表示該Allele的數(shù)目;AF(Allele Frequency) 表示Allele的頻率; AN(Allele Number) 表示Allele的總數(shù)目。
對于1個diploid sample而言:則基因型 0/1 表示sample為雜合子,Allele數(shù)為1(雙倍體的sample在該位點只有1個等位基因發(fā)生了突變),
Allele的頻率為0.5(雙倍體的sample在該位點只有50%的等位基因發(fā)生了突變),總的Allele為2; 基因型 1/1 則表示sample為純合的,Allele數(shù)為2,Allele的頻率為1,總的Allele為2。
#MLEAC:Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed
#MLEAF:Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed
#BaseQRankSum 比較支持變異的堿基和支持參考基因組的堿基的質(zhì)量,負(fù)值表示支持變異的堿基質(zhì)量值不及支持參考基因組的,
正值則相反,支持變異的質(zhì)量值好于參考基因組的。0表示兩者無明顯差異。
#FS 使用F檢驗來檢驗測序是否存在鏈偏好性。鏈偏好性可能會導(dǎo)致變異等位基因檢測出現(xiàn)錯誤。輸出值Phred-scaled p-value,值越大越可能出現(xiàn)鏈偏好性。
#InbreedingCoeff 使用似然法檢驗樣本間的近交系數(shù)(又或者稱為近親關(guān)系)。值越高越可能是近親繁殖。
#MQRankSum 比較支持變異的序列和支持參考基因組的序列的質(zhì)量,負(fù)值表示支持變異的堿基質(zhì)量值不及支持參考基因組的,只針對雜合。
正值則相反,支持變異的質(zhì)量值好于參考基因組的。0表示兩者無明顯差異。實際應(yīng)用中一般過濾掉較小的負(fù)值。
#BaseCounts 所有樣本在變異位點ATCG的數(shù)量
#ClippingRankSum 同前面兩個類似,負(fù)值表示支持變異的read有更的的hard-clip堿基,正值表示支持參考基因組的的read有更多的hard-clip。0最好,無論是正值還是負(fù)值都表示可能可能存在人為偏差。
#ReadPosRankSum 檢測變異位點是否有位置偏好性(是否存在于序列末端,此時往往容易出錯)。最佳值為0,表示變異與其在序列上的位置無關(guān)。負(fù)值表示變異位點更容易在末端出現(xiàn),正值表示參考基因組中的等位基因更容易在末端出現(xiàn)。
#ExcessHet 檢測這些樣本的相關(guān)性,與InbreedingCoeff相似,值越大越可能是錯誤。
#LikelihoodRankSum 評價支持變異和ref的序列與best hyplotype的匹配性,0為最佳值。負(fù)值表示支持變異的read匹配度不及支持ref的匹配度,正值則相反。值越大表示越可能是出現(xiàn)了錯誤。
#HaplotypeScore 分?jǐn)?shù)越高越可能出現(xiàn)錯誤。Higher scores are indicative of regions with bad alignments, typically leading to artifactual SNP and indel calls.
#SOR:也是一個用來評估是否存在鏈偏向性的參數(shù),相當(dāng)于FS的升級版。The StrandOddsRatio annotation is one of several methods that aims to evaluate whether there is strand bias in the data. It is an updated form of the Fisher Strand Test that is better at taking into account large amounts of data in high coverage situations. It is used to determine if there is strand bias between forward and reverse strands for the reference or alternate allele. The reported value is ln-scaled.
#IS:插入缺失或部分插入缺失的reads允許的最大數(shù)量
#G3:ML 評估基因型出現(xiàn)的頻率
#HWE:chi^2基于HWE的測試p值和G3
#CLR:在受到或者不受限制的情況下基因型出現(xiàn)可能性log值
#UGT:最可能不受限制的三種基因型結(jié)構(gòu)
#CGT:最可能受限制三種基因型的結(jié)構(gòu)
#PV4:四種P值得誤差,分別是(strand、baseQ、mapQ、tail distance bias)
#INDEL:表示該位置的變異是插入缺失
#PC2:非參考等位基因的phred(變異的可能性)值在兩個分組中大小不同
#PCHI2:后加權(quán)chi^2,根據(jù)p值來測試兩組樣本之間的聯(lián)系
#QCHI2:Phred scaled PCHI2
#PR:置換產(chǎn)生的一個較小的PCHI2
#QBD:Quality by Depth,測序深度對質(zhì)量的影響
#RPB:序列的誤差位置(Read Position Bias)
#MDV:樣本中高質(zhì)量非參考序列的最大數(shù)目
#VDB:Variant Distance Bias,RNA序列中過濾人工拼接序列的變異誤差范圍
9. FORMAT 和最后一列sample中的信息是對應(yīng)的
#AD 和 DP:AD(Allele Depth)為sample中每一種allele的reads覆蓋度,在diploid中則是用逗號分割的兩個值,
前者對應(yīng)ref基因型,后者對應(yīng)variant基因型; DP(Depth)為sample中該位點的覆蓋度。
#GT:樣品的基因型(genotype)。兩個數(shù)字中間用’/'分 開,這兩個數(shù)字表示雙倍體的sample的基因型。0 表示樣品中有ref的allele;
1表示樣品中variant的allele; 2表示有第二個variant的allele。因此: 0/0 表示sample中該位點為純合的,和ref一致; 0/1 表示sample中該位點為雜合的,有ref和variant兩個基因型; 1/1 表示sample中該位點為純合的,和variant一致。
#GQ:即第二可能的基因型的PL值,相對于最可能基因型的PL值(其PL=0)而言,大于99時,其信息量已不大,因此大于99的全部賦值99。當(dāng)GQ值很小時,意味著第二可能基因型與最可能基因型差別不大。
#GL:三種基因型(RR RA AA)出現(xiàn)的可能性,R表示參考堿基,A表示變異堿基
#DV:高質(zhì)量的非參考堿基
#SP:phred的p值誤差線
#PL:指定的三種基因型的可能性(provieds the likelihoods of the given genotypes)。這三種指定的基因型為(0/0,0/1,1/1),這三種基因型的概率總和為1。
和之前不一致,該值越大,表明為該種基因型的可能性越小。 Phred值 = -10 * log (p) p為基因型存在的概率。