久久一二≡四区,国产专题一区二区,久久对白国产

00 寫在前面

僅針對(duì)人類WGS或WES數(shù)據(jù)，供參考。
時(shí)間管理某一點(diǎn)：能自動(dòng)化的工作盡量自動(dòng)化，不要時(shí)間用在毫無意義的重復(fù)上。腦袋好比肌肉，需要不斷提高負(fù)重的鍛煉才能達(dá)到“有效鍛煉”。

01 軟件準(zhǔn)備

這里主要使用GATK的best practice進(jìn)行數(shù)據(jù)處理，GATK一直在更新維護(hù)，建議下載最新版，這里用的是GenomeAnalysisTK-3.8-0-ge9d806836。參考鏈接如下：https://software.broadinstitute.org/gatk/
GATK的維護(hù)做得很棒，在使用過程中我遇到的所有問題（各種報(bào)錯(cuò)信息）都能在官網(wǎng)搜索到解決方案。擅于搜索，也是學(xué)習(xí)。

流程化操作主要包括以下三部分：

測(cè)序數(shù)據(jù)評(píng)估及質(zhì)控：評(píng)估測(cè)序質(zhì)量（fastqc,multiqc)，去除接頭（cutadapt),去除reads兩端低質(zhì)量堿基（fastx_toolkit)
SNV calling:序列比對(duì)（bwa）,索引建立及統(tǒng)計(jì)比對(duì)情況（samtools）,建立索引（ picard）,SNV檢測(cè)（gatk）
SNV 過濾：gatk，數(shù)據(jù)庫。

優(yōu)先使用conda安裝軟件，網(wǎng)絡(luò)OK的情況下，方便快捷，可以很好地解決依賴問題；不行的話，下載安裝至biosoft/下

conda install fastqc
conda install multiqc
···
picard="/biosoft/picard-tools-1.124/picard.jar"  #添加到.bash_profile
java -jar $picard -h
GATK=/biosoft/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar  #添加到.bash_profile
java -jar $GATK -h

02 數(shù)據(jù)庫準(zhǔn)備

下載參考基因組
常用版本為hg19和hg38，建議使用hg38(最新版）。網(wǎng)速不給力時(shí)，用Aspera下載，非?？臁?br> 下載鏈接：http://hgdownload.cse.ucsc.edu/downloads.html
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/

#下載某條染色體：如chr22
mkdir DB/  && cd DB/
DB="~/../DB/"
wget --timestamping  'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz'
gzip -d chr22.fq.gz

#下載hg19
wget --timestamping  'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/*';  
md5sum filename  #下載的文件較多時(shí)，校驗(yàn)md5以確定文件是完整的
#合并為hg19.fa
gunzip chr[1-22].gz
gunzip chrX.gz chrY.gz chrM.gz
for i in $(seq 1 22) X Y M; do cat chr${i}.fa >>hg19.fa; done
rm -fr chr*.fasta

#下載hg38
wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz';

下載數(shù)據(jù)庫
從GATK resource bundle下載dbSNP,1000G相關(guān)數(shù)據(jù)，用于GATK best practice過程中對(duì)變異進(jìn)行校正。

GATK_FTP=ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19
wget $GATK_FTP/dbsnp_138.hg19.vcf.gz
wget $GATK_FTP/1000G_phase1.indels.hg19.sites.vcf.gz
wget $GATK_FTP/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz

03 測(cè)序數(shù)據(jù)質(zhì)量評(píng)估及質(zhì)控

03-1 質(zhì)量評(píng)估Fastqc, multiqc

一般公司提交的是初步質(zhì)控處理過的clean data，最好把raw data一并要過來。一般公司會(huì)對(duì)數(shù)據(jù)進(jìn)行基本的質(zhì)控，但我們還要再確認(rèn)一次，為什么呢？個(gè)人遇到的情況包括以下幾種：

部分樣本的接頭可能去得不完全
此外，為了數(shù)據(jù)量好看（更大），一般質(zhì)量過得去的情況下，公司不會(huì)對(duì)reads前后端測(cè)序質(zhì)量較低的堿基進(jìn)行剪切，需要自己根據(jù)情況進(jìn)行處理
有的公司對(duì)不同樣本處理不一致，每個(gè)樣本的clean data中reads長(zhǎng)度可能不一致，有些后續(xù)分析是要求不同樣本的reads一致
以上情況發(fā)生時(shí)，就需要自己對(duì)數(shù)據(jù)進(jìn)行（再次）質(zhì)控了。

cd fqDIR/  #fq文件存放位置
mkdir qcDIR/ qcDIR_multiqc/  #qc結(jié)果存放位置; multiqc結(jié)果存放位置
gzip sample1_1.fq.gz  #寫Shell腳本批量操作
echo "Spend time:$SECONDS\n" #This line will tell u how long it takes to finish the job
fastqc sample1_1.fq -o qcDIR/ -t 6 -d qcDIR/ >>fastqc.r.log 2>>fastqc.e.log

#利用multiqc整合結(jié)果，方便批量查看
multiqc qcDIR/ -o qcDIR_multiqc/

結(jié)果如下（fastqc_per_base_sequence_quality_plot）所示

03-2 去除INDEX和接頭

一般測(cè)序，每個(gè)樣本有獨(dú)特的INDEX，還有通用的接頭。需要與公司確認(rèn)好具體的建庫方式，INDEX序列，兩端接頭序列(注意方向），不然一不小心切錯(cuò)了都不知道...

#[待檢測(cè)]
cutadapt -a ADAPTER_FWD -A ADAPTER_REV --pair-filter=both -o sample1_cut_1.fq -p sample1_cut_2.fq sample1_1.fq sample1_2.fq
#注意更改INDEX序列，adapter方向

03-3 去除低質(zhì)量堿基，過濾低質(zhì)量reads

fastx_toolkit：http://hannonlab.cshl.edu/fastx_toolkit/commandline.html

03-4 再次質(zhì)量評(píng)估

fastqc, multiqc

最后得到clean Data，用于后續(xù)分析，如SNV calling

對(duì)germline變異，現(xiàn)在最常用的是GATK的best practice流程。如果是癌癥樣本，GATK有對(duì)應(yīng)的流程，也有其他的軟件，如Varscan等，不在本文討論范圍內(nèi)。

04 SNV calling

04-1 數(shù)據(jù)預(yù)處理: map, sort, mark duplicates, base recalibration

給參考基因組建立索引

#create index
samtools faidx hg19.fa
#creat dictionary, required by many processing and analysis tools.
java -jar $picard CreateSequenceDictionary\
R=hg19.fa\
O=hg19.dict  #該文件可以看到每條染色體的命名，后面QBSR時(shí)指定染色體需用到

比對(duì)到參考基因組,生成SAM文件

bwa index -a bwtsw hg19.fa  #*.fa >2G,用-a bwtsw;*.fa < 2G，用-a is (默認(rèn)）
time bwa mem -M -t 10 -R '@RG\tID:HKV2KALXX.7\tSM:sample1\tPL:illumina\tLB:sample1' 
 $DB/hg19.fa sample1_1.fq sample1_2.fq >aligned_sample1.sam
#-M :Mark shorter split hits as secondary (essential for Picard compatibility)
#-t: number of threads
#-R:定義頭文件,如果在此步驟不進(jìn)行頭文件定義，在后續(xù)GATK分析中還是需要重新增加頭文件。具體信息可從樣本的fq文件中獲得。
#@RG: Read Group，必須要有，否則GATK無法進(jìn)行calling；比對(duì)速度比不加@RG更快  
##ID：Read group identifier, flowcell + lane name and number in Illunima data   >>>> ID:FLOWCELL1.LANE1(每個(gè)flowcell的每個(gè)lane是unique的), EX. HKV2KALXX.7
#PL: platform>>> ILLUMINA
#SM: Sample >>>sample1
#LB: DNA preparation library identifier. MarkDuplicates uses the LB to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.  
#PU:Platform Unit  >>>HKV2KALXX.2  #太復(fù)雜，非必須。

轉(zhuǎn)換成BAM文件，sort, mark duplicates

java -jar $picard SortSam \ 
    INPUT=aligned_sample1.sam \ 
    OUTPUT=sorted_sample1.bam \ 
    SORT_ORDER=coordinate

java -jar $picard MarkDuplicates \ 
    INPUT=sorted_sample1.bam \ 
    OUTPUT=dedup_sample1.bam \
    METRICS_FILE=metrics_sample1.txt
#just mark, not remove. most tools in GATK will ingore the marked reads；If desired, duplicates can be removed using the REMOVE_DUPLICATE and REMOVE_SEQUENCING_DUPLICATES options.

java -jar $picard BuildBamIndex \ 
    INPUT=dedup_sample1.bam 

#查看@RG
samtools view -H dedup_sample1.bam | grep '@RG'

如果同一樣本（同一文庫）多次上樣測(cè)序，需要將多次測(cè)序結(jié)果合并后，再進(jìn)行sort,mark。

#merge bam files(利用picard可以保留@RG; samtools合并時(shí)會(huì)丟掉@RG）
time java -jar $picard MergeSamFiles\ 
I=1.bam I=2.bam\
O=12.bam

Recalibrate base quality scores = run BQSR

校正測(cè)序錯(cuò)誤和部分實(shí)驗(yàn)所致誤差

#analyze patterns of covariation in the dataset
java -jar $GATK \ 
     -T BaseRecalibrator \ 
     -R $DB/hg19.fa \ 
     -I dedup_sample1.bam  \ 
     -L chr22 \ # restrict analysis to only chromosome 22,名稱參考前面hg19.dict文件
     -knownSites $DB/dbsnp_138.hg19.vcf \ 
     -knownSites $DB/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \ 
     -knownSites $DB/1000G_phase1.indels.hg19.sites.vcf \
     -o recal_data_sample1.table 

#a second pass to analyze covariation remaining after recalibartion
java -jar $GATK \
     -T BaseRecalibrator \
     -R $DB/hg19.fa \
     -I dedup_sample1.bam \
     -L chr22 \  
     -knownSites $DB/dbsnp_138.hg19.vcf \
     -knownSites $DB/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
     -knownSites $DB/1000G_phase1.indels.hg19.sites.vcf \
     -BQSR recal_data_sample1.table
     -o post_recal_data_sample1.table

#generate before/after plots
java -jar $GATK \
     -T AnalyzeCovariates \
     -R $DB/hg19.fa \
     -L chr22 \  
     -before recal_data_sample1.table \
     -after post_recal_data_sample1.table \
     -plots recalibration_sample1_plots.pdf

#Apply the recalibration to your sequence data
java -jar $GATK \
     -T PrintReads \
     -R $DB/hg19.fa \
     -I dedup_sample1.bam  \
     -L chr22 \
     -BQSR recal_data_sample1.table  \
     -o recal_sample1.bam

04-2 檢測(cè)SNV: calling, consolidate, filter(VQSR/hard filter)

Note: 2013以前HaplotypeCaller還沒完全開發(fā)好，用UnifiedGenotyper較多；目前作者推薦使用HaplotypeCaller，在indel檢測(cè)上更準(zhǔn)確。

Call variants per-sample with `HaplotypeCaller` (in GVCF mode)

java -jar $GATK \ 
     -T HaplotypeCaller \ 
     -R $DB/hg19.fa \ 
     -I dedup_sample1.bam \  
     -L targets.interval_list \ #使用exome數(shù)據(jù)時(shí)，可用該參數(shù)指定文件（exome targets）；也可以直接指定某染色體，如chr22
     --genotyping_mode DISCOVERY \ #默認(rèn)參數(shù)；GENOTYPE_GIVEN_ALLELES，僅使用給定文件中的allele
     -stand_call_conf 30 \ #minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called; 低于該值則標(biāo)記為“LowQual"
     -ERC GVCF \
     -o sample1.raw_variants.g.vcf

Note : targets.interval_list 格式不同的格式對(duì)應(yīng)不同后綴名
https://gatkforums.broadinstitute.org/gatk/discussion/1319/collected-faqs-about-interval-lists

Picard-style .interval_list
必須有SAM-like header;
<chr> <start> <stop> <target_name> with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0).

@HD     VN:1.0  SO:coordinate
@SQ     SN:1    LN:249250621    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1b22b98cdeb4a9304cb5d48026a85128     SP:Homo Sapiens
@SQ     SN:2    LN:243199373    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:a0d9851da00400dec1098a9255ac712e     SP:Homo Sapiens
1       30366   30503   +       target_1
1       69089   70010   +       target_2
1       367657  368599  +       target_3
1       621094  622036  +       target_4

GATK-style .list
<chr>:<start>-<stop> no sequence dictionary is necessary
BED files with extension .bed
<chr> <start> <stop> with fields separated by tabs. However, you should be aware that this file format is 0-based for the start coordinates
VCF files

Consolidate GVCFs

Joint-Call Cohort：GenomicsDBImport GenotypeGVCFs**
合并多個(gè)樣本的g.vcf文件，用于群體的Genotype;有這一步比分開做genotype更準(zhǔn)確。如果有新增樣本，可以在這一步加入。

#Consolidate GVCFs with `GenomicsDBImport` or `CombineGVCFs`**
#GenomicsDBImport：新方法，速度快，但目前一次只能處理一條染色體
gatk GenomicsDBImport \
    -V gvcfs/sample1.g.vcf \
    -V gvcfs/sample2.g.vcf \
    -V gvcfs/sample3.g.vcf \  #樣本較多，可以用-V gvcfs/*.g.vcf
    --genomicsdb-workspace-path my_database \  #自動(dòng)生成存放結(jié)果的文件夾
    --intervals chr22  #目前一次只能處理一條染色體，或一個(gè)interval （2018）；可以用腳本分別操作后,再用CombineVariants合并。
#joint genotyping
java -jar $GATK \
     -T GenotypeGVCFs \
     -R $DB/hg19.fa \
     -V gendb://my_database \
     -G StandardAnnotation -newQual \
     -O raw_variants.vcf 

#CombineGVCFs：舊方法，速度慢，但是可以一次全部合并（合并不同樣本的文件）
 java -jar $GATK \
      -T CombineGVCFs \
      -R $DB/hg19.fa \
      -V sample1.g.vcf \
      -V sample2.g.vcf \ 
      -o combined.g.vcf
#樣本較多時(shí)，按如下操作
find gvcf/ -name "*.g.vcf" >input.list
nohup  java -Xmx20g -jar $GATK -T CombineGVCFs -R $DB/hg19.fa --variant  input.list -o combined.g.vcf &  #Xmx20g 使用20G內(nèi)存

#joint genotyping
java -jar $GATK \
     -T GenotypeGVCFs \
     -R $DB/hg19.fa \
     -V combined.g.vcf \
     -G StandardAnnotation -newQual \
     -o raw_variants.vcf

NOTE："." in the filter field is not bad (it means that no filtering was applied).
FILTER field:
PASS if the variant passed all filters.
. no filtering has been applied to the records.
It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See our documentation on filtering variants for more information on this topic.

Filter Variants by Variant (Quality Score) Recalibration:`VariantRecalibrator` `ApplyRecalibration`

GATK檢測(cè)變異較為lenient，以確保不會(huì)遺漏真實(shí)的變異，但因此也會(huì)有假陽性高的問題，因此還需要進(jìn)一步的過濾。過濾方法有兩種：

VQSR：要求樣本量大（至少30個(gè)WES或1個(gè)WGS、有足夠的已有變異信息(比如人類或模式生物）；否則請(qǐng)用hard filtering
hard filtering: 利用vcf中INFO中的參數(shù)進(jìn)行過濾

VQSR: recalibrate variant quality scores and produce a callset filtered for the desired levels of sensitivity and specificity.(利用HapMap,OMIM,1000G,dbSNP數(shù)據(jù)對(duì)SNP進(jìn)行校正；利用Mills, dbSNP對(duì)indel進(jìn)行校正）

hard filtering 公司常用的過濾標(biāo)準(zhǔn)：

GATK_SNP_Filter=--filterExpression 'DP<10' --filterName LOW_DP --filterExpression 'QD<2.0' --filterName LOW_QD --filterExpression 'FS>60.0' --filterName HIGH_FS --filterExpression 'SOR>4.0' --filterName HIGH_SOR --filterExpression 'MQ<40.0' --filterName LOW_MQ --filterExpression 'MQRankSum<-12.5'  --filterName LOW_MQRS --filterExpression 'ReadPosRankSum<-8.0' --filterName LOW_RPRS -disable_auto_index_creation_and_locking_when_reading_rods

GATK_INDEL_Filter=--filterExpression 'DP<10' --filterName LOW_DP --filterExpression 'QD<2.0' --filterName LOW_QD --filterExpression 'FS>200.0' --filterName HIGH_FS --filterExpression 'SOR>10.0' --filterName HIGH_SOR --filterExpression 'InbreedingCoeff<-0.8' --filterName LOW_INC --filterExpression 'ReadPosRankSum<-20.0' --filterName LOW_RPRS -disable_auto_index_creation_and_locking_when_reading_rods

下面介紹VQSR

下載數(shù)據(jù)：hapmap_3.3.hg19.vcf 1000G_omni2.5.hg19.vcf 1000G_phase1.snps.high_confidence.hg19.vcf dbsnp_138.hg19.vcf, Mills_and_1000G_gold_standard.indels.hg19.vcf
FileZile host: ftp.broadinstitute.org user:gsapubftp-anonymous （md5sum filename，確認(rèn)文件完整性(md5））

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.idx.gz

recalibrate SNP

#Build the SNP recalibration model
java -jar $GATK \ 
    -T VariantRecalibrator \ 
    -R $DB/hg19.fa \ 
    -input raw_variants.vcf \ 
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \ 
    -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \ 
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \ 
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \ 
    -an DP \  #僅適用于WGS，不適用于WES（去掉即可）
    -an QD \ 
    -an FS \ 
    -an SOR \ 
    -an MQ \
    -an MQRankSum \ 
    -an ReadPosRankSum \ 
    -an InbreedingCoeff \
    -mode SNP \ 
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ 
    -recalFile recalibrate_SNP.recal \  #結(jié)果文件，校正后的數(shù)據(jù)，用于下一步
    -tranchesFile recalibrate_SNP.tranches \ #記錄命令中使用的quality score thresholds
    -rscriptFile recalibrate_SNP_plots.R  #提前安裝好R相應(yīng)的包，才能出圖

#Apply the desired level of recalibration to the SNPs in the call set
java -jar $GATK \ 
    -T ApplyRecalibration \ 
    -R $DB/hg19.fa \ 
    -input raw_variants.vcf \ 
    -mode SNP \ 
    --ts_filter_level 99.0 \  #recalibrated quality scores (VQSLOD)；值越低，假陽性越低，越specific,同時(shí)丟掉真實(shí)位點(diǎn)的風(fēng)險(xiǎn)升高 （GATK推薦使用99.9%，provides the highest sensitivity you can get while still being acceptably specific.）這個(gè)參數(shù)依據(jù)自己需求可調(diào)，可以多次運(yùn)行之后自己覺得
    -recalFile recalibrate_SNP.recal \ 
    -tranchesFile recalibrate_SNP.tranches \ 
    -o recalibrated_snps_raw_indels.vcf  #結(jié)果文件，包含所有原始數(shù)據(jù)，其中SNP多了一列recalibrated quality scores (VQSLOD)，并根據(jù)是否通過過濾標(biāo)注了PASS或FILTER

recalibrate INDEL

#Build the Indel recalibration model
 java -jar $GATK \ 
    -T VariantRecalibrator \ 
    -R $DB/hg19.fa \ 
    -input raw_variants.vcf \ 
    -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf  \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
    -an QD \ 
    -an DP \ #僅適用于WGS，不適用于WES(去掉即可）
    -an FS \ 
    -an SOR \ 
    -an MQRankSum \ 
    -an ReadPosRankSum \ 
    -an InbreedingCoeff
    -mode INDEL \ 
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ 
    --maxGaussians 4 \ #clusters of variants that have similar properties，WES樣本數(shù)低于30時(shí)使用；樣本數(shù)足夠時(shí)，先去掉這個(gè)參數(shù)跑，再加上跑，然后自己選擇是否要加這個(gè)參數(shù)
    -recalFile recalibrate_INDEL.recal \  #結(jié)果文件，contains the recalibration data
    -tranchesFile recalibrate_INDEL.tranches \ 
    -rscriptFile recalibrate_INDEL_plots.R 

#Apply the desired level of recalibration to the Indels in the call set
java -jar $GATK \ 
     -T ApplyRecalibration \ 
     -R $DB/hg19.fa \ 
     -input recalibrated_snps_raw_indels.vcf \ 
     -mode INDEL \ 
     --ts_filter_level 99.0 \ #Best Practice推薦99.9%
     -recalFile recalibrate_INDEL.recal \ 
     -tranchesFile recalibrate_INDEL.tranches \ 
     -o recalibrated_variants.vcf #結(jié)果文件

VariantRecalibration之前，如果缺乏相應(yīng)的參數(shù)，如QD,FS,DP etc.先用VariantAnnotator獲得相應(yīng)的參數(shù)值。

#annotate VCF
 java -jar $GATK \
      -T VariantAnnotator \
      -R $DB/hg19.fa \
      -I input.bam \
      -V raw_variants.vcf \
      -o annotate_raw_variants.vcf \ #輸出文件，如果遇到不符合注釋的位點(diǎn)，就默默地略過了，可以檢查header，確認(rèn)做了哪些注釋
      --useAllAnnotations \
      --dbsnp dbsnp.vcf

05 SNV初步統(tǒng)計(jì)分析

VQSR重在檢測(cè)SNV，下一步需要使用其他信息提高genotype準(zhǔn)確性，過濾低質(zhì)量genotype，進(jìn)一步精煉SNV。該流程尤其適用于以下研究：關(guān)注變異的拷貝數(shù)，家系中變異的de novo origin（https://software.broadinstitute.org/gatk/documentation/article?id=11074）。

05-1從VCF中提取子集：`SelectVariants`

SNP：SNPs with call rates <95% , MAF <1% or significant deviation from Hardy–Weinberg equilibrium (HWE) in controls (HWE P ≤ 1 × 10?6) were excluded.

#提取子集：SNP, biallelic, call rate>95%, filtered samples<10% 
java -jar $GATK \
      -T SelectVariants \
      -R $DB/hg19.fa \
      -ef \ #--excludeFiltered 去除被過濾的位點(diǎn)（在FILTER中，標(biāo)注不是.或PASS）
      -env \ #--excludeNonVariants 去除非變異位點(diǎn)
      -noTrim \ #保留原始的allele
      -selectType SNP \ #INDEL,SNP,MIXED等,提取某類變異
      -restrictAllelesTo BIALLELIC \ #ALL,BIALLELIC,MULTIALLELIC，提取某類變異
      --maxNOCALLfraction 0.05\ #Maximum fraction of samples with no-call genotypes
      --maxFractionFilteredGenotypes 0.1 \ #最多有多少比例的個(gè)體被過濾掉了
      -V recalibrated_variants.vcf \
      -o SNP_bia.vcf 

#提取子集：SNP，MULTIALLELIC， call rate>95%, filtered samples<10%
java -jar $GATK \
      -T SelectVariants \
      -R $DB/hg19.fa \
      -ef \ 
      -env \ 
      -noTrim \ 
      -selectType SNP \ 
      -selectType MNP \
      -restrictAllelesTo MULTIALLELIC \ 
      -V recalibrated_variants.vcf \
      -o SNP_multi.vcf 

#提取子集：INDEL
java -jar $GATK \
      -T SelectVariants \
      -R $DB/hg19.fa \
      -ef \ 
      -env \ 
      -noTrim \ 
      -selectType INDEL \ 
      -V recalibrated_variants.vcf \
      -o indel.vcf

05-2 精煉：`CalculateGenotypePosteriors` `VariantFiltration` `VariantAnnotator` `Funcotator (experimental)`

該流程僅適用于biallelic SNP，不適用于INDEL

#Step 1: Derive posterior probabilities of genotypes
#FORMAT列會(huì)增加“PP”，同時(shí)"GQ"已更新（基于PP進(jìn)行計(jì)算，方法同之前基于PL計(jì)算的GQ）
 java -jar $GATK \
      -T CalculateGenotypePosteriors \
      -R $DB/hg19.fa \
      -supporting 1000G_phase1.snps.high_confidence.hg19.sites.vcf \  #必須要有
      -V SNP_bia.vcf \
      -o SNP_bia.withPosteriors.vcf

#報(bào)錯(cuò)：Variant does not contain the same number of MLE allele counts as alternate alleles for record at chr1:94496602 說明supporting文件里在該位點(diǎn)是mulitallele,所以先提取出supporting文件中的biallelic位點(diǎn)（新文件用作supporting文件，再跑一次CalculateGenotypePosteriors）
java -jar $GATK \
      -T SelectVariants \
      -R $DB/hg19.fa  \
      -ef -env -noTrim  \
      -selectType SNP \
      -restrictAllelesTo BIALLELIC \
      -V 1000G_phase1.snps.high_confidence.hg19.sites.vcf \
      -o 1000G_phase1.snps.high_confidence.hg19.sites.bia.vcf  #用作supporting文件

#Step 2: Filter low quality genotypes
#genotypes with GQ < 20 based on the posteriors are filtered out.GQ20 is widely accepted as a good threshold for genotype accuracy, indicating that there is a 99% chance that the genotype in question is correct.
 java -jar $GATK \
      -T VariantFiltration \
      -R $DB/hg19.fa \
      -G_filter "GQ < 20.0" \  #用FORMAT FILED中的變量過濾
      -G_filterName lowGQ \
      --setFilteredGtToNocall \ #被過濾的位點(diǎn)，將genotype改為./.(相當(dāng)于missing data
      -V SNP_bia.withPosteriors.vcf \
      -o SNP_bia.withPosteriors_GQ20.vcf 

#Step 3: Annotate possible de novo mutations
#Tool used: VariantAnnotator 針對(duì)Family

#Step 4: Functional annotation of possible biological effects
#尚處開發(fā)階段，類似SnpEff or Oncotator

05-3 結(jié)果評(píng)估

對(duì)結(jié)果進(jìn)行評(píng)估，判斷其是否符合一般規(guī)律。https://software.broadinstitute.org/gatk/documentation/article?id=11071

Sanger sequencing
多個(gè)芯片結(jié)果的一致性
與truth set比較，評(píng)估一致性，有兩個(gè)假設(shè)：truth set已經(jīng)被驗(yàn)證過，可以看做是gold-standard; 你的樣本和truth set來源的樣本要有相似的遺傳組成（比如都是東亞人群）

Sequencing Type # of Variants* TiTv Ratio
WGS ~4.4M 2.0-2.1
WES ~41k 3.0-3.3
*for a single sample

Variants數(shù)目：具體數(shù)值受測(cè)序類型、樣本量、過濾方式等影響，如果數(shù)量級(jí)差得夸張，就說明有問題啦。
TiTv Ratio： ratio of transition (Ti) to transversion (Tv) SNPs。理論上，如果是隨機(jī)突變的話，比例是0.5.但是，生物學(xué)上經(jīng)常會(huì)發(fā)生甲基化，C變成T，因此Ti突變更多。而且，CpG island經(jīng)常出現(xiàn)在primer取悅，C甲基化較多，因此WES數(shù)據(jù)更偏向Ti,比例一般在3.0-3.3。

這里就可以自己統(tǒng)計(jì)啦~VCF文件的處理推薦BCFtools,VCFtools，或者自己動(dòng)手豐衣食足。

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

從fastq數(shù)據(jù)到SNV | GATK

從fastq數(shù)據(jù)到SNV | GATK

00 寫在前面

01 軟件準(zhǔn)備

02 數(shù)據(jù)庫準(zhǔn)備

03 測(cè)序數(shù)據(jù)質(zhì)量評(píng)估及質(zhì)控

03-1 質(zhì)量評(píng)估Fastqc, multiqc

03-2 去除INDEX和接頭

03-3 去除低質(zhì)量堿基，過濾低質(zhì)量reads

03-4 再次質(zhì)量評(píng)估

04 SNV calling

04-1 數(shù)據(jù)預(yù)處理: map, sort, mark duplicates, base recalibration

給參考基因組建立索引

比對(duì)到參考基因組,生成SAM文件

轉(zhuǎn)換成BAM文件，sort, mark duplicates

Recalibrate base quality scores = run BQSR

04-2 檢測(cè)SNV: calling, consolidate, filter(VQSR/hard filter)

Call variants per-sample with `HaplotypeCaller` (in GVCF mode)

Consolidate GVCFs

Filter Variants by Variant (Quality Score) Recalibration:`VariantRecalibrator` `ApplyRecalibration`

05 SNV初步統(tǒng)計(jì)分析

05-1從VCF中提取子集：`SelectVariants`

05-2 精煉：`CalculateGenotypePosteriors` `VariantFiltration` `VariantAnnotator` `Funcotator (experimental)`

05-3 結(jié)果評(píng)估

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九 欧美,1769亚洲,黄色成人av

從fastq數(shù)據(jù)到SNV | GATK

00 寫在前面

01 軟件準(zhǔn)備

02 數(shù)據(jù)庫準(zhǔn)備

03 測(cè)序數(shù)據(jù)質(zhì)量評(píng)估及質(zhì)控

03-1 質(zhì)量評(píng)估Fastqc, multiqc

03-2 去除INDEX和接頭

03-3 去除低質(zhì)量堿基，過濾低質(zhì)量reads

03-4 再次質(zhì)量評(píng)估

04 SNV calling

04-1 數(shù)據(jù)預(yù)處理: map, sort, mark duplicates, base recalibration

給參考基因組建立索引

比對(duì)到參考基因組,生成SAM文件

轉(zhuǎn)換成BAM文件，sort, mark duplicates

Recalibrate base quality scores = run BQSR

04-2 檢測(cè)SNV: calling, consolidate, filter(VQSR/hard filter)

Call variants per-sample with HaplotypeCaller (in GVCF mode)

Consolidate GVCFs

Filter Variants by Variant (Quality Score) Recalibration:VariantRecalibrator ApplyRecalibration

05 SNV初步統(tǒng)計(jì)分析

05-1從VCF中提取子集：SelectVariants

05-2 精煉：CalculateGenotypePosteriors VariantFiltration VariantAnnotator Funcotator (experimental)

05-3 結(jié)果評(píng)估

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

轉(zhuǎn)換成BAM文件，sort, mark duplicates

Call variants per-sample with `HaplotypeCaller` (in GVCF mode)

Filter Variants by Variant (Quality Score) Recalibration:`VariantRecalibrator` `ApplyRecalibration`

05-1從VCF中提取子集：`SelectVariants`

05-2 精煉：`CalculateGenotypePosteriors` `VariantFiltration` `VariantAnnotator` `Funcotator (experimental)`