實(shí)操一個(gè)WGS項(xiàng)目,從原始測(cè)序數(shù)據(jù)到比對(duì),再到call變異,以及vcf文件的質(zhì)控。
軟件準(zhǔn)備
- samtools
- sratoolskit
- bcftools
- htslib(bcftools源碼下載后目錄下會(huì)自帶對(duì)應(yīng)版本的htslib)
- bgzip/tabix(安裝htslib后即可使用這兩個(gè)工具)
- bwa
- GATK4(自帶Picard)
實(shí)操
1. 參考基因組準(zhǔn)備
# 下載
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
# 解壓重命令
gzip -dc GCF_000005845.2_ASM584v2_genomic.fna.gz >E.coli_K12_MG1655.fa
# 簡(jiǎn)歷索引
samtools faidx E.coli_K12_MG1655.fa
# samtools示例:快速查看基因組中特定序列位置的堿基
samtools faidx E.coli_K12_MG1655.fa NC_000913.3:1000000-1000200
2. 測(cè)序raw reads準(zhǔn)備
# 下載數(shù)據(jù),fastq-dump是sratoolskit中的工具
fastq-dump --split-files SRR1770413
#壓縮數(shù)據(jù),減少數(shù)據(jù)量
bgzip -f SRR1770413_1.fastq
bgzip -f SRR1770413_2.fastq
3. 比對(duì)
#建立索引
bwa index E.coli_K12_MG1655.fa
#1 比對(duì)
bwa mem -t 4 -R '@RG\tID:foo\tPL:illumina\tSM:E.coli_K12' ../input/E.coli_K12_MG1655.fa ../input/SRR1770413_1.fastq.gz ../input/SRR1770413_2.fastq.gz | samtools view -Sb - > ../output/E_coli_K12.bam && echo "** bwa mapping done **"
#2 排序
samtools sort -@ 4 -m 4G -O bam -o ../output/E_coli_K12.sorted.bam ../output/E_coli_K12.bam && echo "** BAM sort done"
rm -f ../output/E_coli_K12.bam
#3 標(biāo)記PCR重復(fù)
gatk MarkDuplicates -I ../output/E_coli_K12.sorted.bam -O ../output/E_coli_K12.sorted.markdup.bam -M ../output/E_coli_K12.sorted.markdup_metrics.txt && echo "** markdup done **"
#4 刪除不必要文件(可選)
rm -f ../output/E_coli_K12.bam
rm -f ../output//E_coli_K12.sorted.bam
#5 創(chuàng)建比對(duì)索引文件
samtools index ../output/E_coli_K12.sorted.markdup.bam && echo "** index done **"
4.變異檢測(cè)
# 為參考基因組構(gòu)建dict,前Picard的功能
gatk CreateSequenceDictionary -R E.coli_K12_MG1655.fa -O E.coli_K12_MG1655.dict && echo "** dict done **"
#1 生成中間文件gvcf
gatk HaplotypeCaller \
-R ../input/E.coli_K12_MG1655.fa \
--emit-ref-confidence GVCF \
-I ../output/E_coli_K12.sorted.markdup.bam \
-O ../output/E_coli_K12.g.vcf && echo "** gvcf done **"
#2 通過(guò)gvcf檢測(cè)變異
gatk GenotypeGVCFs \
-R ../input/E.coli_K12_MG1655.fa \
-V ../output/E_coli_K12.g.vcf \
-O ../output/E_coli_K12.vcf && echo "** vcf done **"
5.vcf預(yù)處理
#1 壓縮
bgzip -f ../output/E_coli_K12.vcf
#2 構(gòu)建tabix索引
tabix -p vcf ../output/E_coli_K12.vcf.gz
6.變異過(guò)濾
# 使用SelectVariants,選出SNP
time /hwfssz4/BC_PUB/pipeline/DNA/DNA_Human_WGS/DNA_Human_WGS_2018a/bin//gatk SelectVariants \
-select-type SNP \
-V ../output/E_coli_K12.vcf.gz \
-O ../output/E_coli_K12.snp.vcf.gz
# 為SNP作硬過(guò)濾
time /hwfssz4/BC_PUB/pipeline/DNA/DNA_Human_WGS/DNA_Human_WGS_2018a/bin//gatk VariantFiltration \
-V ../output/E_coli_K12.snp.vcf.gz \
--filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filter-name "Filter" \
-O ../output/E_coli_K12.snp.filter.vcf.gz
# 使用SelectVariants,選出Indel
time /hwfssz4/BC_PUB/pipeline/DNA/DNA_Human_WGS/DNA_Human_WGS_2018a/bin//gatk SelectVariants \
-select-type INDEL \
-V ../output/E_coli_K12.vcf.gz \
-O ../output/E_coli_K12.indel.vcf.gz
# 為Indel作過(guò)濾
time /hwfssz4/BC_PUB/pipeline/DNA/DNA_Human_WGS/DNA_Human_WGS_2018a/bin//gatk VariantFiltration \
-V ../output/E_coli_K12.indel.vcf.gz \
--filter-expression "QD < 2.0 || FS > 200.0 || SOR > 10.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filter-name "Filter" \
-O ../output/E_coli_K12.indel.filter.vcf.gz
# 重新合并過(guò)濾后的SNP和Indel
time /hwfssz4/BC_PUB/pipeline/DNA/DNA_Human_WGS/DNA_Human_WGS_2018a/bin//gatk MergeVcfs \
-I ../output/E_coli_K12.snp.filter.vcf.gz \
-I ../output/E_coli_K12.indel.filter.vcf.gz \
-O ../output/E_coli_K12.filter.vcf.gz
# 刪除無(wú)用中間文件
rm -f ../output/E_coli_K12.snp.vcf.gz* ../output/E_coli_K12.snp.filter.vcf.gz* ../output/E_coli_K12.indel.vcf.gz* ../output/E_coli_K12.indel.filter.vcf.gz*
結(jié)果
輸出文件:

image.png
過(guò)濾后的vcf文件:

image.png
代碼實(shí)現(xiàn)只是一個(gè)簡(jiǎn)單過(guò)程,其中每一步的原理、軟件使用參數(shù)及結(jié)果的理解才是重點(diǎn)。參考:
GATK4.0和全基因組數(shù)據(jù)分析實(shí)踐(上)
GATK4.0和全基因組數(shù)據(jù)分析實(shí)踐(下)