學(xué)習(xí)WES數(shù)據(jù)分析流程

參考學(xué)習(xí)資料:
筆記
視頻教程
知識(shí)庫(kù)

首先安裝軟件

先安裝conda,使用清華的conda,說(shuō)明書(shū):https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/
然后下載安裝miniconda,位置:https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/
在國(guó)內(nèi)需要更改鏡像配置(我租用了一臺(tái)美國(guó)HPC超算中心的服務(wù)器,不用設(shè)置鏡像)
下載安裝軟件之前先搜索是否存在 https://bioconda.github.io/recipes.html
更改鏡像源配置參考賣(mài)萌哥持續(xù)更新的conda教程

然后就可以根據(jù)流程來(lái)使用conda安裝一系列軟件

首先搭建分析環(huán)境

為了做好目錄整理,可以先創(chuàng)建文件夾,以存放各種軟件包、數(shù)據(jù)庫(kù)文件,以及分析過(guò)程中產(chǎn)生的結(jié)果

## 首先在用戶的主目錄下創(chuàng)建 wes_cancer 文件夾作為工作目錄
mkdir ~/wes_cancer
cd ~/wes_cancer
## 在 ~/wes_cancer 中創(chuàng)建 biosoft project data 三個(gè)文件夾
## biosoft 存放軟件安裝包
## project 存放分析過(guò)程產(chǎn)生的文件
## data 存放數(shù)據(jù)庫(kù)文件
mkdir biosoft project data
cd project
## 在 project 文件夾中創(chuàng)建若干個(gè)文件夾,分別存放每一步產(chǎn)生的文件
mkdir -p 0.sra 1.raw_fq 2.clean_fq 3.qc/{raw_qc,clean_qc} 4.align/{qualimap,flagstat,stats} 5.gatk/gvcf 6.mutect 7.annotation/{vep,annovar,funcotator,snpeff} 8.cnv/{gatk,cnvkit,gistic,facet} 9.pyclone 10.signature

2個(gè)重要的軟件安裝相關(guān)問(wèn)題,參考之前的學(xué)習(xí)筆記GATK下載安裝相關(guān),ANNOVAR | 注釋

熟悉參考基因組及必備數(shù)據(jù)庫(kù)

gatk_hg38系列由BWA生成,構(gòu)建索引需要下載很多文件,費(fèi)時(shí)較久:
/bundle/hg38/索引

cd ~/wes_cancer/data/
wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz

gatk需要用到的其他文件

因?yàn)楸容^多,可以使用nohup...&的形式將下載的命令都提交到后臺(tái)(如果網(wǎng)絡(luò)不好,可能會(huì)下載失敗,下載后使用前需要檢查)

## gatk
nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz & 
nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz.tbi & 
nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz & 
nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi & 
nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz & 
nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.fai & 
nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.dict & 
nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz & 
nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi & 
nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/funcotator/funcotator_dataSources.v1.6.20190124s.tar.gz &

其它數(shù)據(jù)庫(kù)文件:

## bed
wget ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/current_human/CCDS.current.txt
cat CCDS.current.txt | grep  "Public" | perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i |awk '{if($3>$2) print "chr"$0"\t0\t+"}'  > hg38.exon.bed

構(gòu)建索引使用的命令:

conda activate wes
cd ~/wes_cancer/data
gunzip Homo_sapiens_assembly38.fasta.gz
time bwa index -a bwtsw -p gatk_hg38 ~/wes_cancer/data/Homo_sapiens_assembly38.fasta
cd ~/wes_cancer/project

構(gòu)建索引過(guò)程如下:

(base) root@1100150:~# conda activate wes
(wes) root@1100150:~# cd ~/wes_cancer/data
(wes) root@1100150:~/wes_cancer/data# time bwa index -a bwtsw -p gatk_hg38 ~/wes_cancer/data/Homo_sapiens_assembly38.fasta
[bwa_index] Pack FASTA... 52.77 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=6434693834, availableWord=464768632
[BWTIncConstructFromPacked] 10 iterations done. 99999994 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 199999994 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 299999994 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 399999994 characters processed.
...
[bwt_gen] Finished constructing BWT in 711 iterations.
[bwa_index] 5822.96 seconds elapse.
[bwa_index] Update BWT... 33.52 sec
[bwa_index] Pack forward-only FASTA... 39.86 sec
[bwa_index] Construct SA from BWT and Occ... 1753.57 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index -a bwtsw -p gatk_hg38 /root/wes_cancer/data/Homo_sapiens_assembly38.fasta
[main] Real time: 7785.516 sec; CPU: 7702.678 sec

real    129m45.519s
user    126m56.258s
sys 1m26.421s

完成后的索引文件目錄(大約100G的數(shù)據(jù)庫(kù)文件)
由于單位網(wǎng)絡(luò)屏蔽了遠(yuǎn)程服務(wù)器的IP,暫時(shí)沒(méi)能鏈接上,用了曾老師的教程內(nèi)容

/public/biosoft/GATK/resources/bundle/hg38/bwa_index/
|-- [ 20K]  gatk_hg38.amb
|-- [445K]  gatk_hg38.ann
|-- [3.0G]  gatk_hg38.bwt
|-- [767M]  gatk_hg38.pac
|-- [1.5G]  gatk_hg38.sa
|-- [6.2K]  hg38.bwa_index.log 
`-- [ 566]  run.sh

/public/biosoft/GATK/resources/bundle/hg38/
|-- [1.8G]  1000G_phase1.snps.high_confidence.hg38.vcf.gz
|-- [2.0M]  1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
|-- [3.2G]  dbsnp_146.hg38.vcf.gz
|-- [3.0M]  dbsnp_146.hg38.vcf.gz.tbi
|-- [ 59M]  hapmap_3.3.hg38.vcf.gz
|-- [1.5M]  hapmap_3.3.hg38.vcf.gz.tbi
|-- [568K]  Homo_sapiens_assembly38.dict
|-- [3.0G]  Homo_sapiens_assembly38.fasta
|-- [157K]  Homo_sapiens_assembly38.fasta.fai
|-- [ 20M]  Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
|-- [1.4M]  Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi

通過(guò)流量連接上服務(wù)器后查看自己下載的索引及數(shù)據(jù)庫(kù)文件(沒(méi)有分開(kāi)為2個(gè)目錄,且少了hapmap_3.3.hg38.vcf.gz,可能會(huì)對(duì)后續(xù)分析造成影響):

(base) root@1100150:~# cd wes_cancer/data/
(base) root@1100150:~/wes_cancer/data# ll
total 26715496
drwxr-xr-x 2 root root          20 Jun 19 02:15 ./
drwxr-xr-x 5 root root           5 Jun 18 08:37 ../
-rw-r--r-- 1 root root  1888262073 Jun 18 12:28 1000G_phase1.snps.high_confidence.hg38.vcf.gz
-rw-r--r-- 1 root root     2128536 Jun 18 12:25 1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
-rw-r--r-- 1 root root     9839781 Jun 18 12:26 CCDS.current.txt
-rw-r--r-- 1 root root      581712 Jun 18 12:25 Homo_sapiens_assembly38.dict
-rw-r--r-- 1 root root  3249912778 Jun 18 12:25 Homo_sapiens_assembly38.fasta
-rw-r--r-- 1 root root      160928 Jun 18 12:25 Homo_sapiens_assembly38.fasta.fai
-rw-r--r-- 1 root root    20685880 Jun 18 12:25 Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
-rw-r--r-- 1 root root     1500013 Jun 18 12:25 Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
-rw-r--r-- 1 root root  3411143311 Jun 18 12:29 dbsnp_146.hg38.vcf.gz
-rw-r--r-- 1 root root     2466606 Jun 18 12:25 dbsnp_146.hg38.vcf.gz.tbi
-rw-r--r-- 1 root root 15381038028 Jun 18 12:33 funcotator_dataSources.v1.6.20190124s.tar.gz
-rw-r--r-- 1 root root       20199 Jun 19 01:46 gatk_hg38.amb
-rw-r--r-- 1 root root      455474 Jun 19 01:46 gatk_hg38.ann
-rw-r--r-- 1 root root  3217347004 Jun 19 01:45 gatk_hg38.bwt
-rw-r--r-- 1 root root   804336731 Jun 19 01:46 gatk_hg38.pac
-rw-r--r-- 1 root root  1608673512 Jun 19 02:15 gatk_hg38.sa
-rw-r--r-- 1 root root     6813165 Jun 18 12:27 hg38.exon.bed
-rw------- 1 root root    32237249 Jun 18 12:33 nohup.out

第一步是QC

由于我的示例數(shù)據(jù)是clean.fq這一步就省略了
包括使用fasqc和multiqc兩個(gè)軟件查看測(cè)序質(zhì)量,以及使用trim_galore軟件進(jìn)行過(guò)濾低質(zhì)量reads和去除接頭。

mkdir ~/project/boy
wkd=/home/jmzeng/project/boy
mkdir {raw,clean,qc,align,mutation}
cd qc 
find /public/project/clinical/beijing_boy  -name *gz |grep -v '\._'|xargs fastqc -t 10 -o ./

假設(shè)質(zhì)量很差,就過(guò)濾:

### step3: filter the bad quality reads and remove adaptors. 
mkdir $wkd/clean 
cd $wkd/clean

find /public/project/clinical/beijing_boy  -name *gz |grep -v '\._'|grep 1.fastq.gz > 1
find /public/project/clinical/beijing_boy  -name *gz |grep -v '\._'|grep 2.fastq.gz > 2
paste 1 2  > config
### 打開(kāi)文件 qc.sh ,并且寫(xiě)入內(nèi)容如下: 
source activate wes

bin_trim_galore=trim_galore
dir=$wkd/clean
cat config  |while read id
do
        arr=(${id})
        fq1=${arr[0]}
        fq2=${arr[1]} 
        echo  $dir  $fq1 $fq2 
nohup $bin_trim_galore -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o $dir  $fq1 $fq2 & 
done 

source deactivate

讀質(zhì)量較好的測(cè)序數(shù)據(jù)進(jìn)行比對(duì)

先走測(cè)試數(shù)據(jù)

## 先提取小的fq
source activate wes
find /public/project/clinical/beijing_boy  -name *gz |grep -v '\._' > fq.txt
cat fq.txt |while read id ;do (zcat $id|head -10000 > $(basename $id ".gz"));done
## 然后一個(gè)個(gè)小fq文件比對(duì)
sample='7E5239'
bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  /public/biosoft/GATK/resources/bundle/hg38/bwa_index/gatk_hg38 7E5239.L1_1.fastq 7E5239.L1_2.fastq  | samtools sort -@ 5 -o 7E5239.bam -
sample='7E5240'
bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  /public/biosoft/GATK/resources/bundle/hg38/bwa_index/gatk_hg38  7E5240_L1_A001.L1_1.fastq 7E5240_L1_A001.L1_2.fastq  | samtools sort -@ 5 -o 7E5240.bam -
sample='7E5241'
bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  /public/biosoft/GATK/resources/bundle/hg38/bwa_index/gatk_hg38 7E5241.L1_1.fastq 7E5241.L1_2.fastq   | samtools sort -@ 5 -o 7E5241.bam -
## 或者循環(huán)比對(duì)
# 7E5239    7E5239.L1_1.fastq   7E5239.L1_2.fastq
# 7E5240    7E5240_L1_A001.L1_1.fastq   7E5240_L1_A001.L1_2.fastq
# 7E5241    7E5241.L1_1.fastq   7E5241.L1_2.fastq
INDEX=/public/biosoft/GATK/resources/bundle/hg38/bwa_index/gatk_hg38
cat config |while read id
do

arr=($id)
fq1=${arr[1]}
fq2=${arr[2]}
sample=${arr[0]}
bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina" $INDEX $fq1 $fq2  | samtools sort -@ 5 -o $sample.bam - 

done 

模仿這個(gè)例子完成小bam提取和比對(duì):

(base) root@1100150:~/wes_cancer/project# source activate wes
(wes) root@1100150:~/wes_cancer/project# find ~/wes_cancer/project/2.clean_fq/ -name *gz
/root/wes_cancer/project/2.clean_fq/9Y1640.R1.clean.fastq.gz
/root/wes_cancer/project/2.clean_fq/9Y1640.R2.clean.fastq.gz
(wes) root@1100150:~/wes_cancer/project# find ~/wes_cancer/project/2.clean_fq/ -name *gz > fq.txt
(wes) root@1100150:~/wes_cancer/project# cat fq.txt |while read id ; do (zcat $id|head -10000 > $(basename $id ".gz")); done
(wes) root@1100150:~/wes_cancer/project# cat fq.txt
/root/wes_cancer/project/2.clean_fq/9Y1640.R1.clean.fastq.gz
/root/wes_cancer/project/2.clean_fq/9Y1640.R2.clean.fastq.gz
(wes) root@1100150:~/wes_cancer/project# ll
total 791
drwxr-xr-x 13 root root     16 Jun 22 06:59 ./
drwxr-xr-x  5 root root      5 Jun 18 08:37 ../
drwxr-xr-x  2 root root      2 Jun 18 09:02 0.sra/
drwxr-xr-x  2 root root      2 Jun 18 09:02 1.raw_fq/
drwxr-xr-x  2 root root      2 Jun 18 09:02 10.signature/
drwxr-xr-x  2 root root      4 Jun 21 06:48 2.clean_fq/
drwxr-xr-x  4 root root      4 Jun 18 09:02 3.qc/
drwxr-xr-x  5 root root      6 Jun 21 06:59 4.align/
drwxr-xr-x  3 root root      3 Jun 18 09:02 5.gatk/
drwxr-xr-x  2 root root      2 Jun 18 09:02 6.mutect/
drwxr-xr-x  6 root root      6 Jun 18 09:02 7.annotation/
drwxr-xr-x  6 root root      6 Jun 18 09:02 8.cnv/
drwxr-xr-x  2 root root      2 Jun 18 09:02 9.pyclone/
-rw-r--r--  1 root root 874855 Jun 22 07:05 9Y1640.R1.clean.fastq
-rw-r--r--  1 root root 874855 Jun 22 07:05 9Y1640.R2.clean.fastq
-rw-r--r--  1 root root    122 Jun 22 06:57 fq.txt
(wes) root@1100150:~/wes_cancer/project# sample='9Y1640'
(wes) root@1100150:~/wes_cancer/project# bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  ~/wes_cancer/data/gatk_hg38 9Y1640.R1.clean.fastq 9Y1640.R2.clean.fastq  | samtools sort -@ 5 -o 7E5239.bam -
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 5000 sequences (708078 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (2, 2045, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (142, 182, 238)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 430)
[M::mem_pestat] mean and std.dev: (193.44, 69.83)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 526)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 5000 reads in 1.654 CPU sec, 0.360 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 5 -R @RG\tID:9Y1640\tSM:9Y1640\tLB:WGS\tPL:Illumina /root/wes_cancer/data/gatk_hg38 9Y1640.R1.clean.fastq 9Y1640.R2.clean.fastq
[main] Real time: 31.238 sec; CPU: 14.275 sec
[bam_sort_core] merging from 0 files and 5 in-memory blocks...
(wes) root@1100150:~/wes_cancer/project# ll
total 1250
drwxr-xr-x 13 root root     17 Jun 22 07:10 ./
drwxr-xr-x  5 root root      5 Jun 18 08:37 ../
drwxr-xr-x  2 root root      2 Jun 18 09:02 0.sra/
drwxr-xr-x  2 root root      2 Jun 18 09:02 1.raw_fq/
drwxr-xr-x  2 root root      2 Jun 18 09:02 10.signature/
drwxr-xr-x  2 root root      4 Jun 21 06:48 2.clean_fq/
drwxr-xr-x  4 root root      4 Jun 18 09:02 3.qc/
drwxr-xr-x  5 root root      6 Jun 21 06:59 4.align/
drwxr-xr-x  3 root root      3 Jun 18 09:02 5.gatk/
drwxr-xr-x  2 root root      2 Jun 18 09:02 6.mutect/
drwxr-xr-x  6 root root      6 Jun 18 09:02 7.annotation/
-rw-r--r--  1 root root 451049 Jun 22 07:10 7E5239.bam
drwxr-xr-x  6 root root      6 Jun 18 09:02 8.cnv/
drwxr-xr-x  2 root root      2 Jun 18 09:02 9.pyclone/
-rw-r--r--  1 root root 874855 Jun 22 07:05 9Y1640.R1.clean.fastq
-rw-r--r--  1 root root 874855 Jun 22 07:05 9Y1640.R2.clean.fastq
-rw-r--r--  1 root root    122 Jun 22 06:57 fq.txt

Jun 22生成的就是模仿后的結(jié)果,說(shuō)明這個(gè)過(guò)程是OK的(這里復(fù)制曾老師的代碼有的名稱(chēng)忘記修改了7E5239.bam,后面跑完整數(shù)據(jù)的時(shí)候進(jìn)行修改

如果是樣本很多需要寫(xiě)循環(huán),然后走正常的數(shù)據(jù)

ls /home/jmzeng/project/boy/clean/*1.fq.gz >1
ls /home/jmzeng/project/boy/clean/*2.fq.gz >2
cut -d"/" -f 7 1 |cut -d"_" -f 1  > 0
paste 0 1 2 > config 
source activate wes
INDEX=/public/biosoft/GATK/resources/bundle/hg38/bwa_index/gatk_hg38
cat config |while read id
do

arr=($id)
fq1=${arr[1]}
fq2=${arr[2]}
sample=${arr[0]}
echo $sample $fq1 $fq2 
 bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina" $INDEX $fq1 $fq2  | samtools sort -@ 5 -o $sample.bam -   

done 

而我只有一個(gè)樣本,則只需要像示例數(shù)據(jù)那樣完成即可:

(wes) root@1100150:~# cd wes_cancer/project/2.clean_fq/
(wes) root@1100150:~/wes_cancer/project/2.clean_fq# ll
total 5561163
drwxr-xr-x  2 root root          4 Jun 21 06:48 ./
drwxr-xr-x 13 root root         14 Jun 22 07:13 ../
-rw-r--r--  1 root root 2897369062 Jun 21 06:48 9Y1640.R1.clean.fastq.gz
-rw-r--r--  1 root root 2798611010 Jun 21 06:50 9Y1640.R2.clean.fastq.gz
(wes) root@1100150:~/wes_cancer/project/2.clean_fq# sample='9Y1640'
(wes) root@1100150:~/wes_cancer/project/2.clean_fq# bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  ~/wes_cancer/data/gatk_hg38 9Y1640.R1.clean.fastq 9Y1640.R2.clean.fastq  | samtools sort -@ 5 -o 9Y1640.bam -
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[E::main_mem] fail to open file `9Y1640.R1.clean.fastq'.
(wes) root@1100150:~/wes_cancer/project/2.clean_fq# sample='9Y1640'
(wes) root@1100150:~/wes_cancer/project/2.clean_fq# bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  ~/wes_cancer/data/gatk_hg38 9Y1640.R1.clean.fastq.gz 9Y1640.R2.clean.fastq.gz  | samtools sort -@ 5 -o 9Y1640.bam -
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 354436 sequences (50000086 bp)...
[M::process] read 353998 sequences (50000030 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (113, 144879, 0, 173)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (128, 187, 267)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 545)
[M::mem_pestat] mean and std.dev: (197.98, 92.21)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 684)
...

由于WES數(shù)據(jù)量很大,即使一個(gè)樣本也要運(yùn)行很久,就不記錄全程了(真的太久了,1個(gè)小時(shí)了還沒(méi)有結(jié)束的跡象,以后這種指令還是要放在后臺(tái)運(yùn)行)。
運(yùn)行完成結(jié)果如下:

[M::mem_process_seqs] Processed 63974 reads in 26.351 CPU sec, 5.102 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 5 -R @RG\tID:9Y1640\tSM:9Y1640\tLB:WGS\tPL:Illumina /root/wes_cancer/data/gatk_hg38 9Y1640.R1.clean.fastq.gz 9Y1640.R2.clean.fastq.gz
[main] Real time: 5758.605 sec; CPU: 28791.178 sec
[bam_sort_core] merging from 30 files and 5 in-memory blocks...
(wes) root@1100150:~/wes_cancer/project/2.clean_fq# ll
total 8745086
drwxr-xr-x  2 root root          5 Jun 22 10:13 ./
drwxr-xr-x 13 root root         14 Jun 22 07:13 ../
-rw-r--r--  1 root root 2897369062 Jun 21 06:48 9Y1640.R1.clean.fastq.gz
-rw-r--r--  1 root root 2798611010 Jun 21 06:50 9Y1640.R2.clean.fastq.gz
-rw-r--r--  1 root root 3261116989 Jun 22 10:13 9Y1640.bam

最簡(jiǎn)單的找變異流程

如果要理解參數(shù)的意思,參考曾老師的 【直播】我的基因組25:用bcftools來(lái)call variation

ref=/public/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta
source activate wes
time samtools mpileup -ugf $ref  *.bam | bcftools call -vmO z -o out.vcf.gz
ls *.bam |xargs -i samtools index {}
## 沒(méi)有去除PCR重復(fù)

去除PCR重復(fù)

理解參數(shù)和教程為什么會(huì)過(guò)時(shí)(生物信息日新月異,要明白代碼的邏輯,會(huì)復(fù)用即可

# samtools markdup -r 7E5241.bam 7E5241.rm.bam
# samtools markdup -S 7E5241.bam 7E5241.mk.bam

完善的GATK流程

source activate wes
GATK=/home/jmzeng/biosoft/gatk4/gatk-4.0.6.0/gatk
ref=/public/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta
snp=/public/biosoft/GATK/resources/bundle/hg38/dbsnp_146.hg38.vcf.gz
indel=/public/biosoft/GATK/resources/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz  

for sample in {7E5239.L1,7E5240,7E5241.L1}
do 
echo $sample  
# Elapsed time: 7.91 minutes
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./" MarkDuplicates \
    -I $sample.bam \
    -O ${sample}_marked.bam \
    -M $sample.metrics \
    1>${sample}_log.mark 2>&1 

## Elapsed time: 13.61 minutes
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./" FixMateInformation \
    -I ${sample}_marked.bam \
    -O ${sample}_marked_fixed.bam \
    -SO coordinate \
    1>${sample}_log.fix 2>&1 

samtools index ${sample}_marked_fixed.bam

##  17.2 minutes
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./"  BaseRecalibrator \
    -R $ref  \
    -I ${sample}_marked_fixed.bam  \
    --known-sites $snp \
    --known-sites $indel \
    -O ${sample}_recal.table \
    1>${sample}_log.recal 2>&1 

$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./"   ApplyBQSR \
    -R $ref  \
    -I ${sample}_marked_fixed.bam  \
    -bqsr ${sample}_recal.table \
    -O ${sample}_bqsr.bam \
    1>${sample}_log.ApplyBQSR  2>&1 

## 使用GATK的HaplotypeCaller命令
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./" HaplotypeCaller \
     -R $ref  \
     -I ${sample}_bqsr.bam \
      --dbsnp $snp \
      -O ${sample}_raw.vcf \
      1>${sample}_log.HC 2>&1  

done 

檢查感興趣基因區(qū)域內(nèi)比對(duì)和找變異情況

通過(guò)IGV可視化來(lái)加深自己對(duì)這個(gè)流程的把握和理解。

chr17   HAVANA  gene    43044295        43170245
3.5G Jul 21 18:01 7E5240.bam
7.1G Jul 21 21:40 7E5240_bqsr.bam
4.7G Jul 21 20:28 7E5240_marked.bam
4.8G Jul 21 20:44 7E5240_marked_fixed.bam

把這些bam里面的BRCA1基因的reads拿出來(lái):

samtools  view -h  7E5240.bam chr17:43044295-43170245 |samtools sort -o  7E5240.brca1.bam -
samtools  view -h  7E5240_bqsr.bam chr17:43044295-43170245 |samtools sort -o  7E5240_bqsr.brca1.bam -
samtools  view -h  7E5240_marked.bam chr17:43044295-43170245 |samtools sort -o  7E5240_marked.brca1.bam -
samtools  view -h  7E5240_marked_fixed.bam chr17:43044295-43170245 |samtools sort -o  7E5240_marked_fixed.brca1.bam -
ls  *brca1.bam|xargs -i samtools index {}

有了這些特定基因區(qū)域的bam,就可以針對(duì)特定基因找變異

source activate wes
ref=/public/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta
samtools mpileup -ugf $ref   7E5240_bqsr.brca1.bam   | bcftools call -vmO z -o 7E5240_bqsr.vcf.gz

所有樣本走samtools mpileup 和bcftools call 流程

仍然是參考我的直播基因組 【直播】我的基因組25:用bcftools來(lái)call variation

source activate wes
wkd=/home/jmzeng/project/boy
cd $wkd/align 
ls  *_bqsr.bam  |xargs -i samtools index {}
ref=/public/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta
nohup samtools mpileup -ugf $ref   *_bqsr.bam | bcftools call -vmO z -o all_bqsr.vcf.gz & 

比對(duì)及找變異結(jié)果的質(zhì)控

raw和clean的fastq文件都需要使用fastqc質(zhì)控。

比對(duì)的各個(gè)階段的bam文件都可以質(zhì)控。

使用qualimap對(duì)wes的比對(duì)bam文件總結(jié)測(cè)序深度及覆蓋度。

source activate wes
wkd=/home/jmzeng/project/boy
cd $wkd/clean
ls *.gz |xargs fastqc -t 10 -o ./

cd $wkd/align
rm *_marked*.bam
ls  *.bam  |xargs -i samtools index {} 
ls  *.bam  | while read id ;do (samtools flagstat $id > $(basename $id ".bam").stat);done

conda install bedtools
cat /public/annotation/CCDS/human/CCDS.20160908.txt  |perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i |awk '{print "chr"$0"\t0\t+"}'  > $wkd/align/hg38.exon.bed 

exon_bed=hg38.exon.bed 
ls  *_bqsr.bam | while read id;
do
sample=${id%%.*}
echo $sample
qualimap bamqc --java-mem-size=20G -gff $exon_bed -bam $id  & 
done 

### featureCounts 
gtf=/public/reference/gtf/gencode/gencode.v25.annotation.gtf.gz
featureCounts -T 5 -p -t exon -g gene_id    \
-a  $gtf   *_bqsr.bam -o  all.id.txt  1>counts.id.log 2>&1 & 

比較兩個(gè)找變異工具的區(qū)別

chr1    139213  .       A       G       388     .   
DP=984;VDB=0.831898;SGB=-223.781;RPB=0.599582;MQB=0.0001971;MQSB=0.00457372;BQB=2.59396e-09;MQ0F=0.281504;ICB=0.333333;HOB=0.5;AC=3;AN=6;DP4=371,169,160,84;MQ=21
GT:PL   0/1:198,0,255   0/1:176,0,255   0/1:50,0,158

chr1    139213  rs370723703 A   G   3945.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-2.999;ClippingRankSum=0.000;DB;DP=244;ExcessHet=3.0103;FS=2.256;MLEAC=1;MLEAF=0.500;MQ=29.33;MQRankSum=-0.929;QD=16.17;ReadPosRankSum=1.462;SOR=0.863  GT:AD:DP:GQ:PL  0/1:136,108:244:99:3974,0,6459
chr1    139213  rs370723703 A   G   2261.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-1.191;ClippingRankSum=0.000;DB;DP=192;ExcessHet=3.0103;FS=9.094;MLEAC=1;MLEAF=0.500;MQ=32.03;MQRankSum=-0.533;QD=11.78;ReadPosRankSum=0.916;SOR=0.321  GT:AD:DP:GQ:PL  0/1:126,66:192:99:2290,0,7128
chr1    139213  rs370723703 A   G   2445.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-2.495;ClippingRankSum=0.000;DB;DP=223;ExcessHet=3.0103;FS=10.346;MLEAC=1;MLEAF=0.500;MQ=30.18;MQRankSum=0.486;QD=10.97;ReadPosRankSum=-0.808;SOR=0.300 GT:AD:DP:GQ:PL  0/1:152,71:223:99:2474,0,7901

VCF下游分析

主要是:注釋和過(guò)濾

注釋

VEP,snpEFF,ANNOVAR

1.Annovar使用記錄 (http://www.bio-info-trainee.com/641.html)

2.用annovar對(duì)snp進(jìn)行注釋 (http://www.bio-info-trainee.com/441.html)

3.對(duì)感興趣的基因call variation(http://www.bio-info-trainee.com/2013.html)

4.WES(六)用annovar注釋(http://www.bio-info-trainee.com/1158.html)

cd ~/biosoft/
ln -s /public/biosoft/ANNOVAR/ ANNOVAR
source activate wes
wkd=/home/jmzeng/project/boy
cd $wkd/mutation 
 mv ../align/*.vcf ./

 ~/biosoft/ANNOVAR/annovar/convert2annovar.pl  -format vcf4old    7E5240_raw.vcf  >7E5240.annovar

~/biosoft/ANNOVAR/annovar/annotate_variation.pl \
-buildver hg38  \
--outfile 7E5240.anno \
7E5240.annovar \
~/biosoft/ANNOVAR/annovar/humandb/

其實(shí)并不一定需要vcf文件,軟件需要的只是染色體加上坐標(biāo)即可,對(duì)于我們的vcf格式的變異文件, 軟件通常會(huì)進(jìn)行一定程度的格式化之后再進(jìn)行注釋。這里的注釋主要有三種方式,分別是:

如果要使用annovar一次性注釋多個(gè)數(shù)據(jù)庫(kù)

dir=/home/jianmingzeng/biosoft/ANNOVAR/annovar
db=$dir/humandb/ 
ls $db

wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/archive_2.0/2018/clinvar_20180603.vcf.gz 

perl $dir/annotate_variation.pl  --downdb --webfrom annovar --buildver hg38 
clinvar_20180603 $db
# perl $bin  --downdb --webfrom annovar --buildver hg38 gnomad_genome $db
mkdir annovar_results 
 $dir/convert2annovar.pl -format vcf4old highQ.vcf  1> highQ.avinput  2>/dev/null 
perl $dir/annotate_variation.pl  -buildver hg38 -filter -dbtype clinvar_20180603  --outfile annovar_results/highQ_clinvar  highQ.avinput  $db

perl $dir/table_annovar.pl   \
-buildver hg38 \
highQ.avinput  $db \
-out test \
-remove -protocol \
refGene,clinvar_20170905 \
-operation g,r \
-nastring NA 

補(bǔ)充作業(yè)

使用 Variant Effect Predictor 對(duì)所有遺傳變異進(jìn)行注釋。過(guò)濾掉 dbSNP 數(shù)據(jù)庫(kù)和千人基因組計(jì)劃數(shù)據(jù)庫(kù)中已知的 SNP。

應(yīng)用 OMIM 數(shù)據(jù)庫(kù)(http://omim.org/)查詢(xún)蛋白 的結(jié)構(gòu)及功能。利用 SIFT ,PolyPhen-2 以及 PROVEAN 軟件, 預(yù)測(cè) SNV 對(duì)蛋白質(zhì)功能的影響 程度,僅當(dāng) 3 種軟件均預(yù)測(cè)同一遺傳變異對(duì)蛋白質(zhì) 的功能影響較大時(shí),才認(rèn)定該遺傳變異具有高危害 性。利用 PROVEAN 軟件 預(yù)測(cè) Indel 對(duì)蛋白質(zhì)功 能的影響。

其實(shí)dbNSFP數(shù)據(jù)庫(kù),就注釋了這些變異對(duì)蛋白功能的影響。

變異位點(diǎn)的過(guò)濾

使用 GATK的 Joint Calling , 過(guò)濾參考:https://mp.weixin.qq.com/s/W8Vfv1WmW6M7U0tIcPtlng

注意很多資料會(huì)過(guò)時(shí)

比如雖然可以找到了gatk3代碼:http://baserecalibrator1.rssing.com/chan-10751514/all_p13.html 但是已經(jīng)可以拋棄了,也就是說(shuō)教程經(jīng)常會(huì)過(guò)時(shí)。

GVCF 教程

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書(shū)系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

友情鏈接更多精彩內(nèi)容