嘗試一下jimmy的WES流程，雖然現(xiàn)在云生信勢(shì)頭正猛，做生信的門(mén)檻對(duì)普通人來(lái)說(shuō)降低了，對(duì)專業(yè)人士提高了許多，但是誰(shuí)能阻止我們學(xué)習(xí)的步伐呢？相信那句「沒(méi)有白走的路」，繼續(xù)學(xué)習(xí)，學(xué)習(xí)是我的愛(ài)好和樂(lè)趣。

這是生信技能樹(shù)上帖子的地址，上面有所有代碼，贊！一個(gè)如此熱愛(ài)分享的大神！http://www.biotrainee.com/thread-122-1-1.html

一、軟件和數(shù)據(jù)準(zhǔn)備

1.軟件安裝

我的原則是能用conda解決的不去wget，所以就conda解決了大部分軟件安裝問(wèn)題。我的系統(tǒng)黑蘋(píng)果10.11.5（EI Capitan），遠(yuǎn)景論壇上下載的大神配置好的clover配置文件，基本完美，我們的目的不是裝系統(tǒng)，而是用嘛，所以也不深入研究了。

 thinkpad-E431
 處理器：Intel Core i3-3120M    2*2.49 GHz
 內(nèi)存：12 GB

conda install bwa
#bwa是基于BWT的生物序列比對(duì)工具:bwa是為二代測(cè)序的短序列比對(duì)參考序列（reference，fasta格式）而開(kāi)發(fā)的比對(duì)軟件。
conda install samtools
#samtools是目前廣泛使用額關(guān)于處理bam/sam文件的工具。從測(cè)序儀得到的fastq文件經(jīng)過(guò)mapping后得到二進(jìn)制的bam文件，而sam則是它的十進(jìn)制版本。
conda install gatk
#gatk主要用于從sequencing 數(shù)據(jù)中進(jìn)行variant calling，包括SNP、INDEL。比如現(xiàn)在風(fēng)行的exome sequencing找variant，一般通過(guò)BWA+GATK的pipeline進(jìn)行數(shù)據(jù)分析。
wget https://software.broadinstitute.org/gatk/download/auth?package=GATK
#上面只是安裝了個(gè)gatk-register，仍需要下載gatk用gatk-register安裝
tar -jxvf GenomeAnalysisTK-3.8-0.tar.bz2
cd GenomeAnalysisTK-3.8-0-ge9d806836/
mv GenomeAnalysisTK.jar ../../biosofts/
gatk-register ~/biosofts/GenomeAnalysisTK.jar
#拷貝到目標(biāo)文件夾(安裝目錄)

conda install picard
#picard是一套操作高通量測(cè)序（HTS）數(shù)據(jù)的命令行工具(java)，格式如SAM/BAM/CRAM和VCF。這些文件格式是在Hts規(guī)范存儲(chǔ)庫(kù)中定義的。尤其是SAM規(guī)范和VCF規(guī)范。

2.數(shù)據(jù)準(zhǔn)備

1）模擬產(chǎn)生數(shù)據(jù)

因?yàn)閷?shí)際的數(shù)據(jù)太大了，對(duì)我這種弱弱的計(jì)算機(jī)簡(jiǎn)直是一種折磨，我下過(guò)全部的sra序列，轉(zhuǎn)換成fastq后來(lái)比對(duì)，比對(duì)就是花了幾小時(shí)，什么動(dòng)靜也沒(méi)有。所以，對(duì)于學(xué)習(xí)來(lái)說(shuō)，產(chǎn)生一點(diǎn)模擬數(shù)據(jù)是不錯(cuò)的，jimmy人性化的為我們提供了pirs這個(gè)軟件來(lái)模擬產(chǎn)生測(cè)序數(shù)據(jù)。

我表示裝這個(gè)軟件折騰了良久，不過(guò)終于可以使用了。首先這個(gè)軟件依賴于zlib和boost編譯，我才開(kāi)始嘗試下載源碼編譯，好像是成功了。但是，我不放心，于是用brew安裝了一下測(cè)試了一下。

brew install zlib
brew install boost

然后是pirs這個(gè)軟件的安裝：

大概我只是按步驟裝了下就可以用了，中間有點(diǎn)報(bào)錯(cuò)，是sh文件里有個(gè)路徑錯(cuò)誤，我還以為是軟件的嚴(yán)重錯(cuò)誤。

wget ftp://ftp.genomics.org.cn/pub/pIRS/pIRS_111.tgz #下載軟件

tar zxvf pIRS_111.tgz #解壓

cd pIRS_111/

make #提示說(shuō)沒(méi)有源碼不需要編譯于是嘗試運(yùn)行了一下，竟然可以運(yùn)行。

mkdir -p data reference 
pirs simulate -i reference/chr1.fa \
-s ~/biosofts/pirs/pIRS_111/Profiles/Base-Calling_Profiles/humNew.PE100.matrix.gz \
-d ~/biosofts/pirs/pIRS_111/Profiles/GC-depth_Profiles/humNew.gcdep_200.dat  \
-b ~/biosofts/pirs/pIRS_111/Profiles/InDel_Profiles/phixv2.InDel.matrix \
-l 100 -x 50 -Q 33 -o A

pirs軟件參數(shù)學(xué)習(xí)：

simulate：模擬illumina reads數(shù)據(jù)

-i：參考基因組序列

-s <double> 雜合 SNP rate (default=0.001)

-d <double> GC內(nèi)容覆蓋深度？

-b <string> input InDel-error profile,input InDel-error profile for simulating InDel-error of reads, (default=(exe_path)/Profiles/InDel_Profiles/phixv2.InDel.matrix)

-l:read讀長(zhǎng)

-x:測(cè)序覆蓋深度，默認(rèn)5

-Q:33/64

-o <string> output prefix (default=Illumina)

#程序運(yùn)行過(guò)程：
Start to preview error profile...
In Base-calling profile:
    Ref_Base_num: 4
    Statistical_Cycle_num: 200
    Seq_Base_num: 4
    Quality_num: 41
    100bp reads total average substitution error rate: 0.00372389
Start to construct simulation matrix...
Loading file: /Users/zd200572/biosofts/pirs/pIRS_111/Profiles/Base-Calling_Profiles/humNew.PE100.matrix.gz
Have finished constructing Base-calling simulation matrix1
Start to construct simulation matrix...
Loading file: /Users/zd200572/biosofts/pirs/pIRS_111/Profiles/Base-Calling_Profiles/humNew.PE100.matrix.gz
Have finished constructing Base-calling simulation matrix2
Loading the GC bias profile: /Users/zd200572/biosofts/pirs/pIRS_111/Profiles/GC-depth_Profiles/humNew.gcdep_200.dat

Have finished constructing GC bias simulation matrix
Start to construct InDel-error matrix...
Loading file: /Users/zd200572/biosofts/pirs/pIRS_111/Profiles/InDel_Profiles/phixv2.InDel.matrix

For simulating GC bias:
    GC% abundance
    0   0.00422544
    1   0.00624158
    2   0.00912628
    ...
In InDel-error profile:
    profile_Cycle_num: 200
    read1 count: 147091454
    read2 count: 145785569
    length of max InDel: 3
    insertion-base rate of 100-bp read1: 4.77512e-06
    deletion-base rate of 100-bp read1: 1.11823e-05
    insertion-base rate of 100-bp read2: 8.36749e-06
    deletion-base rate of 100-bp read2: 1.55069e-05
Have finished constructing InDel-error simulation matrix
Have finished reading scaffold chr1
Begin to output reads...
Output 10000 pair reads
Output 20000 pair reads
...
Finish output reads

#由于50x速度實(shí)在生成的太慢，2x的試試，哈哈475s搞定。

In ouput data:
    sum of Insertion in read1 file: 1211
    sum of Deletion in read1 file: 2814
    sum of Insertion in read2 file: 2103
    sum of Deletion in read2 file: 3825
All done! Run time: 475s.
#對(duì)比一下50x的
In ouput data:
    sum of Insertion in read1 file: 30374
    sum of Deletion in read1 file: 70211
    sum of Insertion in read2 file: 52633
    sum of Deletion in read2 file: 95478
All done! Run time: 11248s.

2）真實(shí)的wes數(shù)據(jù)（不是必需）

沒(méi)看到在流程中用到，下下來(lái)以后備用，有100G之巨。測(cè)試了一下，對(duì)于我的電腦配置，單單bwa比對(duì)運(yùn)行要花100個(gè)小時(shí)。。。

# ~100GB disk space required to download the NA12878

wget [url]ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR262/ERR262997/ERR262997_1.fastq.gz[/url]
wget [url]ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR262/ERR262997/ERR262997_2.fastq.gz[/url]

3）參考基因組

cd reference
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz
tar zxvf hg38.chromFa.tar.gz
cd chroms
for i in $(seq 1 22) X Y M; do cat chr${i}.fa >> hg19.fasta; done
cp chr1.fa ../
rm -fr chr*.fasta
cd ../

4）基因組index和dictionary的獲得

bwa index -p hg38.chr1 chr1.fa
samtools faidx hg38.fa
picard CreateSequenceDictionary.jar R=hg38.fasta O=hg38.dict

5）其他參考數(shù)據(jù)

# dbSNP138
 wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp150.txt.gz

# GATK resource bundle

GATK_FTP=ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/hg38

# download library file for GATK's VariantRecalibrator

for f in hapmap_3.3.hg38.vcf.gz 1000G_omni2.5.hg38.vcf.gz 1000G_phase1.indels.hg38.vcf.gz Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
do
curl --remote-name ${GATK_FTP}/${f}
done

二、軟件運(yùn)行

1.bwa比對(duì)

由于對(duì)于一個(gè)軟件來(lái)說(shuō)，功能太多，不可能一一覆蓋，于是，我就先學(xué)習(xí)這次用到的參數(shù)，順便對(duì)軟件做一個(gè)大概了解，由于是初學(xué)，目標(biāo)先定低一點(diǎn)，先了解大概，以后做項(xiàng)目再一一學(xué)習(xí)。

bwa的mem參數(shù)：MEM是bwa中最新的算法，基本取代了前兩種，目的是將測(cè)序reads或組裝的contig比對(duì)到Reference上。（Bio-Xin博客：http://www.cnblogs.com/leezx/p/6226717.html）

-t參數(shù)，線程數(shù)；-R參數(shù)：設(shè)置reads標(biāo)頭，“\t”分割；M——將較短的split hits標(biāo)記為secondary，與picard兼容；后邊跟參考基因組、reads文件和>以及要生成的sam文件。（如果GATK call SNP 必須用-r 參數(shù)）

R1=/mnt/A_100_500_1.fq.gz
R2=/mnt/A_100_500_2.fq.gz
#align with BWA, with 4 threads (-t 4). Reading Group is required (-R ...).
bwa mem -t 4 -R '@RG\tID:group_id\tPL:illumina\tSM:sample_id' $DATA/hg19.chr1/hg19.chr1 ${R1} ${R2} > A.chr1.sam
Real time: 457.713 sec; CPU: 1489.602 sec
#2x的數(shù)據(jù)
 Real time: 12440.568 sec; CPU: 43493.514 sec #50x

2.picard

1）SortSam工具

由BWA生成的sam文件時(shí)按字典式排序法進(jìn)行的排序（lexicographically）進(jìn)行排序的（chr10，chr11…chr19，chr1，chr20…chr22，chr2，chr3…chrM，chrX，chrY），但是GATK在進(jìn)行callsnp的時(shí)候是按照染色體組型（karyotypic）進(jìn)行的（chrM，chr1，chr2…chr22，chrX，chrY），因此要對(duì)原始sam文件進(jìn)行reorder?？梢允褂胮icard-tools中的ReorderSam完成，同時(shí)格式轉(zhuǎn)換成了bam。（https://www.plob.org/article/7009.html）

picard SortSam CREATE_INDEX=True SORT_ORDER=coordinate VALIDATION_STRINGENCY=LENIENT INPUT=A.chr1.sam OUTPUT=A.chr1.sort.bam
#2x運(yùn)行情況 
picard.sam.SortSam done. Elapsed time: 4.54 minutes.
Runtime.totalMemory()=872415232
#50x

2）MarkDuplicates工具

過(guò)量擴(kuò)增的reads并不是基因組自身固有序列，不能作為變異檢測(cè)的證據(jù)，因此，要盡量去除這些由PCR擴(kuò)增所形成的duplicates，這一步可以使用picard-tools來(lái)完成。去重復(fù)的過(guò)程是給這些序列設(shè)置一個(gè)flag以標(biāo)志它們，方便GATK的識(shí)別。還可以設(shè)置 REMOVE_DUPLICATES=true 來(lái)丟棄duplicated序列。對(duì)于是否選擇標(biāo)記或者刪除，對(duì)結(jié)果應(yīng)該沒(méi)有什么影響，GATK官方流程里面給出的例子是僅做標(biāo)記不刪除。這里定義的重復(fù)序列是這樣的：如果兩條reads具有相同的長(zhǎng)度而且比對(duì)到了基因組的同一位置，那么就認(rèn)為這樣的reads是由PCR擴(kuò)增而來(lái)，就會(huì)被GATK標(biāo)記。

picard MarkDuplicates CREATE_INDEX=true REMOVE_DUPLICATES=True ASSUME_SORTED=True VALIDATION_STRINGENCY=LENIENT METRICS_FILE=/dev/null INPUT=A.chr1.sort.bam OUTPUT=A.chr1.sort.dedup.bam
#2x運(yùn)行情況
picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 5.15 minutes.
Runtime.totalMemory()=822083584

3）FixMateInformation

這一步驟在網(wǎng)上搜到了這個(gè)。。。（看來(lái)我得好好理出一個(gè)現(xiàn)在使用的流程來(lái)，生物信息的流程發(fā)展還真是快呀?。┰谝恍├习姹镜膒ipeline中(如seqanswers上的那個(gè))，這一步做完后還要對(duì)pair-end數(shù)據(jù)的mate information進(jìn)行fix。不過(guò)新版本已經(jīng)不需要這一步了，Indel realign可以自己fix mate。（http://www.bbioo.com/lifesciences/40-115982-1.html）

picard FixMateInformation SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true INPUT=A.chr1.sort.dedup.bam OUTPUT=A.chr1.sort.dedup.mate.bam
#2x運(yùn)行情況
picard.sam.FixMateInformation done. Elapsed time: 6.36 minutes.
Runtime.totalMemory()=877658112

3.gatk

終于到了大頭gatk出場(chǎng)了，我表示我對(duì)它好陌生，幾乎是第一次運(yùn)行它。通過(guò)上面可以看出他是一個(gè)經(jīng)常更新的程序，而且它雖然是一個(gè)java程序，但是它有好幾個(gè)子工具，就像上面的picard。先看看它的流程，官方的：

https://software.broadinstitute.org/gatk/img/BP_workflow_3.6.png

1)RealignerTargetCreator

對(duì)INDEL周圍進(jìn)行realignment，這個(gè)操作有兩個(gè)步驟，第一步生成需要進(jìn)行realignment的位置的信息, 輸出一個(gè)包含著possible indels的文件。

GenomeAnalysisTK -R reference/chr1.fa \
-T RealignerTargetCreator \
-I A.chr1.sort.dedup.mate.bam -o A.intervals

2)IndelRealigner

第二步才是對(duì)這些位置進(jìn)行realignment, 最后生成一個(gè)realin好的BAM文件sample.realn.bam

GenomeAnalysisTK -R reference/chr1.fa \
-T IndelRealigner \
-targetIntervals A.intervals \
-I A.chr1.sort.dedup.mate.bam -o A.chr1.sort.dedup.mate.relaign.bam

3)BaseRecalibrator

（對(duì)base的quality score進(jìn)行校正）堿基質(zhì)量分?jǐn)?shù)重校準(zhǔn)（Base quality score recalibration，BQSR)，就是利用機(jī)器學(xué)習(xí)的方式調(diào)整原始?jí)A基的質(zhì)量分?jǐn)?shù)。它分為兩個(gè)步驟:

利用已有的snp數(shù)據(jù)庫(kù)，建立相關(guān)性模型，產(chǎn)生重校準(zhǔn)表( recalibration table)
根據(jù)這個(gè)模型對(duì)原始?jí)A基進(jìn)行調(diào)整，只會(huì)調(diào)整非已知SNP區(qū)域。

（參考自生信媛--GATK之BaseRecalibrator）

GenomeAnalysisTK -R reference/chr1.fa \
-T BaseRecalibrator \
-knownSites $DATA/dbsnp_138.hg19.vcf \
-o A.recal \
-I A.chr1.sort.dedup.mate.relaign.bam

4）UnifiedGenotyper

這一步提示jimmy的代碼不能用了，說(shuō)3.7以后采用了新的參數(shù)，更新好快。

使用UnifiedGenotyper尋找variant。論壇搜索發(fā)現(xiàn)，現(xiàn)在推薦用HaplotypeCaller，效果更好。

GenomeAnalysisTK -T UnifiedGenotyper -R chr1.fa -I A.chr1.sort.dedup.mate.relaign.recal.bam --dbsnp ../common_all_20170710.vcf -o snps.raw.vcf -stand_call_conf 50.0

5）HaplotypeCaller

同樣是更新了的參數(shù)處理方式。

由于HaplotypeCaller的工作原理，直接省去了BQSR和indel realignment步驟，所以對(duì)于一個(gè)variant calling流程而言，可以直接比對(duì)，去重復(fù)后運(yùn)行HaplotypeCaller。

HaplotypeCaller，簡(jiǎn)稱HC，能過(guò)通過(guò)對(duì)活躍區(qū)域（也就是與參考基因組不同處較多的區(qū)域）局部重組裝，同時(shí)尋找SNP和INDEL。換句話說(shuō)，當(dāng)HC看到一個(gè)地方好活躍呀，他就不管之前的比對(duì)結(jié)果了，直接對(duì)這個(gè)地方進(jìn)行重新組裝，所以就比傳統(tǒng)的基于位置（position-based)的工具，如前任UnifiedGenotyper好用多了。

HC可以處理非二倍體物種和混池實(shí)驗(yàn)數(shù)據(jù)，但是不推薦用于癌癥或者體細(xì)胞。
HC還可以正確處理可變剪切，所以也能用于RNA-Seq數(shù)據(jù)。

參考自生信媛--生信媛的GATK大禮包

GenomeAnalysisTK -R chr1.fa -T HaplotypeCaller --dbsnp ../common_all_20170710.vcf -stand_call_conf 30 -o A.hc.vcf -I A.chr1.sort.dedup.mate.relaign.recal.bam

4.總結(jié)

雖然馬馬虎虎走了一次最基本的幾個(gè)步驟，發(fā)現(xiàn)學(xué)習(xí)之路任重而道遠(yuǎn)呀！我只是個(gè)初學(xué)者，加油！

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

WES學(xué)習(xí)筆記

WES學(xué)習(xí)筆記

一、軟件和數(shù)據(jù)準(zhǔn)備

1.軟件安裝

2.數(shù)據(jù)準(zhǔn)備

1）模擬產(chǎn)生數(shù)據(jù)

2）真實(shí)的wes數(shù)據(jù)（不是必需）

3）參考基因組

4）基因組index和dictionary的獲得

5）其他參考數(shù)據(jù)

二、軟件運(yùn)行

1.bwa比對(duì)

2.picard

1）SortSam工具

2）MarkDuplicates工具

3）FixMateInformation

3.gatk

1)RealignerTargetCreator

2)IndelRealigner

3)BaseRecalibrator

4）UnifiedGenotyper

5）HaplotypeCaller

4.總結(jié)

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九 欧美,1769亚洲,黄色成人av

WES學(xué)習(xí)筆記

一、軟件和數(shù)據(jù)準(zhǔn)備

1.軟件安裝

2.數(shù)據(jù)準(zhǔn)備

1）模擬產(chǎn)生數(shù)據(jù)

2）真實(shí)的wes數(shù)據(jù)（不是必需）

3）參考基因組

4）基因組index和dictionary的獲得

5）其他參考數(shù)據(jù)

二、軟件運(yùn)行

1.bwa比對(duì)

2.picard

1）SortSam工具

2）MarkDuplicates工具

3）FixMateInformation

3.gatk

1)RealignerTargetCreator

2)IndelRealigner

3)BaseRecalibrator

4）UnifiedGenotyper

5）HaplotypeCaller

4.總結(jié)

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

一、軟件和數(shù)據(jù)準(zhǔn)備