性染色體的鑒定(GATK4)

寫在前面

性別被稱為“進化生物學(xué)問題的皇后”,一直是生命科學(xué)領(lǐng)域中最具吸引力和最熱門研究方向之一。而有性生物其性染色體的解析是性別研究的基礎(chǔ),這篇文章主要介紹一套鑒定性染色體的流程。有性生物的性別決定系統(tǒng)主要有兩種:雄性性別決定系統(tǒng)(XX/XY)和雌性性別決定系統(tǒng)(ZZ/ZW)。因此需要鑒定性染色體X/Y(Z/W)到底是基因組的哪條染色體。目前主流鑒定方式有兩種:
一是利用雌雄測序reads(二代三代都行)在基因組上的覆蓋度差異來區(qū)分常染色體和性染色染體,這種方式適合X和Y(或Z和W)分化差異較大的時候。如果以YY基因組為參考,雄性reads在Y特異的區(qū)域的覆蓋度是在常染色體/假常染色體(PAR)的一半,而雌性reads而則在Y特異的區(qū)域完全沒有覆蓋度;同樣地如果以XX基因組為參考,雄性reads在X特異的區(qū)域的覆蓋度也是在常染色體/假常染色體(PAR)的一半,而雌性reads而則在X特異的區(qū)域的區(qū)域與常染色體/假常染色體(PAR)沒有區(qū)別。

Fig. 2A, https://doi.org/10.1073/pnas.2121469119

二是利用X和Y(或Z和W)同源重組抑制產(chǎn)生的大量的SNP來鑒定,我們將一定數(shù)量的雌雄個人重測序reads比對到參考基因組 ,獲得了各個樣本的SNP,篩選在符合要求的SNP,看看其主要位于哪條染色體上。比如我們?nèi)绻訷Y基因組為參考,就可以把所有雄性個體其基因型為雜合突變(0/1)而雌性個體其基因型為純合突變(1/1)的SNP位點篩選出來,看他們位于哪里;同樣地如果以XX基因組為參考,就需要篩出所有雄性個體其基因型為雜合突變(0/1)而雌性個體其基因型為未突變(0/0)的SNP位點。
Fig 2C,https://doi.org/10.1371/journal.pgen.1008013

這次我用第二種方式,用雌雄覆蓋度下次有空再學(xué)習(xí)下!

大致的流程:

對比: 用bwa對重測序數(shù)據(jù)進行比對,拿到個樣本的bam文件;
snp calling: 對拿到的bam用picard標記重復(fù),隨后將標重的bam用gatk去call SNP,拿到每個樣本的gvcf文件;
joint calling: 將所有樣本的gvcf還是用gatk合并成一個gvcf并聯(lián)合分型生成vcf文件;
snp篩選并作圖: 對vcf文件進行過濾,并篩選出符合要求的snp并作圖展示出來。

1.比對

相關(guān)軟件:
bwa,ParaFly,samtools。 ParaFly可以批量并行命令,因為這里有多個樣本需要比對,彼此之間是獨立的可以并行。也可以不裝,可以依次進行單個樣本比對,或者同時提交多個任務(wù),就是麻煩點。
bwa主頁:https://github.com/lh3/bwa
ParaFly主頁:https://parafly.sourceforge.net/

## bwa手動安裝
git clone https://github.com/lh3/bwa.git
cd bwa
make

## ParaFly自動安裝
conda install -c bioconda parafly

#1.對參考基因組建立bwa以及samtools引索index

ln -s ../ref/YY2.fa caYY.fa
/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools faidx    caYY.fa
/newlustre/home/jfgui/Wangtao/software/bwa/bwa index  caYY.fa

#2.生成所有樣本的比對列表文件mapping.cmd.list
當(dāng)然生成mapping.list方法有很多,按照自己方法來就行,實在不行久就手打所有命令行。
方法一:先生成一個所有樣本的樣本名稱文件sample.list,再用awk print生成一個所有樣本的比對列表mapping.cmd.list。

#生成樣本名稱文件
ls ../reseq-data/ca-mc/*_1.* | xargs -n 1 basename | sed 's/_1.fq.gz//'  > sample.list  # 用ls命令加basename導(dǎo)出一個只含有R1端樣本的名稱文件,再用sed命令把后面的字符替換掉。
#生成比對列表文件 (這里需要注意的點是-R參數(shù)外面的單引號和\t需要反斜杠來轉(zhuǎn)義)
awk '{print "/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '\''@RG\\tID:"$1"\\tSM:"$1"\\tLB:"$1"\\tPL:ILLUMINA'\'' caYY.fa ../reseq-data/ca-mc/"$1"_1.fq.gz ../reseq-data/ca-mc/"$1"_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/"$1".bam"}' sample.list > mapping.cmd.list

#####################################
示例結(jié)果如下:
$cat sample.list
female1-728
female2-733
female3-735
female4-736
female5-737
female6-755
female7-729
female8-730
female9-732
male1-752
male2-753
male3-738
male4-739
male5-740
male6-746
male7-750
male8-747
male9-748
$cat mapping.cmd.list
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female1-728\tSM:female1-728\tLB:female1-728\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female1-728_1.fq.gz ../reseq-data/ca-mc/female1-728_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/female1-728.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female2-733\tSM:female2-733\tLB:female2-733\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female2-733_1.fq.gz ../reseq-data/ca-mc/female2-733_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/female2-733.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female3-735\tSM:female3-735\tLB:female3-735\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female3-735_1.fq.gz ../reseq-data/ca-mc/female3-735_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/female3-735.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female4-736\tSM:female4-736\tLB:female4-736\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female4-736_1.fq.gz ../reseq-data/ca-mc/female4-736_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/female4-736.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female5-737\tSM:female5-737\tLB:female5-737\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female5-737_1.fq.gz ../reseq-data/ca-mc/female5-737_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/female5-737.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female6-755\tSM:female6-755\tLB:female6-755\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female6-755_1.fq.gz ../reseq-data/ca-mc/female6-755_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/female6-755.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female7-729\tSM:female7-729\tLB:female7-729\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female7-729_1.fq.gz ../reseq-data/ca-mc/female7-729_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/female7-729.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female8-730\tSM:female8-730\tLB:female8-730\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female8-730_1.fq.gz ../reseq-data/ca-mc/female8-730_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/female8-730.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female9-732\tSM:female9-732\tLB:female9-732\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female9-732_1.fq.gz ../reseq-data/ca-mc/female9-732_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/female9-732.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male1-752\tSM:male1-752\tLB:male1-752\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male1-752_1.fq.gz ../reseq-data/ca-mc/male1-752_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/male1-752.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male2-753\tSM:male2-753\tLB:male2-753\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male2-753_1.fq.gz ../reseq-data/ca-mc/male2-753_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/male2-753.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male3-738\tSM:male3-738\tLB:male3-738\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male3-738_1.fq.gz ../reseq-data/ca-mc/male3-738_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/male3-738.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male4-739\tSM:male4-739\tLB:male4-739\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male4-739_1.fq.gz ../reseq-data/ca-mc/male4-739_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/male4-739.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male5-740\tSM:male5-740\tLB:male5-740\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male5-740_1.fq.gz ../reseq-data/ca-mc/male5-740_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/male5-740.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male6-746\tSM:male6-746\tLB:male6-746\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male6-746_1.fq.gz ../reseq-data/ca-mc/male6-746_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/male6-746.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male7-750\tSM:male7-750\tLB:male7-750\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male7-750_1.fq.gz ../reseq-data/ca-mc/male7-750_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/male7-750.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male8-747\tSM:male8-747\tLB:male8-747\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male8-747_1.fq.gz ../reseq-data/ca-mc/male8-747_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/male8-747.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male9-748\tSM:male9-748\tLB:male9-748\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male9-748_1.fq.gz ../reseq-data/ca-mc/male9-748_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/male9-748.bam

方法二(推薦):用bash腳本循環(huán)處理實現(xiàn),腳本如下:

#!/bin/bash

# 循環(huán)處理每個 *_1.fq.gz 文件
fq_directory=../reseq-data/ca-mc
ls $fq_directory/*1.fq.gz | while read i;
do
        base=$(basename $i _1.fq.gz)
    r2=${i%1.fq.gz}2.fq.gz
    # 將 BWA 映射命令追加到比對列表文件
        echo "/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:$base\tSM:$base\tLB:$base\tPL:ILLUMINA' caYY.fa $i $r2 | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24  - > bam/$base.bam "  >> mapping.cmd.list

done

#3.用PareFly軟件批量并行這些比對命令

mkdir bam  ## 新建一個目錄用于放置所有樣本的比對后的.bam文件
ParaFly -c ./mapping.cmd.list -CPU 64

2.snp calling

相關(guān)軟件:
picard,gatk,samtools。
picard主頁:https://github.com/broadinstitute/picard
gatk主頁:https://github.com/broadinstitute/gatk

## picard手動安裝
wget https://github.com/broadinstitute/picard/releases/download/3.1.1/picard.jar
chmod 777 java -jar picard.jar
java -jar picard.jar
## gatk4手動安裝
wget https://github.com/broadinstitute/gatk/releases/download/4.2.0.0/gatk-4.2.0.0.zip
unzip gatk-4.2.0.0.zip
cd gatk-4.2.0.0/
gatk --help

#1.用Picard生成.dict文件
用Picard的CreateSequenceDictionary命令生成dict字典文件,這個文件是一個index,GATK很多過程都需要。

java -jar /newlustre/home/jfgui/Wangtao/software/picard.jar  CreateSequenceDictionary R=caYY.fa O=caYY.dict

#2.用picard的MarkDuplicates命令標記重復(fù)bam文件
#3.samtolols對以標記重復(fù)的bam文件建立引索
#4.用gatk的HaplotypeCaller命令開始call snp
注1.因為picard是個java軟件,所以可能在運行這個軟件會出現(xiàn)臨時內(nèi)存不足的報錯,我這里遇到了就指定了臨時文件的位置TMP_DIR=。
注2.gatk HaplotypeCaller這個過程非常久,可能是軟件不支持多線程跑得很慢;可用Sentieon替代,但該軟件并不開源。
注3.本來也是想用parafly并行跑這些步驟的,但中間老是報錯,就分開跑了,同樣地用一個bash生成所有樣本的命令,腳本如下

#!/bin/bash
ls bam/*.bam | while read i;
do
        base=$(basename $i .bam)
        echo "java -jar /newlustre/home/jfgui/Wangtao/software/picard.jar MarkDuplicates I=$i O=bam/$base.markdup.bam M=bam/$base.markdup.metrics.txt TMP_DIR=/newlustre/home/jfgui/Wangtao/ca/workdir/" >> $base.call.sh
        echo "/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools index bam/$base.markdup.bam" >> $base.call.sh
        echo "/newlustre/home/jfgui/Wangtao/software/gatk-4.2.0.0/gatk HaplotypeCaller -R caYY.fa -I bam/$base.markdup.bam -O gvcf/$base.g.vcf.gz -ERC GVCF" >> $base.call.sh

 done

隨后依次對每個樣品進行snp calling,下面是其中一個樣本calling命令

java -jar /newlustre/home/jfgui/Wangtao/software/picard.jar MarkDuplicates I=bam/female2-733.bam O=bam/female2-733.markdup.bam M=bam/female2-733.markdup.metrics.txt TMP_DIR=/newlustre/home/jfgui/Wangtao/ca/workdir/
/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools index bam/female2-733.markdup.bam
mkdir gvcf
/newlustre/home/jfgui/Wangtao/software/gatk-4.2.0.0/gatk HaplotypeCaller -R caYY.fa -I bam/female2-733.markdup.bam -O gvcf/female2-733.g.vcf.gz -ERC GVCF

#########
所有樣本運行完成后,查看中間文件,最后snp calling的結(jié)果是每一個樣本都生成一個gvcf文件
$ll bam
總用量 1.4T
-rw-rw-r-- 1 jfgui jfgui  37G 11月 25 14:29 female1-728.bam
-rw-rw-r-- 1 jfgui jfgui  38G 11月 25 19:36 female1-728.markdup.bam
-rw-rw-r-- 1 jfgui jfgui 4.6M 11月 25 19:41 female1-728.markdup.bam.bai
-rw-rw-r-- 1 jfgui jfgui 3.4K 11月 25 19:36 female1-728.markdup.metrics.txt
-rw-rw-r-- 1 jfgui jfgui  36G 11月 25 13:41 female2-733.bam
-rw-rw-r-- 1 jfgui jfgui  36G 11月 26 23:50 female2-733.markdup.bam
-rw-rw-r-- 1 jfgui jfgui 4.6M 11月 26 23:55 female2-733.markdup.bam.bai
-rw-rw-r-- 1 jfgui jfgui 3.4K 11月 26 23:50 female2-733.markdup.metrics.txt
-rw-rw-r-- 1 jfgui jfgui  39G 11月 25 14:38 female3-735.bam
-rw-rw-r-- 1 jfgui jfgui  40G 11月 27 00:12 female3-735.markdup.bam
-rw-rw-r-- 1 jfgui jfgui 4.6M 11月 27 00:17 female3-735.markdup.bam.bai
-rw-rw-r-- 1 jfgui jfgui 3.4K 11月 27 00:12 female3-735.markdup.metrics.txt
-rw-rw-r-- 1 jfgui jfgui  40G 11月 25 14:39 female4-736.bam
-rw-rw-r-- 1 jfgui jfgui  40G 11月 27 00:12 female4-736.markdup.bam
-rw-rw-r-- 1 jfgui jfgui 4.6M 11月 27 00:17 female4-736.markdup.bam.bai
-rw-rw-r-- 1 jfgui jfgui 3.4K 11月 27 00:12 female4-736.markdup.metrics.txt
-rw-rw-r-- 1 jfgui jfgui  39G 11月 25 14:42 female5-737.bam
-rw-rw-r-- 1 jfgui jfgui  40G 11月 27 00:11 female5-737.markdup.bam
-rw-rw-r-- 1 jfgui jfgui 4.6M 11月 27 00:16 female5-737.markdup.bam.bai
-rw-rw-r-- 1 jfgui jfgui 3.4K 11月 27 00:11 female5-737.markdup.metrics.txt  .......

$ll gvcf
總用量 89G
-rw-rw-r-- 1 jfgui jfgui 4.8G 11月 30 17:24 female1-728.g.vcf.gz
-rw-rw-r-- 1 jfgui jfgui 2.7M 11月 30 17:24 female1-728.g.vcf.gz.tbi
-rw-rw-r-- 1 jfgui jfgui 5.5G 12月  1 03:20 female2-733.g.vcf.gz
-rw-rw-r-- 1 jfgui jfgui 2.7M 12月  1 03:20 female2-733.g.vcf.gz.tbi
-rw-rw-r-- 1 jfgui jfgui 4.9G 12月  1 04:00 female3-735.g.vcf.gz
-rw-rw-r-- 1 jfgui jfgui 2.7M 12月  1 04:00 female3-735.g.vcf.gz.tbi
-rw-rw-r-- 1 jfgui jfgui 4.7G 12月  1 00:38 female4-736.g.vcf.gz
-rw-rw-r-- 1 jfgui jfgui 2.7M 12月  1 00:38 female4-736.g.vcf.gz.tbi
-rw-rw-r-- 1 jfgui jfgui 4.8G 12月  1 05:09 female5-737.g.vcf.gz
-rw-rw-r-- 1 jfgui jfgui 2.7M 12月  1 05:09 female5-737.g.vcf.gz.tbi .......

3. joint calling

相關(guān)軟件:
gatk,bcftools。
bcftools主頁: https://github.com/samtools/bcftools

手動安裝
wget https://github.com/samtools/bcftools/releases/download/1.18/bcftools-1.18.tar.bz2
tar -xvjf bcftools-1.18.tar.bz2
cd bcftools-1.18
make
make install

#1.用gatk的CombineGVCFs命令將所有樣本的gvcf合并
#2.用gatk的GenotypeGVCFs命令對合并后單個gvcf進行聯(lián)合變異檢測,生成vcf文件

ls $(pwd)/gvcf/*.g.vcf.gz > gvcf.list  ## 先生成一個所有樣本的gvcf的路徑文件
mkdir vcf
/newlustre/home/jfgui/Wangtao/software/gatk-4.2.0.0/gatk CombineGVCFs -R caYY.fa --variant gvcf.list -O vcf/merge.g.vcf.gz
/newlustre/home/jfgui/Wangtao/software/gatk-4.2.0.0/gatk GenotypeGVCFs -R caYY.fa -V vcf/merge.g.vcf.gz -O vcf/merge.vcf.gz

#3.對vcf文件進行初步過濾
這一步需要對vcf文件格式有一個了解,進而可以按照自己的需求過濾掉”低質(zhì)量“的變異位點,有很多說得很詳細的文章了,這里不做贅述。此外,很多軟件都能對vcf文件坐過濾可以根據(jù)自己喜愛來,這里用bcftools。

/newlustre/home/jfgui/Wangtao/software/bcftools-1.18/bcftools view -v snps -m2 -M2 -Oz --threads 24 -o vcf/merge.snp.vcf.gz  vcf/merge.vcf.gz  # -v只保留snp變異信息
/newlustre/home/jfgui/Wangtao/software/bcftools-1.18/bcftools view -i 'MAF >= 0.05 && F_MISSING <= 0.2 && INFO/DP > 4' vcf/merge.snp.vcf.gz -Oz -o vcf/merge.filtered.snp.vcf.gz  # -i保留條件參數(shù)

4. snp篩選并作圖

相關(guān)軟件:
python3,bedtools,samtools,R。
python和R后面我從頭學(xué)習(xí)這兩門編程語言再仔細記錄下吧,現(xiàn)在這里就會用腳本并能簡單修改就行,這里用到服務(wù)器的python腳本篩選雄性特異的SNP以及用本地的R/Rstdio畫圖。
bedtools主頁: https://github.com/arq5x/bedtools2

## bedtools自動安裝
conda install -c bioconda bedtools
## bedtools手動安裝
wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools-2.30.0.tar.gz
tar -xvzf bedtools-2.30.0.tar.gz
cd bedtools-2.30.0
make

#1.用一個python腳本篩出符合條件的snp
我們這里使用YY基因組參考的,因此篩出的SNP的基因組需要滿足:雄魚基因型為0/1,而雌魚的基因型為1/1。我這里用的師兄寫的一個python腳本SexFinder.py,當(dāng)前能力確實還不足以自己寫python腳本或R腳本??。這個腳本把≥8個雄魚樣本的基因型為0/1同時滿足≥8個雌魚的基因型為1/1所有snp挑出了,具體內(nèi)容如下:

import sys
import gzip


def filter_vcf(femalelist, malelist, vcf):
    female = [line.strip() for line in open(femalelist)]
    male = [line.strip() for line in open(malelist)]
    head = ""
    with gzip.open(vcf, "rt", encoding="utf-8") as txt:
        for line in txt:
            if not line.startswith("#"):
                if not head:
                    header = header.strip().split()
                    femalePos = []
                    malePos = []
                    for index, value in enumerate(header[9:], start=9):
                        if value in female:
                            femalePos.append(index)
                        elif value in male:
                            malePos.append(index)
                    head = "done"

                
                line = line.strip().split()
                p = line[6]
                maleP = 0
                femaleP = 0

                for i in malePos:
                    sample = line[i].split(':')
                    tp = sample[0]
                    # dp = int(sample[2])
                    if (tp == "0/1") or (tp == "0|1"):
                        maleP += 1
                for i in femalePos:
                    sample = line[i].split(':')
                    tp = sample[0]
                    # dp = int(sample[2])
                    if (tp == "1/1") or (tp == "1|1"):
                        femaleP += 1

                if (maleP >= 8) and (femaleP >= 8):
                    print("\t".join(line))
            else:
                header = line


def main():
    if len(sys.argv) != 4:
        print('''Usage: for REF YY python3 SexFinder.py <femalelist> <malelist> <vcf>
            for REF XX python3 SexFinder.py <malelist> <femalelist> <vcf>''')
        sys.exit(1)
    filter_vcf(sys.argv[1], sys.argv[2], sys.argv[3])


if __name__ == "__main__":
    main()

運行這個腳本之前需要將雌雄樣本名分別寫入female.listmale.list中:

## 基于前面的sample.list生成female.list和male.list文件
head -n 9 sample.list > female.list
tail -n 9 sample.list > male.list
## 運行python腳本
mkdir sexfinder
python3 SexFinder.py  female.list  male.list  vcf/merge.filtered.snp.vcf.gz  >  sexfinder/sex.snp.vcf  

#2.將篩好vcf文件轉(zhuǎn)換成適合畫圖的bed文件
上一步我們拿到了我們包含雄性特異snp相應(yīng)的vcf文件,但是vcf文件里的信息太多了,事實上我們畫圖只需要snp的位置信息就行了,因此我們只需要保留前兩列染色體號CHROM和 具體位置 POS就夠了;此外為了直觀表現(xiàn)雄性特異snp集中富集在哪里,我們對染色體畫窗口,窗口的位置作為橫坐標,并對每個窗口數(shù)計數(shù),每個窗口內(nèi)包含的snp數(shù)量為縱坐標,以此來畫一個曼哈頓圖。

## 保留vcf文件的前兩列,但打印3列信息,第3列和第2列的信息一致均為snp的位置,這是為了后面取交集用。
awk '{print $1"\t"$2"\t"$2}' sexfinder/sex.snp.vcf > sexfinder/sex.snp.bed
## 對參考基因組建一個samtools faidx引索,其實前面做過了,這里可不做。
/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools faidx    caYY.fa
##  基于caYY.fa.fai引索,提取其以Chr開頭的行的前兩個字段,其實也就是生成2兩信息,染色體編號及其長度。
grep "^Chr" caYY.fa.fai | cut -f 1,2 > sexfinder/caYY.fa.size
## 利用bedtools makewindows工具對染色體依次畫50k大小的窗口
/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/bedtools makewindows -g sexfinder/caYY.fa.size -w 50000 > sexfinder/caYY.fa.50000.bed
## 利用bedtools  intersect 工具對兩個bed文件取范圍交集并計數(shù)。-a 后接染色體窗口文件,-b后接snp坐標信息文件,-wa以a文件為原始條目,-c對b文件計數(shù)。
/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/bedtools intersect -a sexfinder/caYY.fa.50000.bed -b sexfinder/sex.snp.bed -wa -c  > sexfinder/sex.snp.50000.bed

#3.將上一步生成的以窗口計數(shù)snp的bed文件作為輸入在R畫圖
將上一步生成的以窗口計數(shù)snp的bed文件作為導(dǎo)入R中繪制曼哈頓圖,也是師兄給的R腳本,按照自己的需求修改圖片參數(shù)就行。

library(ggplot2)
library(dplyr)
library(readr)
library(stringr)
library(ggpubr)

df <- read.delim("C:/Users/WIIT/Documents/Rworkdir/sex.snp.50000.bed" , header = F)

unique(df$V1)

df$V2 <- df$V2/1000000
df$V3 <- df$V3/1000000

df$V1 <- str_replace(df$V1 , "Chr0" , "") %>% str_replace( "Chr" , "")

df$V1 <- factor(df$V1 , levels = paste0(rep(seq(1,25) , each = 2), rep(c("A","B") , 25)))


df.tmp <- df %>% 
  group_by(V1) %>% 
  summarise(chr_len=max(V3)) %>% 
  mutate(tot=cumsum(chr_len)-chr_len) %>%
  dplyr::select(-chr_len) %>%
  left_join(df, by = c('V1' = 'V1')) %>%
  arrange(V1, V3) %>%
  mutate( BPcum=V3+tot)
axisdf <- df.tmp %>% group_by(V1) %>% summarize(center=( max(BPcum) + min(BPcum) ) / 2 )

# comment the line below if you want to show all chromosome on x-axis
axisdf <- axisdf %>% filter(as.numeric(V1)%%2==1)


axisdf_tick <- df.tmp %>% group_by(V1) %>% summarize(center=( max(BPcum) + min(BPcum) ) / 2 )

color <- c("grey", "black")

manhattan <- ggplot(df.tmp, aes(x=BPcum, y=V4)) +
  geom_point(aes(color=as.factor(V1)),  size=0.5) +
  scale_color_manual(values = rep(color , 25)) +
  scale_x_continuous( label = axisdf$V1, breaks= axisdf$center ) + 
  theme_pubr(
    legend = "none"
  ) + 
  theme(axis.title = element_text(size = 10),
        axis.text = element_text(size = 7)) + 
  xlab("Chromosome") + 
  ylab("Male-specific SNPs count")

ggsave("C:/Users/WIIT/Documents/Rworkdir/manhattan.pdf" ,plot = manhattan, width = 10 , height = 3)

最終成圖:


其他相關(guān)推薦

VCF文件參數(shù)解讀 - 簡書 (jianshu.com)
討厭又迷人的reads去重復(fù) - 簡書 (jianshu.com)
bedtools intersect用法 (intersectBed) - 簡書 (jianshu.com)

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容