在之前學習snakemake的基礎上,這里重新又對之前編寫的pipline進行了優(yōu)化,只需要在config.yaml中輸入相應的samples,以及參考基因組路徑,即可以得到測序結果比對的最終count數據。
參考的官方文檔
snakemake+ R/Python script
主要分為以下幾個階段:
1.比對生成sam文件:保存在bwa/文件夾下
2.sam轉為bam: 保存在bam/文件夾下
3. 對bam文件進行sort: 保存在sort/文件夾下
4. 對bam文件進行count: 保存在count/文件夾下
其中還有兩個必需的文件夾:
samples文件夾:保存待分析的測序原始數據
reference文件夾:保存參考基因組
這個是分析的流程圖
流程圖
配置文件config.yaml
samples:
control_R1: samples/control_R1.fq.gz
control_R2: samples/control_R2.fq.gz
treated_R1: samples/treated_R1.fq.gz
treated_R2: samples/treated_R2.fq.gz
reference:
bwaRef: reference/chr19
annotation:
gtf: annotation/mouse.chr.gtf
待分析數據、參考基因組
tree samples/ reference/
samples/
├── control_R1.fq.gz
├── control_R2.fq.gz
├── nohup.out
├── treated_R1.fq.gz
├── treated_R2.fq.gz
├── wget-log
└── wget-log.1
reference/
├── chr19.1.bt2
├── chr19.2.bt2
├── chr19.3.bt2
├── chr19.4.bt2
├── chr19.fa
├── chr19.fa_bowtie2-build.log
├── chr19.fa.fai
├── chr19.rev.1.bt2
├── chr19.rev.2.bt2
├── GRCm38.83.chr19.gtf
└── nohup.out
1.fastqc 質控
rule quality_check:
input:
lambda wildcards: config["samples"][wildcards.sample]
output:
"qc/"
shell:
"fastqc -o {output} {input}"
在這次分析中沒成功,單獨分析也沒成功,可能是我服務器的原因,所以這一步雖然在流程中,但是分析時被跳過了
2. bowtie2對比
rule bwa:
input:
lambda wildcards: config["samples"][wildcards.sample]
output:
"bwa/{sample}.sam"
params:
INDEX= config["reference"]["bwaRef"] #參考基因組路徑
shell:
"bowtie2 -x {params.INDEX} -q {input} -S {output}"
3. samtools轉化sam文件
input:
"bwa/{sample}.sam"
output:
"bam/{sample}.bam"
shell:
"samtools view -Sb {input} > {output}"
4. sort bam文件
rule sort:
input:
"bam/{sample}.bam"
output:
"sort/{sample}_sort.bam"
shell:
"samtools sort {input} -o {output}"
5. 建立bam文件的index
rule index:
input:
"sort/{sample}_sort.bam"
output:
"sort/{sample}.bam.bi"
shell:
"samtools index {input} {output}"
6. htseq-count對bam文件進行count
rule count:
input:
sample="sort/{sample}_sort.bam",
annotation=config["annotation"]["gtf"]
output:
"count/{sample}.count"
shell:
"htseq-count -f bam -r name -s no {input.sample} {input.annotation} > {output}"
7.采用python對各樣本的count結果進行合并,然而這個卻是在整個snakemake 文件的最開始
rule all:
input:
expand("count/{sample}.count",sample=config["samples"]) #進行遍歷,告訴它所有sample的count文件
output:
"count/merge.count" #輸出一個merge之后的文件
run:
import os
import pandas as pd
dm=pd.DataFrame() #建立一個空dataframe
for filename in input: #讀取各個文件的全路徑
df=pd.read_csv(filename,sep="\t",names=["gene",os.path.basename(filename)])
if(any(dm)):#進行合并
dm=pd.merge(dm,df,on="gene")
else:
dm=df
dm.to_csv(output[0],sep='\t')
#將合并之后文件輸出到output路徑,要使用output[0],或者output.index
8.最終整個snakefile文件如下
cat snakefile1
configfile: "config.yaml"
rule all:
input:
expand("count/{sample}.count",sample=config["samples"])
output:
"count/merge.count"
run:
import os
import pandas as pd
dm=pd.DataFrame()
for filename in input:
df=pd.read_csv(filename,sep="\t",names=["gene",os.path.basename(filename)])
if(any(dm)):
dm=pd.merge(dm,df,on="gene")
else:
dm=df
dm.to_csv(output[0],sep='\t')
#rule all:
# input:
# "count/merge.count"
# expand("count/{sample}.count",sample=config["samples"])
rule quality_check:
input:
lambda wildcards: config["samples"][wildcards.sample]
output:
"qc/"
shell:
"fastqc -o {output} {input}"
rule bwa:
input:
lambda wildcards: config["samples"][wildcards.sample]
output:
"bwa/{sample}.sam"
params:
INDEX= config["reference"]["bwaRef"]
shell:
"bowtie2 -x {params.INDEX} -q {input} -S {output}"
rule bam:
input:
"bwa/{sample}.sam"
output:
"bam/{sample}.bam"
shell:
"samtools view -Sb {input} > {output}"
rule sort:
input:
"bam/{sample}.bam"
output:
"sort/{sample}_sort.bam"
shell:
"samtools sort {input} -o {output}"
rule index:
input:
"sort/{sample}_sort.bam"
output:
"sort/{sample}.bam.bi"
shell:
"samtools index {input} {output}"
rule count:
input:
sample="sort/{sample}_sort.bam",
annotation=config["annotation"]["gtf"]
output:
"count/{sample}.count"
shell:
"htseq-count -f bam -r name -s no {input.sample} {input.annotation} > {output}"
生成的結果是這個樣子:
bam/:
total 181M
-rw-r--r-- 1 root root 49M Jul 2 11:20 control_R1.bam
-rw-r--r-- 1 root root 49M Jul 2 11:21 control_R2.bam
-rw-r--r-- 1 root root 42M Jul 2 11:21 treated_R1.bam
-rw-r--r-- 1 root root 43M Jul 2 11:21 treated_R2.bam
bwa/:
total 640M
-rw-r--r-- 1 root root 173M Jul 2 11:16 control_R1.sam
-rw-r--r-- 1 root root 172M Jul 2 11:15 control_R2.sam
-rw-r--r-- 1 root root 148M Jul 2 11:18 treated_R1.sam
-rw-r--r-- 1 root root 148M Jul 2 11:17 treated_R2.sam
count/:
total 5.3M
-rw-r--r-- 1 root root 965K Jul 2 11:30 control_R1.count
-rw-r--r-- 1 root root 965K Jul 2 11:28 control_R2.count
-rw-r--r-- 1 root root 1.5M Jul 3 20:27 merge.count
-rw-r--r-- 1 root root 965K Jul 2 11:36 treated_R1.count
-rw-r--r-- 1 root root 965K Jul 2 11:33 treated_R2.count
sort/:
total 131M
-rw-r--r-- 1 root root 35M Jul 2 11:25 control_R1_sort.bam
-rw-r--r-- 1 root root 35M Jul 2 11:25 control_R2_sort.bam
-rw-r--r-- 1 root root 31M Jul 2 11:25 treated_R1_sort.bam
-rw-r--r-- 1 root root 31M Jul 2 11:25 treated_R2_sort.bam
嗯,還算比較滿意,接下來會繼續(xù)完善這個流程。