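KneadData is a quality-control tool for metagenomic sequencing data. Its main functions are filtering reads with Trimmomatic and removing host sequences by aligning reads to the host genome with bowtie2. Here we use it to estimate how much of a metagenomic sequencing dataset comes from the human genome.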
Install kneaddata with conda
conda install kneaddata
List the databases that can be downloaded and used directly
kneaddata_database --available
KneadData Databases ( database : build = location )
human_genome : bmtagger = http://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_BMTagger_v0.1.tar.gz
human_genome : bowtie2 = http://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_Bowtie2_v0.1.tar.gz
mouse_C57BL : bowtie2 = http://huttenhower.sph.harvard.edu/kneadData_databases/mouse_C57BL_6NJ_Bowtie2_v0.1.tar.gz
human_transcriptome : bowtie2 = http://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_hg38_transcriptome_Bowtie2_v0.1.tar.gz
ribosomal_RNA : bowtie2 = http://huttenhower.sph.harvard.edu/kneadData_databases/SILVA_128_LSUParc_SSUParc_ribosomal_RNA_v0.1.tar.gz
Download the human genome database
kneaddata_database --download human_genome bowtie2 .
Download URL: http://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_Bowtie2_v0.1.tar.gz
Downloading file of size: 3.44 GB
Show the help message
kneaddata -h
usage: kneaddata [-h] [--version] [-v] -i INPUT -o OUTPUT_DIR
[-db REFERENCE_DB] [--bypass-trim]
[--output-prefix OUTPUT_PREFIX] [-t <1>] [-p <1>]
[-q {phred33,phred64}] [--run-bmtagger] [--run-trf]
[--run-fastqc-start] [--run-fastqc-end] [--store-temp-output]
[--remove-intermediate-output] [--cat-final-output]
[--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--log LOG]
[--trimmomatic TRIMMOMATIC_PATH] [--max-memory MAX_MEMORY]
[--trimmomatic-options TRIMMOMATIC_OPTIONS]
[--bowtie2 BOWTIE2_PATH] [--bowtie2-options BOWTIE2_OPTIONS]
[--no-discordant] [--cat-pairs] [--reorder] [--serial]
[--bmtagger BMTAGGER_PATH] [--trf TRF_PATH] [--match MATCH]
[--mismatch MISMATCH] [--delta DELTA] [--pm PM] [--pi PI]
[--minscore MINSCORE] [--maxperiod MAXPERIOD]
[--fastqc FASTQC_PATH]
KneadData
optional arguments:
-h, --help show this help message and exit
-v, --verbose additional output is printed
global options:
--version show program's version number and exit
-i INPUT, --input INPUT
input FASTQ file (add a second argument instance to run with paired input files)
-o OUTPUT_DIR, --output OUTPUT_DIR
directory to write output files
-db REFERENCE_DB, --reference-db REFERENCE_DB
location of reference database (additional arguments add databases)
--bypass-trim bypass the trim step
--output-prefix OUTPUT_PREFIX
prefix for all output files
[ DEFAULT : OUTPUT_DIR/$SAMPLE_kneaddata.log ]
trimmomatic arguments:
--trimmomatic TRIMMOMATIC_PATH
path to trimmomatic
[ DEFAULT : $PATH ]
--max-memory MAX_MEMORY
max amount of memory
[ DEFAULT : 500m ]
--trimmomatic-options TRIMMOMATIC_OPTIONS
options for trimmomatic
[ DEFAULT : SLIDINGWINDOW:4:20 MINLEN:70 ]
MINLEN is set to 70 percent of total input read length
bowtie2 arguments:
--bowtie2 BOWTIE2_PATH
path to bowtie2
[ DEFAULT : $PATH ]
--bowtie2-options BOWTIE2_OPTIONS
options for bowtie2
[ DEFAULT : --very-sensitive ]
--no-discordant do not include discordant alignments for pairs (ie one of the two pairs aligns)
[ DEFAULT : Discordant alignments are included ]
--cat-pairs concatenate pair files before aligning so reads are aligned as single end
[ DEFAULT : paired reads are aligned as pairs ]
--reorder order the sequences in the same order as the input
[ DEFAULT : With discordant paired alignments sequences are not ordered ]
--serial filter the input in serial for multiple databases so a subset of reads are processed in each database search
bmtagger arguments:
--bmtagger BMTAGGER_PATH
path to BMTagger
[ DEFAULT : $PATH ]
trf arguments:
--trf TRF_PATH path to TRF
[ DEFAULT : $PATH ]
--match MATCH matching weight
[ DEFAULT : 2 ]
--mismatch MISMATCH mismatching penalty
[ DEFAULT : 7 ]
--delta DELTA indel penalty
[ DEFAULT : 7 ]
--pm PM match probability
[ DEFAULT : 80 ]
--pi PI indel probability
[ DEFAULT : 10 ]
--minscore MINSCORE minimum alignment score to report
[ DEFAULT : 50 ]
--maxperiod MAXPERIOD
maximum period size to report
[ DEFAULT : 500 ]
fastqc arguments:
--fastqc FASTQC_PATH path to fastqc
[ DEFAULT : $PATH ]
Run kneaddata on the paired-end reads
kneaddata -i ${name}_1.fastq.gz -i ${name}_2.fastq.gz -o kneaddata_out --trimmomatic Trimmomatic-0.36/ --remove-intermediate-output -db Homo_sapiens_Bowtie2
--remove-intermediate-output: remove intermediate files
-db: bowtie2 index of the human genome
--trimmomatic: location of the Trimmomatic quality-control program
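The flags explained above can be assembled per sample when there is more than one. The following is only a sketch: the helper name `build_kneaddata_cmd` and the sample names are hypothetical, and it assumes paired files are named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`. It prints the commands rather than running them, so they can be inspected first.

```shell
# Hypothetical helper: print (do not run) a kneaddata command for one
# paired-end sample. Assumes <sample>_1.fastq.gz / <sample>_2.fastq.gz naming.
# Arguments: sample prefix, bowtie2 index directory (optional),
#            Trimmomatic directory (optional).
build_kneaddata_cmd() {
    kd_name="$1"
    kd_db="${2:-Homo_sapiens_Bowtie2}"
    kd_trim="${3:-Trimmomatic-0.36/}"
    printf 'kneaddata -i %s_1.fastq.gz -i %s_2.fastq.gz -o kneaddata_out --trimmomatic %s --remove-intermediate-output -db %s\n' \
        "$kd_name" "$kd_name" "$kd_trim" "$kd_db"
}

# Dry run over two made-up sample names: one command is printed per sample.
for sample in sampleA sampleB; do
    build_kneaddata_cmd "$sample"
done
```

Once the printed commands look right, the loop output can be piped to `sh` to execute them. Writing every sample into the same `kneaddata_out` directory keeps all logs in one place for the read-count summary.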
Summarize the read counts after filtering
kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.out
cat kneaddata_read_counts.out
Sample	raw pair1	raw pair2	trimmed pair1	trimmed pair2	trimmed orphan1	trimmed orphan2	decontaminated Homo_sapiens pair1	decontaminated Homo_sapiens pair2	decontaminated Homo_sapiens orphan1	decontaminated Homo_sapiens orphan2	final pair1	final pair2	final orphan1	final orphan2
kneaddata	72577172.0	72577172.0	49961458.0	49961458.0	20031875.0	955031.0	48388320.0	48388320.0	21348792.0	901878.0	48388320.0	48388320.0	21348792.0	901878.0
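In this example, the raw metagenomic data contain 72577172 paired-end reads, 49961458 pairs remain after low-quality filtering, and 48388320 pairs remain after removing human sequences. The human-removal step therefore drops 49961458 - 48388320 = 1573138 pairs, about 3.15% of the quality-filtered pairs.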
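The same arithmetic can be scripted for tables with many samples. This is a minimal sketch, assuming the count table is tab-separated with "trimmed pair1" in column 4 and "decontaminated Homo_sapiens pair1" in column 8, as in the header shown above. The function name `human_pair_summary` is hypothetical. It counts pairs that are no longer kept as pairs after decontamination; some of their mates may reappear in the orphan columns, so treat the figure as an estimate.

```shell
# Hypothetical helper: for each sample, print the number of read pairs lost
# at the human-decontamination step and that number as a percentage of the
# quality-filtered pairs.
# Assumes a tab-separated table: column 4 = trimmed pair1,
# column 8 = decontaminated Homo_sapiens pair1.
human_pair_summary() {
    awk -F'\t' 'NR > 1 {
        removed = $4 - $8
        printf "%s\t%d\t%.2f\n", $1, removed, 100 * removed / $4
    }' "$1"
}

# Demo on the first eight columns of the table from this example.
demo_table="$(mktemp)"
printf 'Sample\traw pair1\traw pair2\ttrimmed pair1\ttrimmed pair2\ttrimmed orphan1\ttrimmed orphan2\tdecontaminated Homo_sapiens pair1\n' > "$demo_table"
printf 'kneaddata\t72577172.0\t72577172.0\t49961458.0\t49961458.0\t20031875.0\t955031.0\t48388320.0\n' >> "$demo_table"
human_pair_summary "$demo_table"   # prints: kneaddata  1573138  3.15 (tab-separated)
rm -f "$demo_table"
```

On a real run the function would be called as `human_pair_summary kneaddata_read_counts.out`.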