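This post is based entirely on the Jianshu article http://www.itdecent.cn/p/2671198ae625. It is kept for personal reference and study only; if you need the material, please visit the original author's post. If it infringes any rights, let me know and I will remove it. Thank you!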
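I. Overview
seqtk is a fast, lightweight tool for processing files in FASTA or FASTQ format. It handles both formats seamlessly and covers the routine operations listed in the command summary below.
II. Procedure
1. Installation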
conda search seqtk
conda install -y seqtk
2. Basic commands and descriptions
$ seqtk
Usage:   seqtk <command> <arguments>
Version: 1.3-r106
Command:
         seq       common transformation of FASTA/Q              # convert and transform FASTA/Q files
         comp      get the nucleotide composition of FASTA/Q     # report the nucleotide composition of FASTA/Q
         sample    subsample sequences                           # randomly subsample sequences
         subseq    extract subsequences from FASTA/Q             # extract subsequences
         fqchk     fastq QC (base/quality summary)               # FASTQ quality control (base and quality summary)
         mergepe   interleave two PE FASTA/Q files               # interleave two paired-end FASTA/Q files: output record 1 is the first read of file 1, output record 2 is the first read of file 2, and so on
         trimfq    trim FASTQ using the Phred algorithm          # trim FASTQ reads with the Phred algorithm
         hety      regional heterozygosity                       # regional heterozygosity
         gc        identify high- or low-GC regions              # identify regions with high or low GC content
         mutfa     point mutate FASTA at specified positions     # introduce point mutations into a FASTA at the given positions
         mergefa   merge two FASTA/Q files                       # merge two FASTA/Q files
         famask    apply a X-coded FASTA to a source FASTA       # apply an X-coded FASTA as a mask to a source FASTA
         dropse    drop unpaired from interleaved PE FASTA/Q     # drop unpaired reads from an interleaved paired-end FASTA/Q
         rename    rename sequence names                         # rename sequences
         randbase  choose a random base from hets                # pick a random base at heterozygous sites
         cutN      cut sequence at long N                        # cut sequences at long runs of N
         listhet   extract the position of each het              # list the position of each heterozygous site
3. Usage
3.1 seqtk seq
$ seqtk seq
Usage:   seqtk seq [options] <in.fq>|<in.fa>
Options: -q INT    mask bases with quality lower than INT [0]           # mask bases whose quality is below INT; with the default -n 0 they become lowercase [acgt]. Default INT is 0
         -X INT    mask bases with quality higher than INT [255]        # mask bases whose quality is above INT. Default INT is 255
         -n CHAR   masked bases converted to CHAR; 0 for lowercase [0]  # replace masked bases with CHAR; 0 (default) writes the original base in lowercase. Use together with -q or -X
         -l INT    number of residues per line; 0 for 2^32-1 [0]        # wrap output at INT residues per line; 0 (default) keeps each sequence on a single line
         -Q INT    quality shift: ASCII-INT gives base quality [33]     # quality offset: ASCII code minus INT gives the base quality, e.g. ASCII-33
         -s INT    random seed (effective with -f) [11]                 # random seed, used together with -f. Default 11
         -f FLOAT  sample FLOAT fraction of sequences [1]               # randomly keep a FLOAT fraction (e.g. 0.5) of the sequences, seeded by -s. Default 1 keeps every sequence
         -M FILE   mask regions in BED or name list FILE [null]         # FILE is a BED file or a list of names. With BED, bases in the given intervals are masked (lowercase [acgt] by default); with a name list, the whole sequence of each listed ID is masked. Default is none
         -L INT    drop sequences with length shorter than INT [0]      # drop sequences shorter than INT. Default 0
         -F CHAR   fake FASTQ quality []                                # use CHAR as a fake quality value for every base
         -c        mask complement region (effective with -M)           # mask the complement, i.e. everything outside the -M regions
         -r        reverse complement                                   # output the reverse complement
         -A        force FASTA output (discard quality)                 # force FASTA output; -a also works
         -C        drop comments at the header lines                    # remove comments from header lines
         -N        drop sequences containing ambiguous bases            # drop sequences that contain ambiguous bases such as N
         -1        output the 2n-1 reads only                           # output only the odd-numbered (2n-1) reads
         -2        output the 2n reads only                             # output only the even-numbered (2n) reads
         -V        shift quality by '(-Q) - 33'                         # shift quality values by (-Q) - 33
         -U        convert all bases to uppercases                      # convert all bases to uppercase
         -S        strip of white spaces in sequences                   # strip whitespace characters from sequences
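As a minimal sketch of the options above (assuming seqtk is on the PATH; the demo.* file names and the two tiny reads are made up for illustration), a few common seqtk seq calls could look like this:

```shell
# Build a tiny two-read FASTQ file (hypothetical demo data).
printf '@r1\nACGTT\n+\nIIIII\n@r2\nGGC\n+\nIII\n' > demo.fq

# FASTQ -> FASTA; the quality lines are discarded.
seqtk seq -A demo.fq > demo.fa

# Reverse-complement every sequence.
seqtk seq -r demo.fa > demo.rc.fa

# Keep only sequences that are at least 5 bp long.
seqtk seq -L 5 demo.fa > demo.long.fa

# Wrap sequence lines at 3 residues per line.
seqtk seq -l 3 demo.fa > demo.wrapped.fa
```

Each call writes to standard output, so the options can also be combined in a single invocation or chained with pipes.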
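3.2 seqtk comp
$ seqtk comp
Usage: seqtk comp [-u] [-r in.bed] <in.fa>    # report the base composition of a FASTA file; with -r, report it only for the intervals listed in the BED file
Output format: chr, length, #A, #C, #G, #T, #2, #3, #4, #CpG, #tv, #ts, #CpG-ts
# The first six output columns are: sequence name, sequence length, and the counts of A, C, G and T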
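A small sketch of reading that output, assuming seqtk is on the PATH; the file name comp_demo.fa and its single sequence are hypothetical demo data:

```shell
# One 5 bp sequence with two A's and one each of C, G, T.
printf '>s1\nAACGT\n' > comp_demo.fa

# Keep only the first six tab-separated columns:
# name, length, #A, #C, #G, #T
seqtk comp comp_demo.fa | cut -f 1-6
```

The remaining columns are left out here because the source does not describe them.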
3.3 seqtk sample
$ seqtk sample
Usage:   seqtk sample [-2] [-s seed=11] <in.fa> <frac>|<number>
# Randomly subsample sequences, e.g. seqtk sample in.fq NUM (works for FASTA and FASTQ)
Options: -s INT    RNG seed [11]                                              # random seed. Default 11
         -2        2-pass mode: twice as slow but with much reduced memory   # read the input twice: about twice as slow, but uses much less memory
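For paired-end data, a common pattern is to sample both files with the same seed so that mates stay in sync. The sketch below assumes seqtk is on the PATH; the pe_1.fq / pe_2.fq names and their four reads are invented for the demo:

```shell
# Two tiny paired files with four reads each (hypothetical demo data).
for i in 1 2 3 4; do printf '@read%s\nACGT\n+\nIIII\n' "$i"; done > pe_1.fq
for i in 1 2 3 4; do printf '@read%s\nTGCA\n+\nIIII\n' "$i"; done > pe_2.fq

# Draw 2 reads from each file; the identical seed selects the same records.
seqtk sample -s 100 pe_1.fq 2 > sub_1.fq
seqtk sample -s 100 pe_2.fq 2 > sub_2.fq

# A value below 1 is read as a fraction rather than a count.
seqtk sample -s 100 pe_1.fq 0.5 > half_1.fq
```

For very large inputs, adding -2 trades speed for memory, as noted in the option list.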
3.4 seqtk subseq
$ seqtk subseq
Usage:   seqtk subseq [options] <in.fa> <in.bed>|<name.list>
# Extract the sequences named in name.list, or the regions given in a BED file
Options: -t        TAB delimited output          # write tab-delimited output
         -l INT    sequence line length [0]      # wrap output sequences at INT characters per line
Note: Use 'samtools faidx' if only a few regions are intended.    # if you only need a few regions, use 'samtools faidx' instead
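The two input styles can be sketched as follows, assuming seqtk is on the PATH; ref.fa, names.lst and regions.bed are hypothetical file names with made-up content:

```shell
# A two-sequence reference (hypothetical demo data).
printf '>s1\nACGTACGT\n>s2\nGGGGCCCC\n' > ref.fa

# Whole sequences by name: one sequence name per line.
printf 's2\n' > names.lst
seqtk subseq ref.fa names.lst > by_name.fa

# Regions by BED interval: 0-based start, end-exclusive.
printf 's1\t2\t6\n' > regions.bed
seqtk subseq ref.fa regions.bed > by_region.fa
```

Here the BED line selects bases 3 to 6 of s1 in 1-based terms, i.e. GTAC.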
3.5 seqtk fqchk
$ seqtk fqchk
Usage: seqtk fqchk [-q 20] <in.fq>    # report the base composition and quality values at each read position; similar to FastQC, but it prints a plain data table instead of a QC report
Note: use -q0 to get the distribution of all quality values    # use -q0 to get the distribution of all quality values
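A minimal invocation sketch, assuming seqtk is on the PATH; qc_demo.fq is a hypothetical one-read file used only to produce some output:

```shell
# One 4 bp read with mixed quality characters (hypothetical demo data).
printf '@r1\nACGT\n+\nII5#\n' > qc_demo.fq

# Per-position summary using the quality threshold 20.
seqtk fqchk -q 20 qc_demo.fq > qc_q20.txt

# Distribution of every quality value that occurs in the file.
seqtk fqchk -q 0 qc_demo.fq > qc_all.txt
```

The resulting tables are plain text, so they can be inspected directly or loaded into a spreadsheet or script.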
3.6 seqtk mergepe
$ seqtk mergepe
Usage: seqtk mergepe <in1.fq> <in2.fq>
# Interleave the reads of a paired-end run; "pe" stands for paired-end
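The interleaving order can be checked with a small sketch like the one below, assuming seqtk is on the PATH; left.fq and right.fq are hypothetical mate files with two reads each:

```shell
# Mate files with matching read names (hypothetical demo data).
printf '@p1\nAAAA\n+\nIIII\n@p2\nCCCC\n+\nIIII\n' > left.fq
printf '@p1\nTTTT\n+\nIIII\n@p2\nGGGG\n+\nIIII\n' > right.fq

# Interleave: read 1 of left, read 1 of right, read 2 of left, ...
seqtk mergepe left.fq right.fq > interleaved.fq

# Print only the sequence lines to see the order.
awk 'NR % 4 == 2' interleaved.fq
```

The sequence lines should alternate between the two inputs, which is the layout that tools expecting interleaved paired-end input rely on.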
3.7 seqtk trimfq
$ seqtk trimfq
Usage:   seqtk trimfq [options] <in.fq>
Options: -q FLOAT  error rate threshold (disabled by -b/-e) [0.05]                          # use FLOAT as the error-rate threshold for trimming. Cannot be combined with -b/-e. Default 0.05
         -l INT    maximally trim down to INT bp (disabled by -b/-e) [30]                   # never trim a read below INT bp, however low its quality. Cannot be combined with -b/-e. Default 30. The original post illustrates this with a figure in which R2.fastq holds three reads of increasing sequencing quality
         -b INT    trim INT bp from left (non-zero to disable -q/-l) [0]                    # remove INT bases from the left end. Cannot be combined with -q/-l. Default 0
         -e INT    trim INT bp from right (non-zero to disable -q/-l) [0]                   # remove INT bases from the right end. Cannot be combined with -q/-l. Default 0
         -L INT    retain at most INT bp from the 5'-end (non-zero to disable -q/-l) [0]    # keep at most the first INT bases counted from the 5' end
         -Q        force FASTQ output                                                       # force FASTQ output
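The two trimming modes can be sketched as follows, assuming seqtk is on the PATH; trim_demo.fq and its single high-quality read are hypothetical demo data:

```shell
# One 8 bp read with uniformly high quality (hypothetical demo data).
printf '@t1\nACGTACGT\n+\nIIIIIIII\n' > trim_demo.fq

# Fixed trimming: drop 2 bases from the left and 1 from the right.
seqtk trimfq -b 2 -e 1 trim_demo.fq > fixed.fq

# Quality trimming: stricter error-rate threshold, never below 4 bp.
seqtk trimfq -q 0.01 -l 4 trim_demo.fq > qtrim.fq
```

With -b 2 -e 1 the read ACGTACGT becomes GTACG, and the quality string is shortened to match. The -q/-l pair and the -b/-e pair are alternatives, so they are shown as separate commands.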
3.8 seqtk mergefa
$ seqtk mergefa
Usage:   seqtk mergefa [options] <in1.fa> <in2.fa>    # merge two FASTA/Q files
Options: -q INT    quality threshold [0]
         -i        take intersection                  # take the intersection
         -m        convert to lowercase when one of the input base is N
         -r        pick a random allele from het
         -h        suppress hets in the input
3.9 seqtk rename
$ seqtk rename    # rename sequences
Usage: seqtk rename <in.fq> [prefix]
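A short sketch of renaming with a prefix, assuming seqtk is on the PATH; named.fa and the prefix seq_ are hypothetical choices for the demo:

```shell
# Two sequences with arbitrary names (hypothetical demo data).
printf '>contigA\nACGT\n>contigB\nGGCC\n' > named.fa

# Replace the names; each new name starts with the given prefix.
seqtk rename named.fa seq_ > renamed.fa

# Show the new header lines.
grep '>' renamed.fa
```

The new names are expected to be the prefix followed by a running number; keep the original file if the old names are still needed, since the mapping is not written anywhere.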
3.10 seqtk cutN
$ seqtk cutN    # cut sequences at long runs of N
Usage:   seqtk cutN [options] <in.fa>
Options: -n INT    min size of N tract [1000]
         -p INT    penalty for a non-N [10]
         -g        print gaps only, no sequence
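As a sketch of splitting a scaffold at its gaps, assuming seqtk is on the PATH; gaps.fa is a hypothetical 25 bp sequence with a 5 bp run of N, and -n is lowered to 3 so that this short run counts as a gap:

```shell
# 10 bp of sequence, 5 N's, then another 10 bp (hypothetical demo data).
printf '>c1\nACGTACGTACNNNNNACGTACGTAC\n' > gaps.fa

# Split at N runs of at least 3 bp; each piece becomes its own record.
seqtk cutN -n 3 gaps.fa > pieces.fa

# Report only the gap coordinates, without any sequence.
seqtk cutN -n 3 -g gaps.fa > gaps.txt
```

With the default -n 1000 the same input would be left in one piece, because the N run is far shorter than the minimum tract size.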