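seqtk usage guide

This post is based entirely on the Jianshu article at http://www.itdecent.cn/p/2671198ae625 and is intended for personal reference and study only. Please visit the original author's post if you need more; if this infringes any rights, let me know and I will remove it. Thanks!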

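I. Principle

seqtk is a fast, lightweight tool for processing files in FASTA or FASTQ format. It handles both formats seamlessly and covers everyday tasks such as format conversion, subsampling, and subsequence extraction.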

II. Procedure

1. Installation

conda search seqtk
conda install -y seqtk

2. Basic commands and descriptions


$seqtk 

Usage:   seqtk <command> <arguments>
Version: 1.3-r106

Command:
         seq       common transformation of FASTA/Q  # convert and transform FASTA/Q files
         comp      get the nucleotide composition of FASTA/Q  # report the base composition
         sample    subsample sequences  # draw a random subset of the sequences
         subseq    extract subsequences from FASTA/Q  # extract sequences or regions
         fqchk     fastq QC (base/quality summary)  # quality control for FASTQ
         mergepe   interleave two PE FASTA/Q files  # interleave paired-end files: the first output record is record 1 of the first file, the second is record 1 of the second file, and so on
         trimfq    trim FASTQ using the Phred algorithm  # quality-trim FASTQ reads

         hety      regional heterozygosity
         gc        identify high- or low-GC regions  # find regions of high or low GC content
         mutfa     point mutate FASTA at specified positions  # introduce point mutations at the given positions
         mergefa   merge two FASTA/Q files
         famask    apply a X-coded FASTA to a source FASTA  # apply an X-coded mask FASTA to the source FASTA
         dropse    drop unpaired from interleaved PE FASTA/Q  # drop reads without a mate from interleaved paired-end input
         rename    rename sequence names
         randbase  choose a random base from hets  # pick one base at random at each heterozygous (het) site
         cutN      cut sequence at long N  # split sequences at long runs of N
         listhet   extract the position of each het  # list the position of every het site

3. Applications

3.1 seqtk seq

$seqtk seq

Usage:   seqtk seq [options] <in.fq>|<in.fa>

Options: -q INT    mask bases with quality lower than INT [0]  # bases with quality below INT are masked, by default as lowercase letters [acgt]
         -X INT    mask bases with quality higher than INT [255]  # bases with quality above INT are masked
         -n CHAR   masked bases converted to CHAR; 0 for lowercase [0]  # replace masked bases with CHAR; 0 keeps the original base in lowercase. Use together with -q or -X
         -l INT    number of residues per line; 0 for 2^32-1 [0]  # line width of the output sequence; the default 0 means no wrapping. This option does not truncate sequences
         -Q INT    quality shift: ASCII-INT gives base quality [33]  # quality offset: base quality is the ASCII code minus INT
         -s INT    random seed (effective with -f) [11]
         -f FLOAT  sample FLOAT fraction of sequences [1]  # keep a random fraction FLOAT (e.g. 0.5) of the sequences, not of the lines; the seed is set with -s. The default 1 keeps every sequence
         -M FILE   mask regions in BED or name list FILE [null]  # with a BED file, bases in the listed intervals are converted to lowercase; with a name list, the whole sequence of each listed ID is masked
         -L INT    drop sequences with length shorter than INT [0]
         -F CHAR   fake FASTQ quality []  # use CHAR as the quality value of every base in the FASTQ output
         -c        mask complement region (effective with -M)  # mask everything outside the -M regions instead
         -r        reverse complement
         -A        force FASTA output (discard quality)  # -a also works
         -C        drop comments at the header lines
         -N        drop sequences containing ambiguous bases  # e.g. sequences containing N
         -1        output the 2n-1 reads only  # odd-numbered reads
         -2        output the 2n reads only  # even-numbered reads
         -V        shift quality by '(-Q) - 33'  # e.g. with -Q 64, converts Phred+64 qualities to Phred+33
         -U        convert all bases to uppercases
         -S        strip of white spaces in sequences  # removes whitespace inside sequences, not blank lines
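
To make the interplay of -q, -n and -Q concrete, here is a minimal Python sketch of the masking rule described above. It is an illustration only, not seqtk's implementation; the function name `mask_low_quality` and its parameters are hypothetical.

```python
def mask_low_quality(seq, qual, min_q, mask_char=None, offset=33):
    """Mask bases whose Phred quality is below min_q.

    mask_char=None mimics `-n 0` (lowercase the base); any other value
    replaces the base with that character, as `-n CHAR` would.
    offset plays the role of -Q (33 by default).
    """
    out = []
    for base, q in zip(seq, qual):
        phred = ord(q) - offset  # ASCII code minus the offset gives the quality
        if phred < min_q:
            out.append(base.lower() if mask_char is None else mask_char)
        else:
            out.append(base)
    return "".join(out)


# 'I' encodes quality 40 and '#' encodes quality 2 under offset 33.
print(mask_low_quality("ACGT", "I#I#", 20))                 # AcGt
print(mask_low_quality("ACGT", "I#I#", 20, mask_char="N"))  # ANGN
```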

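3.2 seqtk comp

$seqtk comp
Usage:  seqtk comp [-u] [-r in.bed] <in.fa>  # report the base composition of each sequence; with -r, report it for the intervals listed in the BED file instead

Output format: chr, length, #A, #C, #G, #T, #2, #3, #4, #CpG, #tv, #ts, #CpG-ts
# Output columns: sequence name, sequence length, then the counts of A, C, G and T, followed by the remaining fields listed above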
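
The first six output columns can be reproduced with a few lines of Python. This is a sketch under a stated assumption: it counts bases regardless of case, which may differ from what seqtk does when -u is given. The name `base_composition` is hypothetical.

```python
from collections import Counter


def base_composition(name, seq):
    """Return (name, length, #A, #C, #G, #T) for one sequence."""
    counts = Counter(seq.upper())  # assumption: lowercase bases are counted too
    return (name, len(seq), counts["A"], counts["C"], counts["G"], counts["T"])


print(base_composition("chr1", "ACGTNacgtAA"))
```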

3.3 seqtk sample

$seqtk sample 

Usage:   seqtk sample [-2] [-s seed=11] <in.fa> <frac>|<number>
# Randomly subsample sequences from a FASTA/FASTQ file; the last argument is either a fraction or a number of sequences

Options: -s INT       RNG seed [11]  # random seed, default 11
         -2           2-pass mode: twice as slow but with much reduced memory  # slower, but uses much less memory
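
The two forms of the last argument, a fraction or a fixed number, correspond to two different sampling strategies. The Python sketch below shows one common way to implement each, with a seed for reproducibility; it is not a claim about seqtk's internals, and it will not reproduce seqtk's exact selection for a given seed. The function names are hypothetical.

```python
import random


def sample_fraction(records, frac, seed=11):
    """Keep each record independently with probability frac."""
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < frac]


def sample_number(records, k, seed=11):
    """Keep exactly k records (or all, if fewer) using reservoir sampling."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = rec
    return reservoir


reads = [f"read{i}" for i in range(100)]
print(len(sample_number(reads, 10)))
# The same seed always gives the same subset.
print(sample_number(reads, 10, seed=7) == sample_number(reads, 10, seed=7))
```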

3.4 seqtk subseq

$seqtk subseq

Usage:   seqtk subseq [options] <in.fa> <in.bed>|<name.list>
# Extract the sequences named in name.list, or the regions given in in.bed
Options: -t       TAB delimited output
         -l INT   sequence line length [0]  # wrap output sequences at INT characters per line

Note: Use 'samtools faidx' if only a few regions are intended.
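
As an illustration of the name-list mode, the following Python sketch parses FASTA text and keeps only the records whose name appears in a list. It assumes the sequence name is the first whitespace-delimited word of the header; `read_fasta` and `subseq_by_name` are hypothetical helper names, and BED-interval extraction is not covered.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def subseq_by_name(fasta_lines, wanted):
    """Keep the records whose name is in `wanted`, in file order."""
    wanted = set(wanted)
    return [(n, s) for n, s in read_fasta(fasta_lines) if n in wanted]


fasta = [">a first record", "ACGT", "AC", ">b", "GG", ">c", "TT"]
print(subseq_by_name(fasta, ["a", "c"]))  # [('a', 'ACGTAC'), ('c', 'TT')]
```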

3.5 seqtk fqchk

$seqtk fqchk
Usage: seqtk fqchk [-q 20] <in.fq>  # report the base distribution and quality values at each read position; similar to FastQC, except that fqchk outputs plain data while FastQC produces a QC report
Note: use -q0 to get the distribution of all quality values
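
One of the quantities a per-position summary like this is built on is the mean quality at each read position. The Python sketch below computes it for Phred+33 quality strings; it is a simplified illustration rather than a reproduction of the fqchk output, and the function name is hypothetical.

```python
def per_position_mean_quality(quals, offset=33):
    """Mean Phred quality at each position across reads of varying length."""
    totals, counts = [], []
    for qual in quals:
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# 'I' is quality 40, '5' is quality 20, '#' is quality 2 under offset 33.
print(per_position_mean_quality(["II", "5I#"]))  # [30.0, 40.0, 2.0]
```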

3.6 seqtk mergepe

$seqtk mergepe
Usage: seqtk mergepe <in1.fq> <in2.fq>
# Interleave the reads of a paired-end run; "pe" stands for paired-end
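
The interleaving order is easy to see in a small Python sketch. It assumes plain four-line FASTQ records and that both inputs list the mates in the same order; the helper names are hypothetical and this is not seqtk's code.

```python
from itertools import islice


def fastq_records(lines):
    """Yield FASTQ records as lists of four lines."""
    it = iter(lines)
    while True:
        rec = list(islice(it, 4))
        if len(rec) < 4:
            return
        yield rec


def interleave(r1_lines, r2_lines):
    """Alternate records: R1 record 1, R2 record 1, R1 record 2, ..."""
    out = []
    for rec1, rec2 in zip(fastq_records(r1_lines), fastq_records(r2_lines)):
        out.extend(rec1)
        out.extend(rec2)
    return out


r1 = ["@x/1", "AAAA", "+", "IIII", "@y/1", "CCCC", "+", "IIII"]
r2 = ["@x/2", "TTTT", "+", "IIII", "@y/2", "GGGG", "+", "IIII"]
merged = interleave(r1, r2)
print(merged[0::4])  # ['@x/1', '@x/2', '@y/1', '@y/2']
```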

3.7 seqtk trimfq

$seqtk trimfq

Usage:   seqtk trimfq [options] <in.fq>

Options: -q FLOAT    error rate threshold (disabled by -b/-e) [0.05]  # error-rate threshold used as the trimming criterion. Cannot be combined with -b/-e. Default 0.05
         -l INT      maximally trim down to INT bp (disabled by -b/-e) [30]  # never trim a read below INT bp, regardless of quality. Cannot be combined with -b/-e. Default 30. The original post illustrated this option with a figure (R2.fastq contains three reads of increasing sequencing quality)
         -b INT      trim INT bp from left (non-zero to disable -q/-l) [0]  # remove INT bases from the left end. Cannot be combined with -q/-l. Default 0
         -e INT      trim INT bp from right (non-zero to disable -q/-l) [0]  # remove INT bases from the right end. Cannot be combined with -q/-l. Default 0
         -L INT      retain at most INT bp from the 5'-end (non-zero to disable -q/-l) [0]  # keep at most the first INT bases from the 5' end
         -Q          force FASTQ output
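
The -q threshold is an error probability, not a Phred score. The standard relation Q = -10 * log10(p) links the two, so the default 0.05 corresponds to roughly Q13. The Python sketch below shows that conversion and the fixed-length trimming of -b/-e; it does not implement the quality-based trimming algorithm itself, and the function names are hypothetical.

```python
import math


def error_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def trim_fixed(seq, qual, left=0, right=0):
    """Drop `left` bases from the start and `right` from the end (like -b/-e)."""
    end = len(seq) - right
    return seq[left:end], qual[left:end]


print(round(error_to_phred(0.05), 2))          # 13.01
print(trim_fixed("ACGTACGT", "IIIIIIII", 2, 3))  # ('GTA', 'III')
```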

3.8 seqtk mergefa

$seqtk mergefa

Usage: seqtk mergefa [options] <in1.fa> <in2.fa>  # merge two FASTA/Q files

Options: -q INT   quality threshold [0]
         -i       take intersection
         -m       convert to lowercase when one of the input base is N
         -r       pick a random allele from het
         -h       suppress hets in the input
3.9 seqtk rename

$seqtk rename  # rename sequences
Usage: seqtk rename <in.fq> [prefix]

3.10 seqtk cutN

$seqtk cutN  # split sequences at long runs of N

Usage:   seqtk cutN [options] <in.fa>

Options: -n INT    min size of N tract [1000]
         -p INT    penalty for a non-N [10]
         -g        print gaps only, no sequence
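
The idea behind cutting at long N tracts can be sketched in Python with a regular expression. This is a simplification under stated assumptions: it only splits at uninterrupted runs of at least `min_n` N characters and ignores the -p penalty that lets seqtk tolerate non-N bases inside a tract. The function name is hypothetical.

```python
import re


def split_at_n_runs(seq, min_n=1000):
    """Split seq at runs of >= min_n N's; return (start, end, piece) tuples."""
    pattern = re.compile(r"[Nn]{%d,}" % min_n)
    pieces = []
    start = 0
    for m in pattern.finditer(seq):
        if m.start() > start:
            pieces.append((start, m.start(), seq[start:m.start()]))
        start = m.end()
    if start < len(seq):
        pieces.append((start, len(seq), seq[start:]))
    return pieces


# The run of 5 N's is cut out; the run of 2 N's is too short and stays.
print(split_at_n_runs("ACGT" + "N" * 5 + "GGNNTT", min_n=3))
# [(0, 4, 'ACGT'), (9, 15, 'GGNNTT')]
```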