Cutadapt安裝
#使用pip安裝
$ pip install --user --upgrade cutadapt
#使用conda安裝
$ conda install -c bioconda cutadapt
基本命令
Usage:
cutadapt -a ADAPTER options input.fastq
cutadapt -a AACCGGTT input.fastq > output.fastq
For paired-end reads:
cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq
輸入輸出文件格式
.fasta, .fa, .fna, .fastq, .fq, .gz, .bz2, .xz
多核運行(Python 3.3 or later)
-j N
例子:
1.1 修剪reads 5'端adapter
cutadapt -g ADAPTER -O 5 -e 0 -o sample.trimmed.fastq sample.fastq --minimum-length 35
--discard-untrimmed --info-file=reads.adapter.txt
1.2 修剪reads 3'端adapter
cutadapt -a ADAPTER -O 5 -e 0 -o sample.trimmed.fastq sample.fastq --minimum-length 35
--discard-untrimmed --info-file=reads.adapter.txt
2 雙端測序
cutadapt -g ADAPTER_FWD -G ADAPTER_REV -O 5 -e 0 -o sample.trimmed.1.fastq -p sample.trimmed.2.fastq sample.reads_1.fastq sample.reads_2.fastq --minimum-length 35 --discard-untrimmed --info-file=reads.adapter.txt
3 去掉poly-A
cutadapt -a "A{10}" -o output.fastq input.fastq
4 修剪reads 5’ (正數(shù))或3’(負數(shù)) 固定長度
cutadapt -u 5 -o trimmed.fastq reads.fastq
cutadapt -u -7 -o trimmed.fastq reads.fastq
5 修剪低質(zhì)量reads
cutadapt -q 10 -o output.fastq input.fastq
cutadapt -q 15,10 -o output.fastq input.fastq
6 修剪reads 3’ 端堿基使得reads到一定長度(10bp)
cutadapt -l 10 -o output.fastq.gz input.fastq.gz
7 Multiple adapters #多種adapter
cutadapt -g TGAGACACGCA -a AGGCACACAGGG -o output.fastq input.fastq #最佳匹配adapter移除
cutadapt -g file:adapters.fasta -o output.fastq input.fastq
8 Demultiplexing #adapter解析;reads根據(jù)adapter不同輸出到不同的文件
cutadapt -a one=TATA -a two=GCGC -o trimmed-{name}.fastq.gz input.fastq.gz
#result files: demulti-one.fastq.gz, demulti-two.fastq.gz and demulti-unknown.fastq.gz
cutadapt -a file:barcodes.fasta --no-trim --untrimmed-o untrimmed.fastq.gz -o trimmed-{name}.fastq.gz input.fastq.gz
#result files: trimmed-first.1.fastq.gz, trimmed-second.1.fastq.gz, trimmed-unknown.1.fastq.gz and trimmed-first.2.fastq.gz, trimmed-second.2.fastq.gz, trimmed-unknown.2.fastq.gz
9 從一個read修剪多個adapter
cutadapt -g ^ADAPTER -n 5 -o output.fastq.gz input.fastq.gz
常用參數(shù)
-g: #剪切reads 5'端adapter(雙端測序第一條read),加$表示adapter錨定在reads 5'端
-a: #剪切reads 3'端adapter(雙端測序第一條read),加$表示adapter錨定在reads3'端
-O MINLENGTH, --overlap=MINLENGTH #adapter與reads最小overlap,才算成功識別; Default: 3
-m LENGTH, --minimum-length=LENGTH: 根據(jù)最短長度篩選reads;Default: 0
--discard-untrimmed, --trimmed-only #丟掉不包含adapter的reads
-e ERROR_RATE, --error-rate=ERROR_RATE #adapter匹配允許的最大錯配率(錯配/匹配片段長度);Default: 0.1
--no-trim: 不修剪adapter,直接輸出滿足跳進啊的reads
-u LENGTH, --cut=LENGTH: #修剪reads 5'/3'端堿基,正數(shù):從開始除移除堿基;負數(shù):從末尾處移除堿基;
-q [5'CUTOFF,]3'CUTOFF, --quality-cutoff=[5'CUTOFF,]3'CUTOFF: #修剪低質(zhì)量堿基
-l LENGTH, --length=LENGTH: #將reads修剪的最終長度
--trim-n: #修剪reads末端的'N'
-o FILE, --output=FILE: #輸出文件
--info-file=FILE:每條reads和與reads匹配的adapter的信息
--too-short-output=FILE: #為reads長度最小值設(shè)定閾值篩選reads后,要丟棄的部分輸出到文件;長度依據(jù)m值設(shè)定;
--too-long-output=FILE:#為reads長度最大值設(shè)定閾值篩選reads后,要丟棄的部分輸出到文件;長度依據(jù)M值設(shè)定;
--untrimmed-output=FILE: #將沒有adapter未做修剪的reads輸出到一個文件;默認輸出到trimmed reads結(jié)果文件
--max-n=COUNT:#reads中N的數(shù)量,設(shè)定整數(shù)或小數(shù)(N的占比)
雙端測序參數(shù)
-A ADAPTER: #第二條reads 3'adapter
-G ADAPTER:#第二條reads 5'adapter
-U LENGTH: #從第二條reads上修剪的長度
-p FILE, --paired-output=FILE: #第二條reads的輸出結(jié)果
--untrimmed-paired-output=FILE:#第一條reads沒有adapter時,將第二條reads輸出到文件;默認輸出到trimmed reads結(jié)果文件
參數(shù)詳解
$ cutadapt --help
Options:
--version
-h, --help
--debug Print debugging information.
-f FORMAT, --format=FORMAT #文件格式
Input file format:'fasta', 'fastq' or 'sra-fastq'.
Ignored when reading csfasta/qual files.
Default: auto-detect from file name extension.
Finding adapters::
Parameters -a, -g, -b specify adapters to be removed from each read
(or from the first read in a pair if data is paired). If specified
multiple times, only the best matching adapter is trimmed (but see the
--times option). When the special notation 'file:FILE' is used,
adapter sequences are read from the given FASTA file.
-a ADAPTER, --adapter=ADAPTER #剪切reads 3'端adapter,加$表示congreads 3'端第一個堿基(無法adapter部分匹配識別)
Sequence of an adapter ligated to the 3' end (paired
data: of the first read). The adapter and subsequent
bases are trimmed. If a '$' character is appended
('anchoring'), the adapter is only found if it is a
suffix of the read.匹配adapter
-g ADAPTER, --front=ADAPTER #剪切reads 5'端adapter,加$表示congreads 5'端第一個堿基匹配adapter(無法adapter部分匹配識別)
Sequence of an adapter ligated to the 5' end (paired
data: of the first read). The adapter and any
preceding bases are trimmed. Partial matches at the 5'
end are allowed. If a '^' character is prepended
('anchoring'), the adapter is only found if it is a
prefix of the read.
-b ADAPTER, --anywhere=ADAPTER #adapter在 5'和3'端都可能出現(xiàn)時使用,慎用
Sequence of an adapter that may be ligated to the 5'
or 3' end (paired data: of the first read). Both types
of matches as described under -a und -g are allowed.
If the first base of the read is part of the match,
the behavior is as with -g, otherwise as with -a. This
option is mostly for rescuing failed library preparations
-do not use if you know which end your adapter was ligated to!
-e ERROR_RATE, --error-rate=ERROR_RATE #adapter匹配允許的最大錯配率(錯配/匹配片段長度)
Maximum allowed error rate (no. of errors divided by
the length of the matching region). Default: 0.1
--no-indels Allow only mismatches in alignments. Default: allow
both mismatches and indels 禁止adapter發(fā)生Insertions和deletions
-n COUNT, --times=COUNT #從reads行修剪adapter次數(shù)
Remove up to COUNT adapters from each read. Default: 1
-O MINLENGTH, --overlap=MINLENGTH #adapter與reads最小overlap,才算成功識別;Default: 3
If the overlap between the read and the adapter is
shorter than MINLENGTH, the read is not modified.
Reduces the no. of bases trimmed due to random adapter
matches. Default: 3
--match-read-wildcards
Interpret IUPAC wildcards in reads. Default: False
-N, --no-match-adapter-wildcards
Do not interpret IUPAC wildcards in adapters.
--no-trim Match and redirect reads to output/untrimmed-output as
usual, but do not remove adapters. #不修剪adapter輸出reads
--mask-adapter Mask adapters with 'N' characters instead of trimming
them. #識別adapter后,用'N'代替adapter
Additional read modifications:
-u LENGTH, --cut=LENGTH #修剪reads 5'/3'端堿基,正數(shù):從開始除移除堿基;負數(shù):從末尾處移除堿基;
Remove bases from each read (first read only if
paired). If LENGTH is positive, remove bases from the
beginning. If LENGTH is negative, remove bases from
the end. Can be used twice if LENGTHs have different
signs.
--nextseq-trim=3'CUTOFF
NextSeq-specific quality trimming (each read). Trims
also dark cycles appearing as high-quality G bases
(EXPERIMENTAL).
-q [5'CUTOFF,]3'CUTOFF, --quality-cutoff=[5'CUTOFF,]3'CUTOFF #修剪低質(zhì)量堿基
Trim low-quality bases from 5' and/or 3' ends of each
read before adapter removal. Applied to both reads if
data is paired. If one value is given, only the 3' end
is trimmed. If two comma-separated cutoffs are given,
the 5' end is trimmed with the first cutoff, the 3'
end with the second.
--quality-base=QUALITY_BASE
Assume that quality values in FASTQ are encoded as
ascii(quality + QUALITY_BASE). This needs to be set to
64 for some old Illumina FASTQ files. Default: 33
-l LENGTH, --length=LENGTH #將reads修剪的最終長度
Shorten reads to LENGTH. This and the following
modificationsare applied after adapter trimming.
--trim-n Trim N's on ends of reads. #修剪reads末端的'N'
--length-tag=TAG Search for TAG followed by a decimal number in the
description field of the read. Replace the decimal
number with the correct length of the trimmed read.
For example, use --length-tag 'length=' to correct
fields like 'length=123'. #修剪reads后,用此參數(shù)修改reads的長度
--strip-suffix=STRIP_SUFFIX
Remove this suffix from read names if present. Can be
given multiple times.
-x PREFIX, --prefix=PREFIX #為reads名添加前綴,使用{name}添加adapter名字
Add this prefix to read names. Use {name} to insert
the name of the matching adapter.
-y SUFFIX, --suffix=SUFFIX #為reads名添加后綴,使用{name}添加adapter名字
Add this suffix to read names; can also include {name}
Filtering of processed reads:
-m LENGTH, --minimum-length=LENGTH #修建前后,reads短于最短長度時,丟棄這對reads
Discard trimmed reads that are shorter than LENGTH.
Reads that are too short even before adapter removal
are also discarded. In colorspace, an initial primer
is not counted. Default: 0
-M LENGTH, --maximum-length=LENGTH #修建前后,reads長于最大長度時,丟棄這對reads
Discard trimmed reads that are longer than LENGTH.
Reads that are too long even before adapter removal
are also discarded. In colorspace, an initial primer
is not counted. Default: no limit
--max-n=COUNT Discard reads with too many N bases. If COUNT is an
integer, it is treated as the absolute number of N
bases. If it is between 0 and 1, it is treated as the
proportion of N's allowed in a read. #reads中N的數(shù)量,設(shè)定整數(shù)或小數(shù)(N的占比)
--discard-trimmed, --discard #丟棄只有一個adapter的reads
Discard reads that contain an adapter. Also use -O to
avoid discarding too many randomly matching reads!
--discard-untrimmed, --trimmed-only #丟掉不包含adapter的reads
Discard reads that do not contain the adapter.
Output:
--quiet Print only error messages.
-o FILE, --output=FILE #輸出文件
Write trimmed reads to FILE. FASTQ or FASTA format is
chosen depending on input. The summary report is sent
to standard output. Use '{name}' in FILE to
demultiplex reads into multiple files. Default: write
to standard output
--info-file=FILE Write information about each read and its adapter
matches into FILE. #每條reads與adapter的匹配信息
-r FILE, --rest-file=FILE
When the adapter matches in the middle of a read,
write the rest (after the adapter) into FILE.
--wildcard-file=FILE
When the adapter has N bases (wildcards), write
adapter bases matching wildcard positions to FILE.
When there are indels in the alignment, this will
often not be accurate.
--too-short-output=FILE #為reads長度最大值設(shè)定閾值篩選reads后,要丟棄的部分輸出到文件;
Write reads that are too short (according to length
specified by -m) to FILE. Default: discard reads
--too-long-output=FILE #將依據(jù)最長長度篩選reads后,要丟棄的部分輸出到文件;
Write reads that are too long (according to length
specified by -M) to FILE. Default: discard reads
--untrimmed-output=FILE #將沒有adapter未做修剪的reads輸出到一個文件,而不是輸出到trimmed reads結(jié)果文件
Write reads that do not contain any adapter to FILE.
Default: output to same file as trimmed reads
Colorspace options:
-c, --colorspace Enable colorspace mode: Also trim the color that is
adjacent to the found adapter.
-d, --double-encode
Double-encode colors (map 0,1,2,3,4 to A,C,G,T,N).
-t, --trim-primer Trim primer base and the first color (which is the
transition to the first nucleotide)
--strip-f3 Strip the _F3 suffix of read names
--maq, --bwa MAQ- and BWA-compatible colorspace output. This
enables -c, -d, -t, --strip-f3 and -y '/1'.
--no-zero-cap Do not change negative quality values to zero in
colorspace data. By default, they are since many tools
have problems with negative qualities.
-z, --zero-cap Change negative quality values to zero. This is
enabled by default when -c/--colorspace is also
enabled. Use the above option to disable it.
Paired-end options: #雙端測序參數(shù)
The -A/-G/-B/-U options work like their -a/-b/-g/-u counterparts, but
are applied to the second read in each pair.
-A ADAPTER 3' adapter to be removed from second read in a pair. #第二條reads 3'adapter
-G ADAPTER 5' adapter to be removed from second read in a pair. #第二條reads 5'adapter
-B ADAPTER 5'/3 adapter to be removed from second read in a pair.#第二條reads 5'/3'adapter
-U LENGTH Remove LENGTH bases from second read in a pair #從第二條reads上修剪的長度
-p FILE, --paired-output=FILE #第二條reads的輸出結(jié)果
Write second read in a pair to FILE.
--pair-filter=(any|both)
Which of the reads in a paired-end read have to match
the filtering criterion in order for it to be
filtered. Default: any
--interleaved Read and write interleaved paired-end reads.
--untrimmed-paired-output=FILE #第一條reads沒有adapter時,將第二條reads輸出到文件;默認輸出到trimmed reads結(jié)果文件
Write second read in a pair to this FILE when no
adapter was found in the first read. Use this option
together with --untrimmed-output when trimming paired-
end reads. Default: output to same file as trimmed
reads
--too-short-paired-output=FILE #將reads2中太短的reads輸出到文件
Write second read in a pair to this file if pair is
too short. Use together with --too-short-output.
--too-long-paired-output=FILE #將reads2中太長的reads輸出到文件
Write second read in a pair to this file if pair is
too long. Use together with --too-long-output.
Removing adapters
Adapter 移除
| Adapter type | Command-line option |
|---|---|
| 3’ adapter | -a ADAPTER |
| 5’ adapter | -g ADAPTER |
| Anchored 3’ adapter | -a ADAPTER$ |
| Anchored 5’ adapter | -g ^ADAPTER |
| 5’ or 3’ (both possible) | -b ADAPTER |
| Linked adapter | -a ADAPTER1...ADAPTER2 |
| Non-anchored linked adapter | -g ADAPTER1...ADAPTER2 |
Anchored 5’ adapters:adapters錨定在reads開始處,相當(dāng)于adapters就是這條reads的前綴一樣;
Anchored 5’ adapters的頭(a)必須與reads頭匹配,才會修剪,除非允許indel;
Anchored 3’ adapters的尾(r)必須與reads尾匹配,才會修剪,除非允許indel;

Regular 3’ adapters修剪
| Reads | Adapter | Reault |
|---|---|---|
| MYSEQUEN | -a ADAPTER | MYSEQUEN |
| MYSEQUENCEADAP | MYSEQUENCE | |
| MYSEQUENCEADAPTER | MYSEQUENCE | |
| MYSEQUENCEADAPTERSOMETHINGELSE | MYSEQUENCE | |
| ADAPTERSOMETHING | 修剪為0bp,導(dǎo)致無意義,reads保留到output |
Regular 5’ adapters修剪**
| Reads | Adapter | Reault |
|---|---|---|
| ADAPTERMYSEQUENCE | -g ADAPTER | MYSEQUENCE |
| DAPTERMYSEQUENCE | MYSEQUENCE | |
| TERMYSEQUENCE | MYSEQUENCE | |
| SOMETHINGADAPTER | 修剪為0bp,導(dǎo)致無意義,reads保留到output |
Anchored 5’ adapters #錨定5’ adapters修剪
| 識別的barcode | Adapter | |
|---|---|---|
| ADAPTER | -g ^ADAPTER | |
| ADAPT | -g ^ADAPTER | |
| ADA | -g ^ADAPTER |
允許錯配的話,可能會允許插入的發(fā)生:
BADAPTERSOMETHING 也可被-g ^ADAPTER識別,為防止這種現(xiàn)象發(fā)生;可以加參數(shù)--no-indels
Anchored 3’ adapters #錨定3’ adapters修剪
| reads | adapter | result |
|---|---|---|
| MYSEQUENCEADAP | -a ADAPTER$ |
MYSEQUENCEADAP |
| MYSEQUENCEADAPTER | -a ADAPTER$ |
MYSEQUENCE |
| MYSEQUENCEADAPTERSOMETHINGELSE | -a ADAPTER$ |
MYSEQUENCEADAPTERSOMETHINGELSE |
Linked adapters (combined 5’ and 3’ adapter)
本情況下,5’ 的ADAPTER1是an anchored 5’ adapter
-a ADAPTER1...ADAPTER2: #5’ 正常修剪,3’ 只有 5’存在adapter才會發(fā)生修剪;
-a ADAPTER1...ADAPTER2$: #只有 5’和3’ 都存在adapter才會發(fā)生修剪;
cutadapt -a FIRST...SECOND -o output.fastq input.fastq
| reads | adapter | result |
|---|---|---|
| FIRSTMYSEQUENCESECONDEXTRABASES | -a FIRST...SECOND | MYSEQUENCE |
| FIRSTMYSEQUENCESEC | MYSEQUENCE | |
| FIRSTMYSEQUE | MYSEQUENCE | |
| ANOTHERREADSECOND | MYSEQUENCE |
Linked adapters without anchoring
-g ADAPTER1...ADAPTER2
The 5’ adapter is not anchored by default. (So neither the 5’ nor 3’ adapter are anchored.) #5’和3’都不是錨定的
-
Both adapters are required. If one of them is not found, the read is not trimmed. #5’和3’同時存在的,adapter才會被剪切
此模式下,可以丟掉不包含adapter的reads;使用參數(shù):
--discard-untrimmed, --trimmed-only
Discard reads that do not contain the adapter.
Linked adapter statistics
=== Adapter 1 ===
Sequence: AAAAAAAAA...TTTTTTTTTT; Type: linked; Length: 9+10; Trimmed: 3 times; Half matches: 2
只適用于 (non-anchored) 3’ adapters情況下;
Half matches:只有5’ adapters的情況;
5’ or 3’ adapters
-b ADAPTER (or use the longer spelling --anywhere ADAPTER).
adapter允許出現(xiàn)在reads的任何位置;允許adapter降解;允許partially match;如果adapter前面還有堿基,adapter歸于3’ adapters
| Read before trimming | Read after trimming | Detected adapter type |
|---|---|---|
MYSEQUENCEADAPTERSOMETHING |
MYSEQUENCE |
3’ adapter |
MYSEQUENCEADAPTER |
MYSEQUENCE |
3’ adapter |
MYSEQUENCEADAP |
MYSEQUENCE |
3’ adapter |
MADAPTER |
M |
3’ adapter |
ADAPTERMYSEQUENCE |
MYSEQUENCE |
5’ adapter |
PTERMYSEQUENCE |
MYSEQUENCE |
5’ adapter |
TERMYSEQUENCE |
MYSEQUENCE |
5’ adapter |
Error tolerance
3種情況:mismatches, insertions and deletions
-e ERROR_RATE, --error-rate=ERROR_RATE #adapter匹配允許的最大錯配率=錯配/匹配片段長度(adapter全長)
Maximum allowed error rate (no. of errors divided by
the length of the matching region). Default: 0.1
--no-indels:禁止adapter發(fā)生Insertions和deletions
Sequence: 'LONGADAPTER'; Length: 11; Trimmed: 2 times.
No. of allowed errors:
0-9 bp: 0; 10-11 bp: 1
11bp adapter,前9bp需要完全匹配,10-11 bp允許1bp錯配;
40bp adapter,0.1的最大錯誤率,允許最多錯誤4bp,這4bp不是均勻分給reads各個位置的;
Multiple adapter occurrences within a single read
5’ and 3’ adapters均識別最左邊的adapter
Reducing random matches
-O MINLENGTH, --overlap=MINLENGTH #adapter與reads最小overlap,才算成功識別,最少3bp,再少就丟掉reads
Wildcards
使用N,Y,T這種通配符
cutadapt -a ACGTAANNNNTTAGC -o output.fastq input.fastq
Repeated bases in the adapter sequence
可以去除ploy-A
AAAAAAAAAA)等價于A{10}
cutadapt -a "A{100}" -o output.fastq input.fastq
Modifying reads
對reads的修剪
Removing a fixed number of bases #移除一定數(shù)目堿基
-u LENGTH, --cut=LENGTH #修剪reads 5'/3' 堿基,正數(shù)表示從 5'移除堿基,負數(shù)表示從3'移除堿基
cutadapt -u 5 -o trimmed.fastq reads.fastq
Quality trimming
修剪低質(zhì)量堿基,默認phred quality + 33;使用64時,需加參數(shù):--quality-base=64
-q [5'CUTOFF,]3'CUTOFF, --quality-cutoff=[5'CUTOFF,]3'CUTOFF
cutadapt -q 10 -o output.fastq input.fastq #默認是3'進行質(zhì)量修剪
cutadapt -q 15,10 -o output.fastq input.fastq #5'/3' 低質(zhì)量堿基都修剪
cutadapt -q 15,0 -o output.fastq input.fastq #5'進行質(zhì)量修剪
Quality trimming of reads using two-color chemistry (NextSeq)
cutadapt --nextseq-trim=20 -o out.fastq input.fastq
Quality trimming algorithm
一段序列質(zhì)量編碼如下:
42, 40, 26, 27, 8, 7, 11, 4, 2, 3
首先減去閾值10
32, 30, 16, 17, -2, -3, 1, -6, -8, -7
從末端開始累加,
(70), (38), 8, -8, -25, -23, -20, -21, -15, -7
因為-25 最小,所以保留-25 之前的堿基, 即保留前4位堿基
Shortening reads to a fixed length
-l LENGTH, --length=LENGTH #修剪reads到固定長度
cutadapt -l 10 -o output.fastq.gz input.fastq.gz
Modifying read names
修改reads名字
-y SUFFIX, --suffix=SUFFIX #為reads名添加后綴,使用{name}添加adapter名字
cutadapt -a adapter1=ACGT -y ' we found {name}' input.fastq
如果reads名中有reads長度,使用參數(shù)--length-tag 'length='更新
Read modification order
Filtering reads
reads的過濾
--minimum-length LENGTH or -m LENGTH #根據(jù)最短長度篩選reads;
--too-short-output FILE #為reads長度最小值設(shè)定閾值篩選reads后,要丟棄的部分輸出到文件;
--maximum-length LENGTH or -M LENGTH #根據(jù)最長長度篩選reads;
--too-long-output FILE #為reads長度最大值設(shè)定閾值篩選reads后,要丟棄的部分輸出到文件;
--untrimmed-output FILE #將沒有adapter未做修剪的reads輸出到一個文件,而不是輸出到trimmed reads結(jié)果文件
--discard-trimmed #丟棄只有一個adapter的reads
--discard-trimmed 等價于 --untrimmed-output /dev/null #丟棄沒有adapter的reads
Trimming paired-end reads
cutadapt -a ADAPTER_FWD -A ADAPTER_REV -o out.1.fastq -p out.2.fastq reads.1.fastq reads.2.fastq
-U: 修剪第二條reads的堿基;
對成對的reads都起作用的參數(shù):
-
-q(along with--quality-base) -
--timesapplies to all the adapters given --no-trim--trim-n--mask--length--length-tag-
--prefix,--suffix --strip-f3-
--colorspace,--bwa,-z,--no-zero-cap,--double-encode,--trim-primer
雙端測序reads的過濾:
| Filtering option | With --pair-filter=any, the pair is discarded if … |
With -pair-filter=both, the pair is discarded if … |
|---|---|---|
--minimum-length |
one of the reads is too short | both reads are too short |
--maximum-length |
one of the reads is too long | both reads are too long |
--discard-trimmed |
one of the reads contains an adapter | both reads contain an adapter |
--discard-untrimmed |
one of the reads does not contain an adapter | both reads do not contain an adapter |
--discard-untrimmed |
one of the reads does not contain an adapter | both reads do not contain an adapter |
交叉雙端測序reads:
也是雙端測序,只是所有reads在一個文件;read1在read2前面;
cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.fastq reads.fastq
cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.1.fastq -p trimmed.2.fastq reads.fastq
正常雙端測序,輸出這種格式文件:
cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.1.fastq reads.1.fastq reads.2.fastq
Legacy paired-end read trimming #雙端測序修剪
雙端測序文件單獨處理:(雙端測序沒有共同的參數(shù)的處理)
cutadapt -a ADAPTER_FWD -o trimmed.1.fastq reads1.fastq
cutadapt -a ADAPTER_REV -o trimmed.2.fastq reads2.fastq
cutadapt -q 10 -a ADAPTER_FWD -o trimmed.1.fastq reads1.fastq
cutadapt -q 15 -a ADAPTER_REV -o trimmed.2.fastq reads2.fastq
雙端測序文件單獨處理:(雙端測序經(jīng)過共同參數(shù)的處理)
#先處理read1文件,使用-p生成臨時文件
cutadapt -q 10 -a ADAPTER_FWD --minimum-length 20 -o tmp.1.fastq -p tmp.2.fastq reads.1.fastq reads.2.fastq
#先處理read1文件,使用臨時文件作為輸入
cutadapt -q 15 -a ADAPTER_REV --minimum-length 20 -o trimmed.2.fastq -p trimmed.1.fastq tmp.2.fastq tmp.1.fastq
#移除臨時文件
rm tmp.1.fastq tmp.2.fastq
Multiple adapters
多條adapter
兩條3’ adapters
cutadapt -a TGAGACACGCA -a AGGCACACAGGG -o output.fastq input.fastq
使用adapters文件,fasta格式
cutadapt -a file:adapters.fasta -o output.fastq input.fastq
- All given adapter sequences are matched to the read.
- Adapter matches where the overlap length (see the
-Oparameter) is too small or where the error rate is too high (-e) are removed from further consideration. - Among the remaining matches, the one with the greatest number of matching bases is chosen.
- If there is a tie, the first adapter wins. The order of adapters is the order in which they are given on the command line or in which they are found in the FASTA file.
Named adapters
cutadapt -a My_Adapter=AACCGGTT -o output.fastq input.fastq
Demultiplexing
cutadapt -a one=TATA -a two=GCGC -o trimmed-{name}.fastq.gz input.fastq.gz
#ouput files:demulti-one.fastq.gz, demulti-two.fastq.gz, demulti-unknown.fastq.gz
cutadapt -a file:barcodes.fasta --no-trim --untrimmed-o untrimmed.fastq.gz -o trimmed-{name}.fastq.gz input.fastq.gz
--no-trim: #不修剪reads,保留adapter
--untrimmed-output: #輸出沒有adapter的reads到新文件,默認輸出到正常trimmed reads結(jié)果文件
--untrimmed-paired-output:接受PE測序,沒有adapter的reads2輸出
barcodes.fasta
>barcode01
TTAAGGCC
>barcode02
TAGCTAGC
>barcode03
ATGATGAT
cutadapt -a first=AACCGG -a second=TTTTGG -A ACGTACGT -A TGCATGCA -o trimmed-{name}.1.fastq.gz -p trimmed-{name}.2.fastq.gz input.1.fastq.gz input.2.fastq.gz
#ouput files:
trimmed-first.1.fastq.gz, trimmed-second.1.fastq.gz, trimmed-unknown.1.fastq.gz
trimmed-first.2.fastq.gz, trimmed-second.2.fastq.gz, trimmed-unknown.2.fastq.gz
--untrimmed-paired-output ##輸出沒有adapter的reads到文件
PE reads,一般情況下檢測reads1的adapter決定這對pair reads;如果想考慮reads2,使用參數(shù)-A/-G
Trimming more than one adapter from each read
cutadapt -g ^ADAPTER -n 5 -o output.fastq.gz input.fastq.gz
#直到在reads中找到5次adapter就停止尋找
#-n COUNT, --times=COUNT #從reads行修剪adapter次數(shù)
cutadapt -g ^FIRST -a SECOND -n 2 ...
Illumina TruSeq
單端和雙端reads1: A + the “TruSeq Indexed Adapter”
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -o trimmed.fastq.gz reads.fastq.gz
PE reads:
reads2: TruSeq Universal Adapter
cutadapt \
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT \
-o trimmed.1.fastq.gz -p trimmed.2.fastq.gz \
reads.1.fastq.gz reads.2.fastq.gz
Warning about incomplete adapter sequences
WARNING:
One or more of your adapter sequences may be incomplete.
Please see the detailed output above.
Bases preceding removed adapters:
A: 95.5%
C: 1.0%
G: 1.6%
T: 1.6%
none/other: 0.3%
WARNING:
The adapter is preceded by "A" extremely often.
The provided adapter sequence may be incomplete.
To fix the problem, add "A" to the beginning of the adapter sequence.
擴增子測序可以忽略這些警告。
Dealing with N bases
--max-n COUNT #丟掉超過一定數(shù)目N 的reads;
--trim-n #修剪reads末端的'N'
Bisulfite sequencing (RRBS)
需要移除3’ 端除adapter外的兩個堿基;
cutadapt -a NNADAPTER -o output.fastq input.fastq
Cutadapt’s output
No. of allowed errors:
0-7 bp: 0; 8-15 bp: 1; 16-20 bp: 2
0-7 bp,不允許錯配;8-15 bp,允許錯配1bp; 16-20 bp允許錯配 2bp;
Overview of removed sequences
length count expect max.err error counts
3 140 156.2 0 140
4 57 39.1 0 57
5 50 9.8 0 50
6 35 2.4 0 35
7 13 0.3 0 1 12
8 31 0.1 1 0 31
...
100 397 0.0 3 358 36 3
Format of the info file
--info-file=FILE Write information about each read and its adapter
matches into FILE. #每條reads與adapter的匹配信息
The alignment algorithm
unit costs
mismatches, insertions and deletions are counted as one error,這樣就只需要一個參數(shù)(最大錯誤率)就可以判斷是否進行下一步操作
基本理念:在允許的錯誤率下 ,adapter與reads匹配的區(qū)域最大化;
the procedure is as follows:
- Consider all possible overlaps between the two sequences and compute an alignment for each, minimizing the total number of errors in each one.
- Keep only those alignments that do not exceed the specified maximum error rate.
- Then, keep only those alignments that have a maximal number of matches (that is, there is no alignment with more matches).
- If there are multiple alignments with the same number of matches, then keep only those that have the smallest error rate.
- If there are still multiple candidates left, choose the alignment that starts at the leftmost position within the read.
In Step 1, the different adapter types are taken into account: Only those overlaps that are actually allowed by the adapter type are actually considered.