Canu parameter tuning
For all stages:
- rawErrorRate is the maximum expected difference in an alignment of two uncorrected reads. It is a meta-parameter that sets other parameters. It generally does not need to be adjusted.
- correctedErrorRate is the maximum expected difference in an alignment of two corrected reads, expressed as a fraction. It is a meta-parameter that sets other parameters. (If you’re used to the errorRate parameter, multiply that by 3 and use it here.) This parameter often needs to be adjusted several times over the course of an assembly. Raising it increases run time; lowering it reduces run time but risks missing overlaps and breaking the assembly. The default is 0.045 for PacBio and 0.144 for Nanopore. For low-coverage datasets (less than 30X), we suggest raising it by about 0.01; for high-coverage datasets (more than 60X), we suggest lowering it by about 0.01.
- minReadLength and minOverlapLength. The defaults are to discard reads shorter than 1000bp and to not look for overlaps shorter than 500bp. Increasing minReadLength can improve run time, and increasing minOverlapLength can improve assembly quality by removing false overlaps. However, increasing either too much will quickly degrade assemblies by either omitting valuable reads or missing true overlaps.
- minReadLength is the minimum read length, 1000 by default. It must be larger than minOverlapLength. If it is set high enough, the gatekeeper module will claim there are errors in the input because too many input reads have been discarded. As long as there is sufficient coverage, this is not a problem.
- minOverlapLength is the minimum overlap length, 500 by default. It must be smaller than minReadLength. Smaller values can be used to overcome a lack of read coverage, but will also lead to false overlaps and potential misassemblies. Larger values will result in more correct assemblies, but more fragmented ones.
- genomeSize is an estimate of the genome size, for example 3.7m or 2.8g. The estimate is used to decide how many reads to correct (via the corOutCoverage parameter) and how sensitive the mhap overlapper should be (via the mhapSensitivity parameter). It also affects some logging, in particular the reporting of N50 sizes.
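The relationships among these global parameters can be checked with a few lines of Python before launching a run. This is only a sketch: the function names are hypothetical, the "k" suffix is an assumption (the text above only shows "m" and "g"), and the total number of read bases is a value you would have to measure yourself.

```python
from decimal import Decimal

# Multipliers for genomeSize strings such as "3.7m" or "2.8g".
# The "k" suffix is an assumption; only "m" and "g" appear in the text.
_SUFFIX = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def parse_genome_size(text):
    """Convert a genomeSize string like '3.7m' into a number of bases."""
    text = text.strip().lower()
    if text and text[-1] in _SUFFIX:
        # Decimal avoids floating-point rounding on values such as 3.7.
        return int(Decimal(text[:-1]) * _SUFFIX[text[-1]])
    return int(Decimal(text))


def estimated_coverage(total_read_bases, genome_size):
    """Approximate read coverage: total bases divided by genome size."""
    return total_read_bases / parse_genome_size(genome_size)


def min_lengths_ok(min_read_length=1000, min_overlap_length=500):
    """minReadLength must be larger than minOverlapLength."""
    return min_read_length > min_overlap_length


if __name__ == "__main__":
    print(parse_genome_size("3.7m"))
    print(estimated_coverage(185_000_000, "3.7m"))
    print(min_lengths_ok(1000, 500))
```

The coverage estimate is what decides whether the low-coverage (below 30X) or high-coverage (above 60X) advice applies.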
For correction:
- corOutCoverage controls how much coverage in corrected reads is generated. The default is to target 40X, but, for various reasons, this results in 30X to 35X of reads being generated.
- corMinCoverage, loosely, controls the quality of the corrected reads. It is the coverage in evidence reads that is needed before a (portion of a) corrected read is reported. Corrected reads are generated as a consensus of other reads; this is just the minimum coverage needed for the consensus sequence to be reported. The default is based on input read coverage: 0x for less than 30X input coverage, and 4x for more than that.
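As a rough mental model of the two correction parameters, the sketch below picks the longest reads until a budget of corOutCoverage times the genome size is reached, and returns the corMinCoverage default described above. It is a simplified illustration with hypothetical function names, not Canu's actual read selection code; the behaviour at exactly 30X is an assumption, since the text only covers "less than" and "more than".

```python
def reads_for_correction(read_lengths, genome_size, cor_out_coverage=40):
    """Choose the longest reads until the target number of bases is reached."""
    target_bases = cor_out_coverage * genome_size
    chosen, total = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total >= target_bases:
            break
        chosen.append(length)
        total += length
    return chosen


def default_cor_min_coverage(input_coverage):
    """0 below 30X input coverage, otherwise 4 (exactly 30X is assumed to give 4)."""
    return 0 if input_coverage < 30 else 4


if __name__ == "__main__":
    # Toy numbers: a 1-base "genome" and a target of 8X.
    print(reads_for_correction([5, 1, 3, 4, 2], genome_size=1, cor_out_coverage=8))
    print(default_cor_min_coverage(20), default_cor_min_coverage(50))
```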
For assembly:
- utgOvlErrorRate is essentially a speed optimization. Overlaps above this error rate are not computed. Setting it too high generally just wastes compute time, while setting it too low will degrade assemblies by missing true overlaps between lower quality reads. It generally does not need to be adjusted.
- utgGraphDeviation and utgRepeatDeviation control what quality of overlaps are used in contig construction or in breaking contigs at false repeat joins, respectively. Both are in terms of a deviation from the mean error rate in the longest overlaps. They are normally left unchanged.
- utgRepeatConfusedBP controls how similar a true overlap (between two reads in the same contig) and a false overlap (between two reads in different contigs) need to be before the contig is split. When this occurs, it isn’t clear which overlap is ‘true’ - the longer one or the slightly shorter one - and the contig is split to avoid misassemblies. It is normally left unchanged.
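To make "a deviation from the mean error rate" concrete, the following sketch keeps overlaps whose error rate is at most the mean plus a chosen number of standard deviations. The exact statistic Canu uses is not described in the text above, so the formula and the function names here are assumptions made purely for illustration.

```python
from statistics import mean, pstdev


def deviation_cutoff(error_rates, deviations):
    """Cutoff = mean error rate + deviations * population standard deviation."""
    return mean(error_rates) + deviations * pstdev(error_rates)


def keep_overlaps(error_rates, deviations):
    """Return the error rates that fall at or below the cutoff."""
    cutoff = deviation_cutoff(error_rates, deviations)
    return [rate for rate in error_rates if rate <= cutoff]


if __name__ == "__main__":
    rates = [0.01, 0.02, 0.03]
    print(keep_overlaps(rates, 1))  # a tight cutoff drops the worst overlap
    print(keep_overlaps(rates, 3))  # a loose cutoff keeps everything
```

A smaller deviation value is stricter about which overlaps are trusted; a larger one admits noisier overlaps.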
For polyploid genomes:
Generally, there are a couple of ways of dealing with ploidy.
1. Avoid collapsing the genome, so you end up with double the genome size (assuming a diploid), as long as your divergence is above about 2% (for PacBio data). Below this divergence, you’d end up collapsing the variations. We’ve used the following parameters for polyploid populations (PacBio data):
corOutCoverage=200 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50"
This will output more corrected reads (than the default 40x). The latter option will be more conservative at picking the error rate to use for the assembly to try to maintain haplotype separation. If it works, you’ll end up with an assembly >= 2x your haploid genome size. Post-processing using gene information or other synteny information is required to remove redundancy from this assembly.
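The batOptions value contains spaces, so it has to reach canu as a single argument, which is why it is quoted in the line above. The sketch below uses Python's shlex module to build a safely quoted fragment of a command line from the two options; the rest of the command (executable, prefix, read files) is deliberately left out because it depends on your setup.

```python
import shlex

# The two options given above, each as one argument.
options = [
    "corOutCoverage=200",
    "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50",
]

# shlex.join quotes any argument that contains spaces.
command_fragment = shlex.join(options)
print(command_fragment)
# prints: corOutCoverage=200 'batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50'
```

Passing the options as a list to subprocess.run, rather than as one shell string, sidesteps the quoting question altogether.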
2. Smash haplotypes together and then do phasing using another approach (like HapCUT2 or WhatsHap or others). This approach is not recommended. In that case you want to do the opposite and increase the error rates used for finding overlaps:
corOutCoverage=200 correctedErrorRate=0.15
When trimming, reads will be trimmed using other reads in the same chromosome (and probably some reads from other chromosomes). When assembling, overlaps well outside the observed error rate distribution are discarded.
We strongly recommend option 1, which will lead to a larger than expected genome size. See "My genome size and assembly size are different, help!" for details on how to remove this duplication. In limited testing, purge_dups has been used successfully to remove the redundancy.
For metagenomes:
The basic idea is to use all data for assembly rather than just the longest reads, as is the default. The parameters we’ve used recently are:
corOutCoverage=10000 corMhapSensitivity=high corMinCoverage=0 redMemory=32 oeaMemory=32 batMemory=200
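One way to see why corOutCoverage=10000 amounts to "use all data": if the correction budget is taken to be corOutCoverage times genomeSize, the budget exceeds any realistic dataset. The sketch below does that arithmetic with hypothetical numbers (a 5 Mbp genomeSize and 20 Gbp of reads); the budget formula is a simplification and the function names are illustrative.

```python
def correction_target_bases(genome_size, cor_out_coverage):
    """Bases targeted for correction under the simplified budget model."""
    return genome_size * cor_out_coverage


def corrects_all_reads(total_read_bases, genome_size, cor_out_coverage):
    """True when the budget is at least as large as the whole dataset."""
    return total_read_bases <= correction_target_bases(genome_size, cor_out_coverage)


if __name__ == "__main__":
    genome_size = 5_000_000          # hypothetical genomeSize of 5m
    total_bases = 20_000_000_000     # hypothetical 20 Gbp of reads
    print(corrects_all_reads(total_bases, genome_size, 40))
    print(corrects_all_reads(total_bases, genome_size, 10000))
```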
For low coverage:
For less than 30X coverage, increase the allowed difference in overlaps by a few percent (from 4.5% to 8.5% (or more) with correctedErrorRate=0.105 for PacBio and from 14.4% to 16% (or more) with correctedErrorRate=0.16 for Nanopore), to adjust for inferior read correction. Canu will automatically reduce corMinCoverage to zero to correct as many reads as possible.
For high coverage:
For more than 60X coverage, decrease the allowed difference in overlaps (from 4.5% to 4.0% with correctedErrorRate=0.040 for PacBio, from 14.4% to 12% with correctedErrorRate=0.12 for Nanopore), so that only the better corrected reads are used. This is primarily an optimization for speed and generally does not change assembly continuity.
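The low- and high-coverage suggestions can be collected into a small lookup. The sketch below uses only the correctedErrorRate values quoted above; the function names and the treatment of exactly 30X and 60X as "default" are assumptions, and the values are starting points rather than fixed rules.

```python
# correctedErrorRate values quoted in the text, per platform.
SUGGESTED_RATES = {
    "pacbio": {"low": 0.105, "default": 0.045, "high": 0.040},
    "nanopore": {"low": 0.16, "default": 0.144, "high": 0.12},
}


def suggested_corrected_error_rate(platform, coverage):
    """Pick a starting correctedErrorRate from platform and read coverage."""
    rates = SUGGESTED_RATES[platform]
    if coverage < 30:
        return rates["low"]
    if coverage > 60:
        return rates["high"]
    return rates["default"]


def as_option(platform, coverage):
    """Format the suggestion as a key=value option string."""
    rate = suggested_corrected_error_rate(platform, coverage)
    return f"correctedErrorRate={rate:.3f}"


if __name__ == "__main__":
    print(as_option("pacbio", 20))
    print(as_option("nanopore", 80))
```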