Input
[General]
input_fofn=input.fofn
input_type=raw
pa_DBdust_option=true
pa_fasta_filter_option=streamed-median
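input_type: either raw or preads. If preads is specified, the pipeline skips the entire 0-rawreads pre-assembly stage.
pa_fasta_filter_option: defaults to streamed-internal-median. It decides which subread to keep when a single ZMW produced several subreads. "pass": no filtering, every subread is kept; "streamed-median": the median-length subread is kept; "streamed-internal-median": when a ZMW has fewer than three subreads the longest one is kept, otherwise the median-length subread is kept.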
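The three filter modes can be illustrated with a minimal Python sketch. It only mirrors the description above and is not FALCON's own code; the function name select_subreads and the tie-breaking rule for even subread counts are assumptions made for illustration.

```python
def select_subreads(lengths, mode="streamed-internal-median"):
    """Return the subread lengths kept for one ZMW under the given filter mode."""
    if not lengths or mode == "pass":
        return list(lengths)  # "pass": keep every subread
    ordered = sorted(lengths)
    # Upper median for even counts (an assumption, not taken from FALCON).
    median = ordered[len(ordered) // 2]
    if mode == "streamed-median":
        return [median]
    if mode == "streamed-internal-median":
        # Fewer than three subreads: keep the longest; otherwise keep the median.
        return [ordered[-1]] if len(ordered) < 3 else [median]
    raise ValueError(f"unknown filter mode: {mode}")
```

For example, a ZMW with subreads of 500, 3000 and 1200 bases keeps the 1200-base subread under both median modes, while a ZMW with only two subreads keeps the longer one under streamed-internal-median.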
Data Partitioning
# large genomes
pa_DBsplit_option=-x500 -s200
ovlp_DBsplit_option=-x500 -s200
# small genomes (<10Mb)
pa_DBsplit_option = -x500 -s50
ovlp_DBsplit_option = -x500 -s50
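These settings are passed to DBsplit, which splits the data into multiple blocks; all later computation operates on those blocks. -s controls the size of the DB blocks (in megabase pairs).
If pa_fasta_filter_option=pass was set earlier, add the -a option to pa_DBsplit_option.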
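To get a feel for how -s affects the number of blocks (and therefore the number of overlap jobs), a rough estimate can be computed as below. This is a back-of-the-envelope sketch: the helper name estimate_block_count is hypothetical, and the real block count will be somewhat lower because -x removes short reads before splitting.

```python
import math

def estimate_block_count(total_bases, block_size_mbp):
    """Approximate number of DB blocks for a given -s value (in Mbp)."""
    # Assumes every base is kept; the -x length filter is ignored here.
    return max(1, math.ceil(total_bases / (block_size_mbp * 1_000_000)))
```

Under these assumptions, 50 Gbp of raw reads with -s200 gives about 250 blocks, while 300 Mbp with -s50 gives about 6.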
Repeat Masking
pa_HPCTANmask_option=
pa_REPmask_code=0,300;0,300;0,300
Repeat masking occurs in two phases, Tandem and Interspersed. Tandem repeat masking is run with a modified version of daligner called datander and thus uses a similar parameter set. Whatever settings you use for pre-assembly daligner overlapping in the next section (pa_daligner_option) will be used here for tandem repeat masking. You can supply additional arguments for tandem repeat masking that will be passed to HPC.TANmask with the pa_HPCTANmask_option.
The second phase of masking deals with interspersed repeats and can be run in up to 3 iterations, specified with the pa_REPmask_code option. Each iteration needs a group size and a coverage, written as group,coverage pairs separated by semicolons, as seen above.
For information and theory on how to set up your rounds of repeat masking, consult this blog post.
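The format of pa_REPmask_code can be made concrete with a small parser. This is only a sketch of the syntax described above; the function name parse_repmask_code is hypothetical and FALCON's own parsing may differ.

```python
def parse_repmask_code(code):
    """Split 'group,coverage;group,coverage;...' into a list of (group, coverage) tuples."""
    rounds = []
    for item in code.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        group, coverage = (int(value) for value in item.split(","))
        rounds.append((group, coverage))
    if len(rounds) > 3:
        raise ValueError("at most 3 interspersed-repeat masking iterations are supported")
    return rounds
```

Applied to the value shown above, 0,300;0,300;0,300, this yields three iterations, each with group size 0 and coverage 300.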
Pre-assembly
genome_size=1000000000
seed_coverage=30
length_cutoff=-1
pa_HPCdaligner_option=-v -B128 -M24
pa_daligner_option=-e0.8 -l2000 -k18 -h480 -w8 -s100
falcon_sense_option=--output-multi --min-idt 0.70 --min-cov 3 --max-n-read 400
falcon_sense_greedy=False
During pre-assembly, the PacBio subreads are aligned and error correction is performed. The longest subreads are chosen as seed reads and all shorter reads are aligned to them and consensus sequences are generated from the alignments. These consensus sequences are called pre-assembled reads or preads and generally have accuracy greater than 99% or QV20.
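If you want the seed-read length cutoff to be calculated automatically, provide genome_size and seed_coverage and set length_cutoff=-1, as in the example above. We generally recommend 20-40x seed coverage.
Alternatively, if you do not know the genome size, are unsure which seed_coverage to use, or simply want to use every read above a particular length, you can set that limit manually with length_cutoff.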
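The idea behind the automatic cutoff can be sketched in a few lines: walk through the reads from longest to shortest until the accumulated bases reach genome_size × seed_coverage, and use the length of the last read taken as the cutoff. The function below is an illustrative sketch under that assumption, with a hypothetical name, not FALCON's own implementation.

```python
def seed_length_cutoff(read_lengths, genome_size, seed_coverage):
    """Smallest read length needed so that the longest reads reach the target seed coverage."""
    target_bases = genome_size * seed_coverage
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_bases:
            return length
    raise ValueError("not enough bases to reach the requested seed coverage")
```

With reads of 10000, 8000, 6000, 4000 and 2000 bases, a toy genome_size of 1000 and seed_coverage of 20, the three longest reads are needed to reach 20000 bases, so the cutoff is 6000.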
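Note that whatever value length_cutoff takes also acts as a limit for falcon-unzip: reads shorter than the cutoff are not used for phasing. For assembly alone, a high length_cutoff is unlikely to do any harm unless you expect a specific feature such as micro-chromosomes or short circular plasmids. If you plan to unzip, however, a high cutoff artificially limits your phasing dataset, so a lower length_cutoff may serve you better. Most of the computation happens during pre-assembly, so if compute time matters, raising length_cutoff improves efficiency at the cost of the trade-offs above.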
Overlap options for daligner are set with the pa_HPCdaligner_option and pa_daligner_option flags. Previous versions of FALCON had a single parameter. This is now split into two flags: pa_HPCdaligner_option, which affects the requested resources, and pa_daligner_option, which affects the overlap search. For pa_HPCdaligner_option, the -v parameter is passed to the LAsort and LAmerge programs, while the -B and -M parameters are passed to the daligner sub-commands.
To understand the theory and how to configure daligner see this blog post and this command reference guide.
For daligner, in general we recommend the following:
-e: average correlation rate (average sequence identity)
0.70 (low quality data) - 0.80 (high quality data). A higher value will help prevent haplotype collapse.
-l: minimum length of overlap
1000 (shorter library) - 5000 (longer library)
-k: kmer size
14 (low quality data) - 18 (high quality data)
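Lower -k values give higher sensitivity at the cost of more disk space, more memory and longer run times, and they work best with lower-quality data. Conversely, larger -k values give higher specificity, use fewer system resources and run faster, but are only suitable for high-quality data.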
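As a convenience, an option string can be checked against the ranges recommended above before a long run is launched. The following is a hypothetical helper, not part of FALCON or daligner; it only understands flags of the form -X<number> and treats the recommendations as soft limits.

```python
import re

# Recommended ranges quoted in the text for raw-read overlapping.
RECOMMENDED = {"e": (0.70, 0.80), "l": (1000, 5000), "k": (14, 18)}

def parse_daligner_options(option_string):
    """Map each single-letter flag to its numeric value, e.g. '-k18' -> {'k': 18.0}."""
    options = {}
    for token in option_string.split():
        match = re.fullmatch(r"-([A-Za-z])(\d*\.?\d+)", token)
        if match:
            options[match.group(1)] = float(match.group(2))
    return options

def out_of_range(option_string, ranges=RECOMMENDED):
    """Return the flags whose values fall outside the recommended ranges."""
    options = parse_daligner_options(option_string)
    return sorted(
        flag
        for flag, (low, high) in ranges.items()
        if flag in options and not low <= options[flag] <= high
    )
```

The pa_daligner_option value shown earlier (-e0.8 -l2000 -k18 -h480 -w8 -s100) passes this check; a string such as -e0.96 -k24 would be flagged for both -e and -k, which is expected because those values belong to pread overlapping rather than raw-read overlapping.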
You can configure basic pre-assembly consensus calling options with the falcon_sense_option flag.
--output-multi: necessary for generating proper fasta headers
--min-idt: minimum alignment identity
--min-cov: minimum coverage necessary
--max-n-read: maximum number of reads used for calling consensus to make the preads
By default, -fo are the parameters passed to LA4Falcon. The option falcon_sense_greedy changes this parameter set to -fog which essentially attempts to maintain relative information between reads that have been broken due to regions of low quality.
Pread overlapping
ovlp_HPCdaligner_option=-v -M24 -l500
ovlp_daligner_option=-e.96 -s1000 -h60
The second phase, overlapping of the error-corrected reads, works much like the overlapping performed in pre-assembly; however, no repeat masking is performed and no consensus is called. Overlaps are identified and fed into the final assembly. The parameter options work the same way as described above in the pre-assembly section.
Recommendation for preads:
-e: average correlation rate (average sequence identity)
0.93 (inbred) - 0.96 (outbred)
-l: minimum length of overlap
1800 (poor preassembly, short/low quality library) - 6000 (long, high quality library)
-k: kmer size
18 (low quality) - 24 (most cases)
Final Assembly
# experiment with "--min-idt" to collapse (98-99) or split haplotypes (up to 99.9) during contig assembly
# if you plan to unzip, collapse first using ~98, lower for very divergent haplotypes
# ignore indels looks at only substitutions in overlaps, allows higher overlap stringency to reduce repeat-induced errors
overlap_filtering_setting = --max-diff 400 --max-cov 400 --min-cov 2 --n-core 24 --min-idt 99.9 --ignore-indels
overlap_filtering_setting=--max-diff 100 --max-cov 100 --min-cov 2
fc_ovlp_to_graph_option=
length_cutoff_pr=1000
The option overlap_filtering_setting sets the criteria for filtering pread overlaps. --max-diff filters overlaps whose coverage differs between the 5' and 3' ends by more than the specified value. --max-cov filters highly represented overlaps, typically caused by contaminants or repeats, and --min-cov specifies a minimum overlap coverage.
Setting --min-cov too low allows more overlaps to be detected, at the cost of possible additional chimeras and misassemblies.
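The three coverage criteria can be summarised as a single predicate. The sketch below assumes that each pread has one overlap count at its 5' end and one at its 3' end; the function name keep_read and this simplified model are assumptions for illustration, and the actual filter in FALCON may apply additional rules.

```python
def keep_read(cov_5p, cov_3p, max_diff=100, max_cov=100, min_cov=2):
    """Return True if a pread's end coverages pass the three filtering criteria."""
    if abs(cov_5p - cov_3p) > max_diff:
        return False  # coverage too unbalanced between the two ends
    if max(cov_5p, cov_3p) > max_cov:
        return False  # over-represented, e.g. repeat or contaminant
    if min(cov_5p, cov_3p) < min_cov:
        return False  # too little support at one end
    return True
```

With the defaults (taken from the second overlap_filtering_setting line above), a pread with 30 and 35 overlaps at its two ends is kept, whereas one with 30 and 150 is dropped, as is one with only a single overlap at one end.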
length_cutoff_pr is the minimum length of the preads used for the final assembly. Typically, this value is set to allow for approximately 15 to 30-fold coverage of corrected reads in the final assembly.
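To check whether a candidate length_cutoff_pr lands in the 15 to 30-fold window, the pread coverage remaining above the cutoff can be computed directly. This is a minimal sketch with a hypothetical function name; it assumes you already have the list of pread lengths and an estimate of the genome size.

```python
def coverage_at_cutoff(pread_lengths, cutoff, genome_size):
    """Fold coverage contributed by preads at least `cutoff` bases long."""
    kept_bases = sum(length for length in pread_lengths if length >= cutoff)
    return kept_bases / genome_size
```

Trying a few cutoffs with this function and picking the largest one that still gives roughly 15 to 30-fold coverage is one simple way to choose the value.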
Miscellaneous configuration options
Additional configuration options that don't necessarily fit into one of the previous categories are described here.
target=assembly
skip_checks=False
LA4Falcon_preload=false
FALCON can be configured to stop after any of its three stages by setting the target flag to pre-assembly, overlapping or assembly. Each option stops the pipeline at the end of its corresponding stage: 0-rawreads, 1-preads_ovl or 2-asm-falcon, respectively. The default is the full assembly pipeline.
The flag skip_checks disables .las file checks with LAcheck which has been known to cause errors on certain systems in the past.
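The option LA4Falcon_preload passes the -P parameter to LA4Falcon, which loads all reads into memory. On slower file systems this can speed things up considerably, but it greatly increases the memory requirements of the consensus stage.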
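Because a typo in an option value may only surface hours into a run, it can help to sanity-check the configuration first. The sketch below uses Python's standard configparser to verify the two options whose allowed values are listed in this post; the helper name check_general_section and the ALLOWED table are assumptions for illustration, not part of FALCON.

```python
import configparser

# Allowed values as described in this post (not an exhaustive FALCON schema).
ALLOWED = {
    "input_type": {"raw", "preads"},
    "target": {"overlapping", "pre-assembly", "assembly"},
}

def check_general_section(cfg_text):
    """Return 'key=value' strings for options in [General] with unexpected values."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    general = parser["General"]
    problems = []
    for key, allowed in ALLOWED.items():
        value = general.get(key)
        if value is not None and value not in allowed:
            problems.append(f"{key}={value}")
    return problems
```

An empty list means both options, if present, hold one of the documented values. Note that configparser rejects a file in which the same key appears twice in one section, so keep only one overlap_filtering_setting line in the real configuration.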