Asking for the 'read_cutoff' and 'seed_cutoff' parameters
Hans-zhao831 opened this issue
Hi, Dr. Hu, thanks for developing such a powerful genome assembly tool. Over the past half month, I've found that NextDenovo gives the best assemblies for the plant species I study, and I'm very happy with the results. However, I don't have much experience with the software, and while adjusting parameters to get the best possible assembly I have run into some problems, so I would like to ask for your advice.
Before getting to the questions, here is a quick overview of the project: PacBio data, ~110x raw coverage, diploid plant, ~2 Gb genome size, 0.7% heterozygosity, ~60% repeat content, and NextDenovo v2.4.0.
Summary of raw data
Category | Value |
---|---|
Base Num | 228,333,067,000 |
Reads Num | 18,120,500 |
>=2k Reads Num | 90% |
>=5k Reads Num | 75% |
>=7k Reads Num | 66% |
>=10k Reads Num | 53% |
>=13k Reads Num | 42% |
>=15k Reads Num | 35% |
Mean Length | 12k |
N50 | 17k |
Median Length | 11k |
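As a quick sanity check on the numbers above, the raw depth and mean read length can be recomputed from the table. This is only a sketch: the 2 Gb genome size is the estimate given in this post, and the variable names are made up for illustration. It gives roughly 114x and about 12.6 kb, in line with the "~110x" and "12k" figures.

```python
# Figures copied from the summary table; the genome size is the ~2 Gb estimate
# from the post, not a measured value.
TOTAL_BASES = 228_333_067_000
READ_COUNT = 18_120_500
GENOME_SIZE = 2_000_000_000


def raw_depth(total_bases: int, genome_size: int) -> float:
    """Sequencing depth = total sequenced bases / genome size."""
    return total_bases / genome_size


def mean_read_length(total_bases: int, read_count: int) -> float:
    """Mean read length = total sequenced bases / number of reads."""
    return total_bases / read_count


depth = raw_depth(TOTAL_BASES, GENOME_SIZE)
mean_len = mean_read_length(TOTAL_BASES, READ_COUNT)
print(f"raw depth ~ {depth:.0f}x, mean read length ~ {mean_len:.0f} bp")
```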
1. How can I find the best values of read_cutoff and seed_cutoff, and the best combination of the two?
I obtained 4 assemblies using different seed_cutoff values with all other parameters unchanged (read_cutoff=10k).
Run | seed_cutoff (seed_depth) | contig N50 (Mb) | contig Num | contig size (Gb) |
---|---|---|---|---|
run1 | 19436 (50) | 13.71 | 421 | 1.95 |
run2 | 20000 | 14.66 | 403 | 1.95 |
run3 | 20553 (45) | 14.29 | 391 | 1.95 |
run4 | 24645 (30) | 12.22 | 491 | 1.95 |
I also obtained 2 assemblies using two different read_cutoff values with all other parameters unchanged (seed_cutoff=20k).
Run | read_cutoff | contig N50 (Mb) | contig Num | contig size (Gb) |
---|---|---|---|---|
run2 | 10k | 14.66 | 403 | 1.95 |
run5 | 1k | 10.48 | 816 | 2.00 |
run2.cfg
[General]
job_type = sge
job_prefix = nextDenovo
task = all
rewrite = no
deltmp = yes
rerun =
parallel_jobs = 20
input_type = raw
input_fofn = run.fofn
read_type = clr
workdir = 01_rundir
cluster_options = auto
[correct_option]
read_cutoff = 10k
seed_cutoff = 20k
genome_size = 2g
blocksize = 3g
pa_correction = 20
seed_cutfiles = 10
sort_options = -m 20g -t 80 -k 40
minimap2_options_raw = -x ava-pb -t 80
correction_options = -p 80
[assemble_option]
random_round = 50
minimap2_options_cns = -x ava-pb -t 80 -k17 -w17
minimap2_options_map = -t 80
nextgraph_options = -a 1
Based on the above results, seed_cutoff and read_cutoff clearly have a big impact on the final assembly quality. However, I am unsure how to find the best value for each and the best combination of the two.
2. How can I quickly obtain the final result after a few parameter changes without rerunning from beginning to end?
Currently, I have to re-run the software from beginning to end after each parameter change, which takes a long time. Is there a way to quickly get the final result by modifying only one or a few parameters?
I look forward to your suggestions, and please don't hesitate to let me know if you need additional information.
Thanks for your feedback!
- You can use seq_stat to calculate seed_cutoff, and the -d in seq_stat can usually be set to 30-45, so you need to try different values. I don't have a better suggestion; if I had a better value, I would set it as the default.
- If you change read_cutoff or seed_cutoff, you need to run it from beginning to end. If you change nextgraph_options, just run the main task again; NextDenovo will rerun the assembly step only.
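To make the link between seed depth and seed_cutoff concrete, here is a minimal sketch of how a depth-based length cutoff can be derived. This is an assumption about the general idea, not seq_stat's actual code: take reads from longest to shortest until they add up to seed_depth x genome_size bases, and use the length of the last read taken as the cutoff. The function name and the toy read lengths are hypothetical.

```python
from typing import Iterable, Optional


def seed_cutoff_for_depth(
    read_lengths: Iterable[int],
    genome_size: int,
    seed_depth: float,
    read_cutoff: int = 0,
) -> Optional[int]:
    """Return the length L such that reads of length >= L sum to about
    seed_depth * genome_size bases, or None if the data cannot reach that depth.
    Reads shorter than read_cutoff are ignored."""
    target = seed_depth * genome_size
    total = 0
    kept = (n for n in read_lengths if n >= read_cutoff)
    for length in sorted(kept, reverse=True):  # longest reads first
        total += length
        if total >= target:
            return length
    return None


# Toy data: five reads and a "genome" of 10 bases.
toy_reads = [10, 8, 6, 4, 2]
print(seed_cutoff_for_depth(toy_reads, genome_size=10, seed_depth=2))  # needs 20 bases
print(seed_cutoff_for_depth(toy_reads, genome_size=10, seed_depth=3))  # needs 30 bases
```

Under this scheme a higher seed depth gives a lower cutoff, which matches the runs reported above (depth 50 gave 19436, depth 45 gave 20553, depth 30 gave 24645).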
Thanks for your reply.
Based on your experience, could you please suggest a strategy for finding the optimal values of seed_cutoff and read_cutoff?
For example,
- can the -f in seq_stat be considered as read_cutoff?
- do these two values affect the assembly in a linear or non-linear way?
- should we test the optimal value of each parameter individually, or do we need to consider different combinations of these two parameters, or of other parameters as well?
- Yes.
- Not tested.
- Different combinations.
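Since the advice is to try different combinations, one way to organise this is to generate one config per read_cutoff/seed_cutoff pair from a base config. The sketch below is only an illustration: the candidate values are taken from the runs earlier in this thread, the helper name and the per-run workdir naming are hypothetical, and the base config is shortened to the lines that change. Each generated config still needs a full run, because changing either cutoff means starting from the beginning.

```python
from itertools import product

# Candidate values taken from the runs reported in this thread; adjust as needed.
READ_CUTOFFS = ["1k", "10k"]
SEED_CUTOFFS = ["19436", "20553", "24645"]

# Shortened stand-in for run2.cfg; a real base config would contain every option.
BASE_CFG = """[General]
workdir = 01_rundir
[correct_option]
read_cutoff = 10k
seed_cutoff = 20k
"""


def make_run_cfg(base_cfg: str, read_cutoff: str, seed_cutoff: str, workdir: str) -> str:
    """Return a copy of base_cfg with the two cutoffs and the workdir replaced."""
    overrides = {"read_cutoff": read_cutoff, "seed_cutoff": seed_cutoff, "workdir": workdir}
    lines = []
    for line in base_cfg.splitlines():
        key = line.split("=", 1)[0].strip()
        if "=" in line and key in overrides:
            line = f"{key} = {overrides[key]}"
        lines.append(line)
    return "\n".join(lines) + "\n"


# One config text per combination, keyed by a hypothetical run name.
configs = {}
for rc, sc in product(READ_CUTOFFS, SEED_CUTOFFS):
    name = f"rc{rc}_sc{sc}"
    configs[name] = make_run_cfg(BASE_CFG, rc, sc, workdir=f"rundir_{name}")

for name in configs:
    print(name)
```

Giving each combination its own workdir keeps the runs from overwriting one another, so the resulting contig statistics can be compared side by side.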