Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads

Asking about the 'read_cutoff' and 'seed_cutoff' parameters

Hans-zhao831 opened this issue · comments

commented

Hi, Dr. Hu, thanks for developing such a powerful genome assembler. Over the past half month, I've found that NextDenovo gives the best assemblies for the plant species I study, and I'm very happy with the results. I don't have much experience with the software, though, and while adjusting parameters to get the best possible assembly I have run into some problems, so I would like to ask for your advice.

First, a quick overview of the project: PacBio data, ~110x raw coverage, a diploid plant, ~2 Gb genome size, 0.7% heterozygosity, ~60% repeat content, and NextDenovo v2.4.0.

Summary of raw data

| Category | Value |
| --- | --- |
| Total bases | 228,333,067,000 |
| Total reads | 18,120,500 |
| Reads >= 2 kb | 90% |
| Reads >= 5 kb | 75% |
| Reads >= 7 kb | 66% |
| Reads >= 10 kb | 53% |
| Reads >= 13 kb | 42% |
| Reads >= 15 kb | 35% |
| Mean length | 12 kb |
| N50 | 17 kb |
| Median length | 11 kb |
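
As a quick sanity check, the "~110x" figure can be reproduced from the table. The sketch below assumes "2g" means 2 x 10^9 bases; the variable names are only illustrative.

```python
# Figures copied from the raw-data summary and the project overview.
total_bases = 228_333_067_000   # total bases
read_count = 18_120_500         # total reads
genome_size = 2_000_000_000     # "2g", assumed to mean 2e9 bases

# Raw depth is simply sequenced bases divided by genome size.
depth = total_bases / genome_size
# Mean read length is sequenced bases divided by read count.
mean_length = total_bases / read_count

print(f"raw depth   : {depth:.1f}x")
print(f"mean length : {mean_length:.0f} bp")
```

Both values agree with the rounded numbers quoted in the post (roughly 110x and a 12 kb mean).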

1. How can I determine the best values of read_cutoff and seed_cutoff, and the best combination of the two?

I produced four assemblies that differ only in seed_cutoff; all other parameters were identical (read_cutoff = 10k).

| Run | seed_cutoff (seed_depth) | Contig N50 (Mb) | Contig count | Assembly size (Gb) |
| --- | --- | --- | --- | --- |
| run1 | 19436 (50) | 13.71 | 421 | 1.95 |
| run2 | 20000 | 14.66 | 403 | 1.95 |
| run3 | 20553 (45) | 14.29 | 391 | 1.95 |
| run4 | 24645 (30) | 12.22 | 491 | 1.95 |

I also produced two assemblies that differ only in read_cutoff; all other parameters were identical (seed_cutoff = 20k).

| Run | read_cutoff | Contig N50 (Mb) | Contig count | Assembly size (Gb) |
| --- | --- | --- | --- | --- |
| run2 | 10k | 14.66 | 403 | 1.95 |
| run5 | 1k | 10.48 | 816 | 2.00 |

run2.cfg

[General]
job_type = sge 
job_prefix = nextDenovo 
task = all 
rewrite = no 
deltmp = yes 
rerun = 
parallel_jobs = 20 
input_type = raw 
input_fofn = run.fofn
read_type = clr
workdir = 01_rundir
cluster_options = auto

[correct_option]
read_cutoff = 10k
seed_cutoff = 20k
genome_size = 2g
blocksize = 3g   
pa_correction = 20
seed_cutfiles = 10 
sort_options = -m 20g -t 80 -k 40
minimap2_options_raw = -x ava-pb -t 80
correction_options = -p 80

[assemble_option]
random_round = 50
minimap2_options_cns = -x ava-pb -t 80 -k17 -w17
minimap2_options_map = -t 80
nextgraph_options = -a 1

These results show that seed_cutoff and read_cutoff have a large impact on final assembly quality. However, I am unsure how to find the best value for each parameter and the best combination of the two.

2. How can I quickly obtain the final result after changing a few parameters, without running everything from beginning to end?

Currently, I have to re-run the software from beginning to end after each parameter change, which takes a long time. Is there a way to quickly get the final result after modifying only one or a few parameters?

I look forward to your suggestions, and please don't hesitate to let me know if you need additional information.

Thanks for your feedback!

  1. You can use seq_stat to calculate seed_cutoff. Its -d option can usually be set to 30-45, so you will need to try different values. I don't have a better suggestion; if I knew of a better value, I would make it the default.
  2. If you change read_cutoff or seed_cutoff, you need to run everything from beginning to end. If you change only nextgraph_options, just run the main task again and NextDenovo will rerun only the assembly step.
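
The inverse relationship between seed depth and seed_cutoff in the table above (depth 50 gives 19436, depth 30 gives 24645) can be illustrated with a minimal sketch. It assumes the cutoff is chosen so that reads at least that long still supply seed_depth x genome_size bases; the thread does not confirm that seq_stat computes it exactly this way, and the function name `seed_cutoff_for_depth` is hypothetical.

```python
def seed_cutoff_for_depth(read_lengths, genome_size, seed_depth, read_cutoff=0):
    """Return the largest length cutoff at which reads >= cutoff still
    supply seed_depth * genome_size bases, or None if there is too little data.

    Illustrative only: an assumption about how a seed cutoff can be derived,
    not the seq_stat implementation.
    """
    target = seed_depth * genome_size
    total = 0
    # Discard reads below read_cutoff, then walk from the longest read down.
    kept = sorted((n for n in read_lengths if n >= read_cutoff), reverse=True)
    for length in kept:
        total += length
        if total >= target:
            return length
    return None


# Toy data: six reads and a 10 kb "genome".
lengths = [5000, 8000, 12000, 20000, 25000, 30000]
for depth in (3, 5, 7, 20):
    print(depth, seed_cutoff_for_depth(lengths, 10_000, depth))
```

Asking for a higher seed depth pulls the cutoff down, because shorter reads must be admitted as seeds to reach the target number of bases.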
commented

Thanks for your reply.

Based on your experience, could you please suggest a strategy for finding the optimal values of seed_cutoff and read_cutoff?
For example:

  1. Can the -f option in seq_stat be treated as read_cutoff?
  2. Do these two values behave in a linear or a non-linear way?
  3. Should we find the optimal value of each parameter individually, or do we need to consider different combinations of these two parameters, or of other parameters as well?
  1. Yes.
  2. Not tested.
  3. Try different combinations.
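
Since the advice is to try combinations, and each change to read_cutoff or seed_cutoff needs a full run, one way to organise the sweep is to generate one config per combination, each with its own workdir. The sketch below is only an illustration: it assumes the .cfg file round-trips cleanly through Python's `configparser`, uses a shortened copy of run2.cfg, and the `rundir_...` naming scheme and `make_variants` helper are hypothetical.

```python
import configparser
import io
import itertools

# Shortened copy of run2.cfg; a real sweep would start from the full file.
BASE_CFG = """\
[General]
job_type = sge
task = all
input_fofn = run.fofn
workdir = 01_rundir

[correct_option]
read_cutoff = 10k
seed_cutoff = 20k
genome_size = 2g
"""


def make_variants(base_text, read_cutoffs, seed_cutoffs):
    """Yield (name, cfg_text) for every read_cutoff x seed_cutoff pair."""
    for rc, sc in itertools.product(read_cutoffs, seed_cutoffs):
        cfg = configparser.ConfigParser(interpolation=None)
        cfg.read_string(base_text)
        cfg["correct_option"]["read_cutoff"] = str(rc)
        cfg["correct_option"]["seed_cutoff"] = str(sc)
        name = f"rc{rc}_sc{sc}"
        # Separate workdir per run so the full reruns do not overwrite each other.
        cfg["General"]["workdir"] = f"rundir_{name}"
        out = io.StringIO()
        cfg.write(out)
        yield name, out.getvalue()


# Values taken from the runs reported earlier in the thread.
variants = dict(make_variants(BASE_CFG, ["1k", "10k"], ["19436", "20553", "24645"]))
for name in variants:
    print(name)
```

Each generated text can be written to its own file and submitted as an independent run; the contig N50 and contig count of the results can then be compared in the same way as the tables above.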