Asking about the 'read_cutoff' and 'seed_cutoff' parameters

Question

Hans-zhao831 opened this issue 3 years ago

Hi Dr. Hu, thank you for developing such a powerful genome assembler. Over the past half month, I have found that NextDenovo gives the best assemblies for the plant species I study, and I am very happy with the results. However, I don't have much experience with the software and am still trying to obtain the best assembly by adjusting the parameters. I have run into some problems along the way and would like to ask for your advice.

Before my questions, here is a quick overview of the project: PacBio data, ~110x raw coverage, a diploid plant, a genome size of about 2 Gb, 0.7% heterozygosity, ~60% repeat content, and NextDenovo v2.4.0.

Summary of raw data

Category	Value
Base Num	228,333,067,000
Reads Num	18,120,500
>= 2 k Reads Num	90%
>= 5 k Reads Num	75%
>= 7 k Reads Num	66%
>= 10 k Reads Num	53%
>= 13 k Reads Num	42%
>= 15 k Reads Num	35%
Mean Length	12k
N50	17k
Median Length	11k
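
As a quick sanity check on the table above, the raw coverage and mean read length can be recomputed from the base and read counts. This is a minimal Python sketch, assuming the 2 Gb genome size stated in the post; the variable names are illustrative only.

```python
# Values copied from the summary table; genome size is the poster's estimate.
total_bases = 228_333_067_000
total_reads = 18_120_500
genome_size = 2_000_000_000  # "2g" in the config

# Raw sequencing depth = total bases / genome size.
raw_depth = total_bases / genome_size

# Mean read length = total bases / number of reads.
mean_read_length = total_bases / total_reads

print(f"raw depth: {raw_depth:.1f}x")
print(f"mean read length: {mean_read_length:.0f} bp")
```

Both figures come out close to the "~110x" and "12k" quoted in the post, so the table is internally consistent.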

1. How can I determine the best values of read_cutoff and seed_cutoff, and the best combination of the two?

I obtained four assemblies using different seed_cutoff values with all other parameters unchanged (read_cutoff = 10k).

Run	seed_cutoff (seed_depth)	Contig N50 (Mb)	Contig Num	Contig size (Gb)
run1	19436 (50)	13.71	421	1.95
run2	20000	14.66	403	1.95
run3	20553 (45)	14.29	391	1.95
run4	24645 (30)	12.22	491	1.95

I also obtained two assemblies using different read_cutoff values with all other parameters unchanged (seed_cutoff = 20k).

Run	read_cutoff	Contig N50 (Mb)	Contig Num	Contig size (Gb)
run2	10k	14.66	403	1.95
run5	1k	10.48	816	2.00

run2.cfg

[General]
job_type = sge 
job_prefix = nextDenovo 
task = all 
rewrite = no 
deltmp = yes 
rerun = 
parallel_jobs = 20 
input_type = raw 
input_fofn = run.fofn
read_type = clr
workdir = 01_rundir
cluster_options = auto

[correct_option]
read_cutoff = 10k
seed_cutoff = 20k
genome_size = 2g
blocksize = 3g   
pa_correction = 20
seed_cutfiles = 10 
sort_options = -m 20g -t 80 -k 40
minimap2_options_raw = -x ava-pb -t 80
correction_options = -p 80

[assemble_option]
random_round = 50
minimap2_options_cns = -x ava-pb -t 80 -k17 -w17
minimap2_options_map = -t 80
nextgraph_options = -a 1

Based on the above results, I conclude that seed_cutoff and read_cutoff have a large impact on the final assembly quality. However, I am unsure how to find the best value for each, and the best combination of the two.

2. How can I quickly obtain the final result after changing a few parameters, without rerunning from beginning to end?

Currently, I have to re-run the software from beginning to end after each parameter change, which takes a long time. Is there a way to quickly get the final result by modifying only one or a few parameters?

I look forward to your suggestions, and please don't hesitate to let me know if you need additional information.

Hu Jiang · Answer 1 · Fri Feb 05 2021 09:37:19 GMT+0800 (China Standard Time)

Thanks for your feedback!

1. You can use seq_stat to calculate seed_cutoff. The -d option in seq_stat can usually be set to 30-45, so you need to try different values. I don't have a better suggestion; if I knew of a better value, I would set it as the default.
2. If you change read_cutoff or seed_cutoff, you need to rerun from beginning to end. If you change only nextgraph_options, just run the main task again, and NextDenovo will rerun only the assembly step.
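
To illustrate how a seed depth can translate into a seed_cutoff, here is a minimal Python sketch. It assumes the cutoff is the read length at which the longest reads together reach seed_depth x genome_size. That is a guess at the idea behind seq_stat, not its actual implementation, and the function name and toy data are hypothetical.

```python
def seed_cutoff_for_depth(read_lengths, genome_size, seed_depth, min_len=0):
    """Return the read length at which the longest reads (each >= min_len)
    accumulate seed_depth * genome_size bases, or None if they never do.

    Assumption: this mirrors the general idea of picking a seed cutoff from
    a target seed depth; the real seq_stat tool may compute it differently.
    """
    target = genome_size * seed_depth
    total = 0
    usable = (n for n in read_lengths if n >= min_len)
    for length in sorted(usable, reverse=True):
        total += length
        if total >= target:
            return length
    return None


# Toy example: a 10 bp "genome" and six reads.
toy_reads = [30, 25, 20, 15, 10, 5]
low_depth_cutoff = seed_cutoff_for_depth(toy_reads, genome_size=10, seed_depth=5)
high_depth_cutoff = seed_cutoff_for_depth(toy_reads, genome_size=10, seed_depth=8)
print(low_depth_cutoff, high_depth_cutoff)
```

Under this assumption, a larger seed depth gives a smaller cutoff, which matches the trend in the runs above (depth 50 gave 19436, depth 30 gave 24645).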

Hans · Answer 2 · Fri Feb 05 2021 11:46:41 GMT+0800 (China Standard Time)

Thanks for your reply.

Based on your experience, could you please suggest a strategy for finding the optimal values of seed_cutoff and read_cutoff? For example:

1. Can the -f option in seq_stat be treated as read_cutoff?
2. Do these two values affect the result in a linear or non-linear way?
3. Should we find the optimal value of each parameter individually, or do we need to consider different combinations of these two parameters, or of other parameters?

Hu Jiang · Answer 3 · Fri Feb 05 2021 13:04:46 GMT+0800 (China Standard Time)

1. Yes.
2. Not tested.
3. Different combinations.
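
Since the advice is to test different combinations, one config per (read_cutoff, seed_cutoff) pair can be generated from a base file. The following is a minimal Python sketch using the standard configparser module. The trimmed base config, the workdir naming scheme and the function name are illustrative assumptions. Whether NextDenovo reads the rewritten files exactly like hand-written ones has not been verified here.

```python
import configparser
import io
import itertools

# Trimmed-down base config; a real run would start from the full run2.cfg.
BASE_CFG = """
[General]
input_fofn = run.fofn
workdir = 01_rundir

[correct_option]
read_cutoff = 10k
seed_cutoff = 20k
genome_size = 2g
"""


def make_variants(base_text, read_cutoffs, seed_cutoffs):
    """Return {(read_cutoff, seed_cutoff): config_text} for every combination."""
    variants = {}
    for rc, sc in itertools.product(read_cutoffs, seed_cutoffs):
        cfg = configparser.ConfigParser()
        cfg.optionxform = str  # keep option names exactly as written
        cfg.read_string(base_text)
        cfg["correct_option"]["read_cutoff"] = rc
        cfg["correct_option"]["seed_cutoff"] = sc
        # Hypothetical naming scheme so that runs do not overwrite each other.
        cfg["General"]["workdir"] = f"rundir_rc{rc}_sc{sc}"
        buf = io.StringIO()
        cfg.write(buf)
        variants[(rc, sc)] = buf.getvalue()
    return variants


# Cutoff values taken from the runs reported in this thread.
variants = make_variants(BASE_CFG, ["1k", "10k"], ["19436", "20553", "24645"])
for (rc, sc), text in variants.items():
    print(f"# read_cutoff={rc} seed_cutoff={sc}: {len(text.splitlines())} lines")
```

Each variant still has to be run from beginning to end, as noted in Answer 1, because both cutoffs affect the correction step.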