Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads

Decreased N50 with higher sequencing depth

nadegeguiglielmoni opened this issue

Hello,

I have been running some tests with NextDenovo 2.2 on one genome for which I have high-coverage PacBio and Nanopore reads. For each dataset separately, I subsampled the reads to different sequencing depths (10X, 20X, ..., 100X). The N50 was highest at 40-50X and then decreased at higher sequencing depths. The species is diploid with variable levels of heterozygosity, including some highly heterozygous regions, so my hypothesis is that a higher sequencing depth gives more support to alternative haplotypes and leads to breaks in the assembly. Could you give me some insights?
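For readers comparing the numbers in this thread, here is a minimal sketch of how N50 is conventionally computed from a list of contig lengths. The function name `n50` is only illustrative and this is not code from NextDenovo.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input: no contigs, so no N50


if __name__ == "__main__":
    # Toy contig lengths, not data from this issue.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because N50 depends only on contig lengths, it can go up while gene completeness changes, so it is best read alongside the assembly size and BUSCO counts.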

Hi, could you provide your config files?
BTW, you should update to the latest version.

We have updated NextDenovo for future projects.

Here is the config file:

[General]
job_type = local
job_prefix = ND_ont
task = assemble # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun = 10
parallel_jobs = 10
input_type = raw
input_fofn = ./input.fofn
workdir = ./run

[assemble_option]
minimap2_options_raw = -x ava-ont -t 10
random_round = 20
minimap2_options_cns = -x ava-ont -t 8 -k17 -w17
nextgraph_options = -a 1
seed_cutoff = HereSeedCutoff

How about the seed_cutoff value for different depths?

We set it to 1001.

OK, I think this may be the core of the problem. You can try calculating the seed_cutoff value using bin/seq_stat; see #103. Usually, assembly quality is affected by read length, not depth.
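To illustrate why a fixed seed_cutoff behaves differently at different depths, here is a rough sketch of one common way to derive a length cutoff: pick the shortest read length such that all reads at least that long add up to a target seed depth. This is an assumption about the general idea, not the actual logic of bin/seq_stat, and the function and parameter names (`seed_cutoff_for_depth`, `genome_size`, `seed_depth`) are hypothetical.

```python
def seed_cutoff_for_depth(read_lengths, genome_size, seed_depth):
    """Return a length cutoff so that reads >= cutoff give ~seed_depth coverage.

    Hypothetical helper for illustration only; it does not reproduce
    NextDenovo's bin/seq_stat.
    """
    target_bases = genome_size * seed_depth
    running = 0
    # Walk from the longest read down until the target depth is reached.
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running >= target_bases:
            return length
    return None  # not enough data to reach the requested seed depth


if __name__ == "__main__":
    # Toy numbers, not data from this issue.
    print(seed_cutoff_for_depth([10, 8, 6, 4, 2], genome_size=6, seed_depth=4))
```

Under this sketch, a cutoff of about 1 kb at 100X would let almost every read act as a seed, whereas a depth-aware cutoff rises as more long reads become available.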

Ok thank you, I will try optimizing the seed cutoffs.

Hello,

We ran the assemblies again with better-adapted seed cutoffs. For the PacBio assemblies, there is little change. For the Nanopore assemblies, there is still a drop in N50 at 60X. The N50 is better for the assemblies at 80X and 100X, but the BUSCO score is drastically lower than in the previous assemblies.

Thanks for your feedback. Assembly quality does not scale linearly with the depth and length of the input data; it also depends on the characteristics of the genome. However, the BUSCO score should be similar, so could you share more details (assembly options and BUSCO values) about the drastic decrease in BUSCO score compared to the previous assemblies?

Hello,

The parameters were the same as before, except for seed cutoff.

Here are the results I had before with Nanopore reads:
40X: N50 = 11.5-14.5 Mb, single BUSCOs = 312-388, duplicated BUSCOs = 12-24
50X: N50 = 11.0-13.8 Mb, single BUSCOs = 362-393, duplicated BUSCOs = 14-27
60X: N50 = 4.7-8.1 Mb, single BUSCOs = 668-685, duplicated BUSCOs = 79-98
80X: N50 = 4.1-10.1 Mb, single BUSCOs = 665-695, duplicated BUSCOs = 78-91
100X: N50 = 2.6-7.0 Mb, single BUSCOs = 663-683, duplicated BUSCOs = 80-105

And here are the results with an "improved" seed cutoff:
40X: N50 = 11.6-14.7 Mb, single BUSCOs = 319-392, duplicated BUSCOs = 12-24
50X: N50 = 10.8-14.8 Mb, single BUSCOs = 348-386, duplicated BUSCOs = 19-23
60X: N50 = 6.0-8.8 Mb, single BUSCOs = 674-694, duplicated BUSCOs = 72-87
80X: N50 = 10.0-13.7 Mb, single BUSCOs = 362-398, duplicated BUSCOs = 19-31
100X: N50 = 10.7-12.4 Mb, single BUSCOs = 404-420, duplicated BUSCOs = 25-43

Hi, could you provide the estimated genome size and the assembly size? Did you subsample the reads randomly, or just select the longest reads?
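The two subsampling strategies mentioned here produce very different read-length distributions at the same nominal depth. The sketch below contrasts them on a plain list of read lengths; all names (`subsample_random`, `subsample_longest`, `take_until`) are hypothetical and it does not reflect how the reads in this issue were actually subsampled.

```python
import random


def take_until(lengths, target_bases):
    """Take reads in the given order until target_bases is reached."""
    picked, total = [], 0
    for length in lengths:
        if total >= target_bases:
            break
        picked.append(length)
        total += length
    return picked


def subsample_random(read_lengths, genome_size, depth, seed=0):
    """Random subsample: keeps the original length distribution on average."""
    shuffled = list(read_lengths)
    random.Random(seed).shuffle(shuffled)
    return take_until(shuffled, genome_size * depth)


def subsample_longest(read_lengths, genome_size, depth):
    """Longest-first subsample: enriches for long reads at the same depth."""
    return take_until(sorted(read_lengths, reverse=True), genome_size * depth)


if __name__ == "__main__":
    # Toy read lengths, not data from this issue.
    reads = [1, 5, 3, 2, 4]
    print(subsample_longest(reads, genome_size=3, depth=3))
    print(subsample_random(reads, genome_size=3, depth=3))
```

With longest-first selection, a "40X" subset contains much longer reads than a random 40X subset, which matters when comparing N50 across depths.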