Decreased N50 with higher sequencing depth

nadegeguiglielmoni opened this issue 3 years ago

Nadège Guiglielmoni commented 3 years ago

Hello,

I have been running some tests with NextDenovo 2.2 on one genome for which I have high coverage of both PacBio and Nanopore reads. For each dataset separately, I tried subsampling the reads to different sequencing depths (10X, 20X... 100X). I found that the N50 was highest at 40-50X and then decreased at higher sequencing depths. As the species is diploid with variable levels of heterozygosity, including some highly heterozygous regions, my hypothesis is that a higher sequencing depth gives more support to alternative haplotypes and leads to breaks in the assembly. Could you give me some insights?
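For readers comparing the figures in this thread, here is a minimal sketch of how N50 is usually computed from contig lengths: it is the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size. The function name `n50` is only illustrative and is not part of NextDenovo.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of
    the contig at which the cumulative sum first reaches half of the
    total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare against half the total without using floats.
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Toy contig lengths, not data from this issue.
    print(n50([10, 8, 6, 4, 2]))
```

Because N50 only reflects contiguity, a fragmented region that splits one long contig can lower it noticeably even if the total assembly size barely changes.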

Hu Jiang · Answer 1 · Wed Mar 03 2021 08:31:17 GMT+0800 (China Standard Time)

Hi, could you provide your config files?
BTW, you should update to the latest version.

Nadège Guiglielmoni · Answer 2 · Wed Mar 03 2021 22:51:35 GMT+0800 (China Standard Time)

We have updated NextDenovo for future projects.

Here is the config file:

[General]
job_type = local
job_prefix = ND_ont
task = assemble # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun = 10
parallel_jobs = 10
input_type = raw
input_fofn = ./input.fofn
workdir = ./run

[assemble_option]
minimap2_options_raw = -x ava-ont -t 10
random_round = 20
minimap2_options_cns = -x ava-ont -t 8 -k17 -w17
nextgraph_options = -a 1
seed_cutoff = HereSeedCutoff

Hu Jiang · Answer 3 · Wed Mar 03 2021 23:14:03 GMT+0800 (China Standard Time)

How about the seed_cutoff value for different depths?

Nadège Guiglielmoni · Answer 4 · Thu Mar 04 2021 00:06:40 GMT+0800 (China Standard Time)

We set it to 1001.

Hu Jiang · Answer 5 · Thu Mar 04 2021 09:12:07 GMT+0800 (China Standard Time)

OK, I think this may be the core of the problem. You can try calculating the seed_cutoff value using bin/seq_stat; see #103. Usually, assembly quality is affected by read length, not depth.
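To illustrate why a fixed cutoff behaves differently at different depths, here is a rough sketch of one way a depth-aware seed cutoff can be derived: pick the read length at which the longest reads together reach a target seed depth. This is an assumption about the general idea, not the actual implementation of bin/seq_stat; the function name, the `seed_depth` and `min_len` parameters, and their defaults are all hypothetical.

```python
def seed_cutoff(read_lengths, genome_size, seed_depth, min_len=1000):
    """Return a length cutoff such that reads at least this long
    add up to roughly seed_depth x genome_size bases.

    Hypothetical sketch only: this is NOT the code behind
    NextDenovo's bin/seq_stat.
    """
    target = genome_size * seed_depth
    running = 0
    for length in sorted(read_lengths, reverse=True):
        if length < min_len:
            break  # remaining reads are too short to serve as seeds
        running += length
        if running >= target:
            return length
    # Not enough long reads to reach the target depth:
    # fall back to the minimum allowed length.
    return min_len


if __name__ == "__main__":
    # Toy numbers, not data from this issue.
    reads = [10000, 8000, 6000, 4000, 2000]
    print(seed_cutoff(reads, genome_size=1000, seed_depth=20))
```

Under this sketch, a very low fixed cutoff such as 1001 would admit far more seed reads at 100X than at 40X, whereas a computed cutoff rises with depth and keeps the seed depth roughly constant.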

Nadège Guiglielmoni · Answer 6 · Thu Mar 04 2021 18:50:40 GMT+0800 (China Standard Time)

Ok thank you, I will try optimizing the seed cutoffs.

Nadège Guiglielmoni · Answer 7 · Mon Mar 08 2021 18:40:34 GMT+0800 (China Standard Time)

Hello,

We ran the assemblies again with better-adapted seed cutoffs. For PacBio assemblies, there is little change. For Nanopore assemblies, there is still a drop in N50 at 60X. The N50 is better for the assemblies at 80X and 100X, but their BUSCO scores have decreased drastically compared to the previous assemblies.

Hu Jiang · Answer 8 · Tue Mar 09 2021 09:46:41 GMT+0800 (China Standard Time)

Thanks for your feedback. Assembly quality is not simply linear with the depth and length of the input data; it also depends on the characteristics of the genome. However, the BUSCO scores should be similar, so could you share more details (assembly options and BUSCO values) about the BUSCO scores that decreased drastically compared to the previous assemblies?

Nadège Guiglielmoni · Answer 9 · Tue Mar 09 2021 19:16:11 GMT+0800 (China Standard Time)

Hello,

The parameters were the same as before, except for seed cutoff.

Here are the results I had before with Nanopore reads:
40X: N50 = 11.5-14.5 Mb, single BUSCOs = 312-388, duplicated BUSCOs = 12-24
50X: N50 = 11.0-13.8 Mb, single BUSCOs = 362-393, duplicated BUSCOs = 14-27
60X: N50 = 4.7-8.1 Mb, single BUSCOs = 668-685, duplicated BUSCOs = 79-98
80X: N50 = 4.1-10.1 Mb, single BUSCOs = 665-695, duplicated BUSCOs = 78-91
100X: N50 = 2.6-7.0 Mb, single BUSCOs = 663-683, duplicated BUSCOs = 80-105

And here are the results with an "improved" seed cutoff:
40X: N50 = 11.6-14.7 Mb, single BUSCOs = 319-392, duplicated BUSCOs = 12-24
50X: N50 = 10.8-14.8 Mb, single BUSCOs = 348-386, duplicated BUSCOs = 19-23
60X: N50 = 6.0-8.8 Mb, single BUSCOs = 674-694, duplicated BUSCOs = 72-87
80X: N50 = 10.0-13.7 Mb, single BUSCOs = 362-398, duplicated BUSCOs = 19-31
100X: N50 = 10.7-12.4 Mb, single BUSCOs = 404-420, duplicated BUSCOs = 25-43

Hu Jiang · Answer 10 · Thu Mar 11 2021 14:44:59 GMT+0800 (China Standard Time)

Hi, could you provide the estimated genome size and the assembly size? Did you randomly subsample the reads, or just select the longest reads?
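The two subsampling strategies mentioned in this question are likely to give quite different read-length distributions at the same nominal depth. Below is a minimal sketch of both, assuming reads are available as `(name, length)` pairs and that the target is expressed in total bases (depth times estimated genome size). All function names are hypothetical and this is not how the subsampling in this issue was necessarily done.

```python
import random


def take_until(reads, target_bases):
    """Take reads in the given order until target_bases is reached."""
    picked, total = [], 0
    for name, length in reads:
        if total >= target_bases:
            break
        picked.append((name, length))
        total += length
    return picked


def random_subsample(reads, target_bases, seed=0):
    """Shuffle reads with a fixed seed, then take until the target."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    return take_until(shuffled, target_bases)


def longest_subsample(reads, target_bases):
    """Take the longest reads first until the target is reached."""
    ordered = sorted(reads, key=lambda r: r[1], reverse=True)
    return take_until(ordered, target_bases)


if __name__ == "__main__":
    # Toy reads, not data from this issue.
    reads = [("a", 1000), ("b", 5000), ("c", 3000), ("d", 2000), ("e", 4000)]
    print(longest_subsample(reads, 9000))
    print(random_subsample(reads, 9000, seed=1))
```

Random subsampling preserves the original length distribution, while longest-first selection shifts it toward long reads, so the same seed_cutoff would select a different share of the data in each case.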