Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads

segmentation fault building ctg_graph using NEXTDENOVO/2.4.0

gitcruz opened this issue

Describe the bug
I am running an assembly of a 1.7 Gb heterozygous genome (1.2% het rate) on a 2 TB machine. The ONT data is 50x of the highest-quality reads (selected with Filtlong, ≥5 kb and 150 Gb).

1st config file (24 CPUs, 1 TB total RAM):
[General]
job_type = local
task = all
rewrite = yes
parallel_jobs = 4
deltmp = yes
read_type = ont
input_type = raw
workdir = /WORKDIR/
input_fofn = /WORKDIR/long_reads.fofn
[correct_option]
read_cutoff = 1k
genome_size = 1.8g
seed_depth = 45
seed_cutoff = 0
blocksize = 1g
pa_correction = 4
minimap2_options_raw = -t 6 -x ava-ont
sort_options = -m 40g -t 20
correction_options = -p 6

[assemble_option]
minimap2_options_cns = -t 6 -x ava-ont -k17 -w17
minimap2_options_map = -t 6 -x ava-ont
nextgraph_options = -a 1

2nd config file (48 CPUs, 2 TB total RAM):
[General]
job_type = local
task = all
rewrite = yes
parallel_jobs = 8
deltmp = yes
read_type = ont
input_type = raw
workdir = /WORKDIR/
input_fofn = /WORKDIR/long_reads.fofn

[correct_option]
read_cutoff = 1k
genome_size = 1.8g
seed_depth = 45
seed_cutoff = 0
blocksize = 1g
pa_correction = 4
minimap2_options_raw = -t 6 -x ava-ont
sort_options = -m 40g -t 20
correction_options = -p 6

[assemble_option]
minimap2_options_cns = -t 6 -x ava-ont -k17 -w17
minimap2_options_map = -t 6 -x ava-ont
nextgraph_options = -a 1
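
Both configs look sized so that parallel_jobs multiplied by the minimap2 thread count (-t 6) matches the CPUs available: 4 x 6 = 24 and 8 x 6 = 48. A minimal sketch of that arithmetic follows. It assumes each parallel job runs one minimap2 process with the given -t value; the helper names are illustrative and not part of NextDenovo.

```python
import shlex


def threads_from_options(options: str, flag: str = "-t", default: int = 1) -> int:
    """Extract the thread count from an option string such as '-t 6 -x ava-ont'."""
    tokens = shlex.split(options)
    for i, token in enumerate(tokens):
        # Form "-t 6": the value is the next token.
        if token == flag and i + 1 < len(tokens):
            return int(tokens[i + 1])
        # Form "-t6": the value is glued to the flag.
        if token.startswith(flag) and token[len(flag):].isdigit():
            return int(token[len(flag):])
    return default


def peak_threads(parallel_jobs: int, options: str) -> int:
    """Threads in use if every parallel job runs with the same option string."""
    return parallel_jobs * threads_from_options(options)


# Values taken from the two configs above.
print(peak_threads(4, "-t 6 -x ava-ont"))  # 1st config: 24
print(peak_threads(8, "-t 6 -x ava-ont"))  # 2nd config: 48
```

This only covers the CPU budget. Memory per job is a separate question and is not modelled here.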

Error message
After 10 days the assembly failed with an I/O error at the 02.cns_align step (see the first config). I removed this folder and resubmitted the assembly with more memory (2nd config). It ran smoothly, but now it consistently fails at the ctg_graph step. The error is:
hostname
+ hostname
cd /WORKDIR/03.ctg_graph/01.ctg_graph.sh.work/ctg_graph0
+ cd /WORKDIR/03.ctg_graph/01.ctg_graph.sh.work/ctg_graph0
time /apps/NEXTDENOVO/2.4.0/bin/nextgraph -a 1 -f /WORKDIR/03.ctg_graph/01.ctg_graph.input.seqs /WORKDIR/03.ctg_graph/01.ctg_graph.input.ovls -o nd.asm.p.fasta;
+ /apps/NEXTDENOVO/2.4.0/bin/nextgraph -a 1 -f /WORKDIR/03.ctg_graph/01.ctg_graph.input.seqs /WORKDIR/03.ctg_graph/01.ctg_graph.input.ovls -o nd.asm.p.fasta
[INFO] 2021-12-03 19:11:48 Initialize graph and reading...
/WORKDIR/03.ctg_graph/01.ctg_graph.sh.work/ctg_graph0/nextDenovo.sh: line 5: 19296 Segmentation fault /apps/NEXTDENOVO/2.4.0/bin/nextgraph -a 1 -f /WORKDIR/03.ctg_graph/01.ctg_graph.input.seqs /WORKDIR/03.ctg_graph/01.ctg_graph.input.ovls -o nd.asm.p.fasta

Genome characteristics
C-value =1.7Gb
GenomeScope results:
GenomeScope version 2.0
input file = jf_21mer.hist
output directory = out/21mer/
p = 2
k = 21

property                 min                 max
Homozygous (aa)          98.7068%            98.7307%
Heterozygous (ab)        1.26928%            1.29316%
Genome Haploid Length    1,208,134,973 bp    1,210,345,670 bp
Genome Repeat Length     399,334,371 bp      400,065,090 bp
Genome Unique Length     808,800,602 bp      810,280,580 bp
Model Fit                73.122%             95.132%
Read Error Rate          0.214032%           0.214032%

Input data
[Read length stat]
Types Count (#) Length (bp)
N10 266461 29793
N20 648378 23529
N30 1113845 19774
N40 1660889 16968
N50 2295837 14643
N60 3032994 12575
N70 3896295 10664
N80 4925021 8844
N90 6190301 7021

Types Count (#) Bases (bp) Depth (X)
Raw 7860332 100000021650 55.56
Filtered 0 0 0.00
Clean 7860332 100000021650 55.56
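
The Depth column above is consistent with total bases divided by the configured genome_size (1.8g), rather than the 1.7 Gb C-value or the GenomeScope haploid length. A minimal sketch of that arithmetic, assuming this is how the figure is derived (the function name is illustrative):

```python
def depth(total_bases: int, genome_size: float) -> float:
    """Sequencing depth as total bases divided by genome size."""
    return total_bases / genome_size


clean_bases = 100_000_021_650   # "Clean" row of the table above
genome_size = 1.8e9             # genome_size = 1.8g from the config

print(f"{depth(clean_bases, genome_size):.2f}x")  # 55.56x
```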

Config file
Last config used was:
[General]
job_type = local
task = all
rewrite = yes
parallel_jobs = 8
deltmp = yes
read_type = ont
input_type = raw
workdir = /WORKDIR/
input_fofn = /WORKDIR/long_reads.fofn

[correct_option]
read_cutoff = 1k
genome_size = 1.8g
seed_depth = 45
seed_cutoff = 0
blocksize = 1g
pa_correction = 4
minimap2_options_raw = -t 6 -x ava-ont
sort_options = -m 40g -t 40
correction_options = -p 6

[assemble_option]
minimap2_options_cns = -t 6 -x ava-ont -k17 -w17
minimap2_options_map = -t 6 -x ava-ont
nextgraph_options = -a 1
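
Comparing this with the 2nd config above, the only difference is the sort thread count (sort_options -t 20 vs -t 40). A small sketch for diffing two configs with Python's standard configparser, assuming the files parse as plain INI (which holds for the configs shown here); diff_configs is a hypothetical helper, not a NextDenovo tool:

```python
import configparser


def load(text: str) -> configparser.ConfigParser:
    """Parse config text; interpolation is disabled so '%' is taken literally."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


def diff_configs(a: str, b: str) -> dict:
    """Return {(section, key): (value_a, value_b)} for every key that differs."""
    pa, pb = load(a), load(b)
    changes = {}
    for section in sorted(set(pa.sections()) | set(pb.sections())):
        keys_a = dict(pa[section]) if pa.has_section(section) else {}
        keys_b = dict(pb[section]) if pb.has_section(section) else {}
        for key in sorted(set(keys_a) | set(keys_b)):
            if keys_a.get(key) != keys_b.get(key):
                changes[(section, key)] = (keys_a.get(key), keys_b.get(key))
    return changes


# Abbreviated excerpts of the 2nd and last configs from this report.
second = """[General]
parallel_jobs = 8
[correct_option]
sort_options = -m 40g -t 20
"""
last = """[General]
parallel_jobs = 8
[correct_option]
sort_options = -m 40g -t 40
"""

for (section, key), (old, new) in diff_configs(second, last).items():
    print(f"[{section}] {key}: {old} -> {new}")
# [correct_option] sort_options: -m 40g -t 20 -> -m 40g -t 40
```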

Operating system

LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 6.7 (Santiago)
Release: 6.7
Codename: Santiago

GCC
gcc version 6.3.0 (GCC)

Python
Python 3.8.6

NextDenovo
nextDenovo v2.4.0

To Reproduce (Optional)
Steps to reproduce the behavior. Providing a minimal test dataset on which we can reproduce the behavior will generally lead to quicker turnaround time!

Additional context (Optional)

I made three attempts and the error is always the same: line 5: 19296 Segmentation fault /apps/NEXTDENOVO/2.4.0/bin/nextgraph
Any idea what the problem could be?
I'll be happy to check some intermediate files.

The files listed in 01.ctg_graph.input.ovls (02.cns_align/*.cns.filt.dovt.ovl) are not empty; their sizes range from 43M to 195M.

The input seqs are also there:

for i in $(cat 03.ctg_graph/01.ctg_graph.input.seqs); do ls -sh $i; done
4.3G 02.cns_align/01.seed_cns.sh.work/seed_cns0/cns.fasta
4.4G 02.cns_align/01.seed_cns.sh.work/seed_cns1/cns.fasta
4.4G 02.cns_align/01.seed_cns.sh.work/seed_cns2/cns.fasta
2.7G 02.cns_align/01.seed_cns.sh.work/seed_cns3/cns.fasta
4.4G 02.cns_align/01.seed_cns.sh.work/seed_cns4/cns.fasta

Any ideas or suggestions on how to fix this problem are welcome!

Thanks

Hi, see #113 to fix this error.