Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads

Long time for assembling genome

ejlladkw opened this issue · comments

Describe the bug
Generating the corrected reads took 4 days, but three weeks were not enough to assemble the genome.

Genome characteristics
3.1G, heterozygous rate, 60%...

Input data
Total base count: 206 Gb, sequencing depth: 70x, average read length: 11 k...

Config file
Please paste the complete content of the Config file (run.cfg) to here.
[General]
job_type = local # local, slurm, sge, pbs, lsf
job_prefix = nextDenovo_lamprey_jing
task = all # all, correct, assemble
rewrite = yes # yes/no
deltmp = yes
parallel_jobs = 23 # number of tasks used to run in parallel
input_type = raw # raw, corrected
read_type = ont # clr, ont, hifi
input_fofn = input.fofn
workdir = 01_rundir

[correct_option]
read_cutoff = 5k
#seed_cutoff = 22650 bp
genome_size = 3.0g # estimated genome size
sort_options = -m 60g -t 11
minimap2_options_raw = -t 11
pa_correction = 23 # number of corrected tasks used to run in parallel, each corrected task requires ~TOTAL_INPUT_BASES/4 bytes of memory usage.
correction_options = -p 11

[assemble_option]
minimap2_options_cns = -t 11
nextgraph_options = -a 1

see https://nextdenovo.readthedocs.io/en/latest/OPTION.html for a detailed introduction about all the parameters

Operating system
Which operating system and version are you using?
You can use the command lsb_release -a to get it.
centos 7.1.el8_5

GCC
What version of GCC are you using?
You can use the command gcc -v to get it.
gcc version 8.5.0 20210514 (Red Hat 8.5.0-4) (GCC)

Python
What version of Python are you using?
You can use the command python --version to get it.
Python 3.11.0

NextDenovo
What version of NextDenovo are you using?
You can use the command nextDenovo -v to get it.
nextDenovo 2.5.2

To Reproduce (Optional)
Steps to reproduce the behavior. Providing a minimal test dataset on which we can reproduce the behavior will generally lead to quicker turnaround time!

Additional context (Optional)
Add any other context about the problem here.

Please provide more log info; otherwise I can't see where the problem is.

[1969810 INFO] 2023-03-07 22:25:20 NextDenovo start...
[1969810 INFO] 2023-03-07 22:25:20 version:v2.5.0 logfile:pid1969810.log.info
[1969810 WARNING] 2023-03-07 22:25:20 Re-write workdir
[1969810 INFO] 2023-03-07 22:25:20 mkdir: /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir
[1969810 INFO] 2023-03-07 22:25:20 mkdir: /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/01.raw_align
[1969810 INFO] 2023-03-07 22:25:20 mkdir: /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align
[1969810 INFO] 2023-03-07 22:25:20 mkdir: /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/03.ctg_graph
[1969810 INFO] 2023-03-07 22:25:25 Total jobs: 1
[1969810 INFO] 2023-03-07 22:25:25 Submitted jobID:[1969924] jobCmd:[/disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/01.raw_align/01.db_stat.sh.work/db_stat1/nextDenovo_hagfish_jirou.sh] in the local_cycle.
[1969810 INFO] 2023-03-07 22:41:04 db_stat done
[1969810 INFO] 2023-03-07 22:41:04 updated options:
rerun: 3
task: all
deltmp: 1
rewrite: 1
read_type: ont
job_type: local
input_type: raw
read_cutoff: 1k
parallel_jobs: 32
seed_depth: 29.35
pa_correction: 32
seed_cutfiles: 33
genome_size: 2.4g
seed_cutoff: 10000
blocksize: 6013613746
ctg_cns_options: -p 30
nextgraph_options: -a 1
sort_options: -m 20g -t 30 -k 27
minimap2_options_map: -x map-ont
job_prefix: nextDenovo_hagfish_jirou
minimap2_options_raw: -t 38 -x ava-ont
correction_options: -p 30 -max_lq_length 10000 -min_len_seed 5000
workdir: /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir
input_fofn: /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/input.fofn
minimap2_options_cns: -t 38 -x ava-ont -k 17 -w 17 --minlen 1000 --maxhan1 5000
raw_aligndir: /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/01.raw_align
cns_aligndir: /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align
ctg_graphdir: /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/03.ctg_graph
[1969810 INFO] 2023-03-07 22:41:04 summary of input data:
file: /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/01.raw_align/input.reads.stat
[Read length stat]
Types      Count (#)    Length (bp)
N10            49301         115661
N20           138288          72379
N30           263012          56843
N40           415340          47664
N50           595553          40372
N60           810205          33412
N70          1077609          25836
N80          1446548          17529
N90          2051508           9268

Types      Count (#)     Bases (bp)    Depth (X)
Raw          4542367    79290734691        33.04
Filtered      512168      287315113         0.12
Clean        4030199    79003419578        32.92

Suggested seed_cutoff (genome size: 2400.00Mb, expected seed depth: 45, real seed depth: 29.35): 10000 bp
[1969810 INFO] 2023-03-07 22:41:09 Total jobs: 1
[1969810 INFO] 2023-03-07 22:41:09 Submitted jobID:[1997142] jobCmd:[/disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/01.raw_align/02.db_split.sh.work/db_split1/nextDenovo_hagfish_jirou.sh] in the local_cycle.
[1969810 INFO] 2023-03-07 22:57:38 db_split done
[1969810 INFO] 2023-03-07 22:57:38 Total jobs: 627
[1969810 INFO] 2023-03-07 22:57:38 Submitted jobID:[2147521] jobCmd:[/disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/01.raw_align/03.raw_align.sh.work/raw_align0/raw_align001/nextDenovo_hagfish_jirou.sh] in the local_cycle.
[1969810 INFO] 2023-03-07 22:57:39 Submitted jobID:[2147527] jobCmd:[/disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/01.raw_align/03.raw_align.sh.work/raw_align0/raw_align002/nextDenovo_hagfish_jirou.sh] in the local_cycle.
[1969810 INFO] 2023-03-07 22:57:39 Submitted jobID:[2147536] jobCmd:[/disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/01.raw_align/03.raw_align.sh.work/raw_align0/raw_align003/nextDenovo_hagfish_jirou.sh] in the local_cycle.
[1969810 INFO] 2023-03-09 00:30:32 Submitted jobID:[930404] jobCmd:[/disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns32/nextDenovo_hagfish_jirou.sh] in the local_cycle.
[1969810 INFO] 2023-03-09 02:21:01 Submitted jobID:[1085301] jobCmd:[/disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns33/nextDenovo_hagfish_jirou.sh] in the local_cycle.
[1969810 INFO] 2023-03-09 02:52:46 seed_cns done
[1969810 INFO] 2023-03-09 02:52:46 seed_cns finished, and final corrected reads file:
[1969810 INFO] 2023-03-09 02:52:46 /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns*/cns.fasta
[1969810 INFO] 2023-03-09 02:52:46 Total jobs: 561
[1969810 INFO] 2023-03-09 02:52:46 Submitted jobID:[1118579] jobCmd:[/disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/02.cns_align.sh.work/cns_align0/cns_align001/nextDenovo_hagfish_jirou.sh] in the local_cycle.
[1969810 INFO] 2023-03-29 14:17:47 Submitted jobID:[585614] jobCmd:[/disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/02.cns_align.sh.work/cns_align1/cns_align382/nextDenovo_hagfish_jirou.sh] in the local_cycle.
[1969810 INFO] 2023-03-29 15:22:06 Submitted jobID:[598233] jobCmd:[/disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/02.cns_align.sh.work/cns_align1/cns_align383/nextDenovo_hagfish_jirou.sh] in the local_cycle.
[1969810 INFO] 2023-03-29 17:41:44 Submitted jobID:[618358] jobCmd:[/disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/02.cns_align.sh.work/cns_align1/cns_align384/nextDenovo_hagfish_jirou.sh] in the local_cycle.
[1969810 WARNING] 2023-03-29 18:20:07 Accepted a killed signal and killing all running jobs, please wait...
[1969810 WARNING] 2023-03-29 18:20:08 Killed all running jobs done
[1969810 WARNING] 2023-03-29 18:20:08 Exit!
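
For reference, the stage durations implied by the timestamps in this log can be worked out with a short script. This is a minimal sketch and not part of NextDenovo: the helper name `elapsed_hours` is hypothetical, and the timestamps and job counts are copied by hand from the log lines above.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def elapsed_hours(start: str, end: str) -> float:
    """Return the wall-clock hours between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600

# Timestamps copied from the log above.
run_start = "2023-03-07 22:25:20"      # "NextDenovo start..."
seed_cns_done = "2023-03-09 02:52:46"  # "seed_cns done", cns_align jobs begin
killed = "2023-03-29 18:20:07"         # kill signal received

correction_h = elapsed_hours(run_start, seed_cns_done)
cns_align_h = elapsed_hours(seed_cns_done, killed)

print(f"read correction: {correction_h:.1f} h")
print(f"cns_align until kill: {cns_align_h:.1f} h ({cns_align_h / 24:.1f} days)")
# The last cns_align job submitted before the kill was number 384 of 561.
print(f"cns_align jobs submitted: 384 of 561 ({384 / 561:.0%})")
```

By these timestamps, read correction took a little over a day, while the cns_align step had been running for roughly 20.6 days with about two thirds of its jobs submitted when the run was killed.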

Could you paste the content of some log files /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/02.cns_align.sh.work/cns_align1/cns_align*/nextDenovo_hagfish_jirou.sh.e to here?

hostname
+ hostname
cd /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/02.cns_align.sh.work/cns_align1/cns_align311
+ cd /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/02.cns_align.sh.work/cns_align1/cns_align311
( time /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jing/NextDenovo/bin/minimap2-nd -I 6G --step 2 --dual=yes -t 38 -x ava-ont -k 17 -w 17 --minlen 1000 --maxhan1 5000 /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns12/cns.fasta /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns14/cns.fasta -o cns.filt.dovt.ovl; )
+ /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jing/NextDenovo/bin/minimap2-nd -I 6G --step 2 --dual=yes -t 38 -x ava-ont -k 17 -w 17 --minlen 1000 --maxhan1 5000 /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns12/cns.fasta /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns14/cns.fasta -o cns.filt.dovt.ovl
[M::mm_idx_gen::85.004*0.36] collected minimizers
[M::mm_idx_gen::103.302*0.43] sorted minimizers
[M::main::103.302*0.43] loaded/built the index for 10385 target sequence(s)
[M::mm_mapopt_update::106.841*0.43] mid_occ = 791
[M::mm_idx_stat] kmer size: 17; skip: 17; is_hpc: 0; #seq: 10385
[M::mm_idx_stat::108.944*0.42] distinct minimizers: 26474604 (74.35% are singletons); average occurrences: 2.972; average spacing: 7.697
[M::worker_pipeline::127251.781*7.96] mapped 7262 sequences
[M::worker_pipeline::192840.088*7.87] mapped 3142 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jing/NextDenovo/bin/minimap2-nd -I 6G --step 2 --dual=yes -t 38 -x ava-ont -k 17 -w 17 --minlen 1000 --maxhan1 5000 -o cns.filt.dovt.ovl /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns12/cns.fasta /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns14/cns.fasta
[M::main] Real time: 192843.065 sec; CPU: 1517043.805 sec; Peak RSS: 28.613 GB

real 3214m5.035s
user 25271m1.791s
sys 13m2.492s
touch /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/02.cns_align.sh.work/cns_align1/cns_align311/nextDenovo_hagfish_jirou.sh.done
+ touch /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/02.cns_align.sh.work/cns_align1/cns_align311/nextDenovo_hagfish_jirou.sh.done
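
The final summary line of this job log gives both wall-clock and CPU time, so the average number of busy cores can be derived from it. The sketch below is only an illustration: the function name `cpu_utilisation` is hypothetical, and the summary line is copied from the log above.

```python
import re

def cpu_utilisation(summary_line: str) -> tuple[float, float]:
    """Parse minimap2's final "Real time ...; CPU: ..." line.

    Returns (wall-clock hours, average number of busy CPU cores).
    """
    m = re.search(r"Real time: ([\d.]+) sec; CPU: ([\d.]+) sec", summary_line)
    if m is None:
        raise ValueError("not a minimap2 summary line")
    real_s, cpu_s = float(m.group(1)), float(m.group(2))
    return real_s / 3600, cpu_s / real_s

line = "[M::main] Real time: 192843.065 sec; CPU: 1517043.805 sec; Peak RSS: 28.613 GB"
hours, busy_cores = cpu_utilisation(line)
threads_requested = 38  # from "-t 38" in the command above

print(f"wall time: {hours:.1f} h")
print(f"average busy cores: {busy_cores:.2f} of {threads_requested} requested threads")
```

This single job ran for more than two days while keeping only about 8 of its 38 requested threads busy on average. One possible reading is that 32 parallel jobs, each asking for 38 threads, compete for the same cores, but the machine's core count is not given in this thread, so that interpretation is an assumption.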

The log file does not match the input config file; it seems you changed some parameters when running. I need more logs to figure out what happened. In any case, you can try increasing the values of -k and -w and setting --mode 1 in minimap2_options_cns (these changes may introduce more assembly errors, but in theory they should significantly reduce the running time).
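
As a rough illustration of that suggestion, the option line in run.cfg could be rewritten with a small helper like the one below. This is a sketch under assumptions: `set_cns_options` is a hypothetical helper, the values `-k 19 -w 19` are purely illustrative (the log shows 17 for both), and it assumes user-supplied values take precedence over the defaults NextDenovo adds.

```python
import re

def set_cns_options(cfg_text: str, new_options: str) -> str:
    """Replace the value of minimap2_options_cns in run.cfg text."""
    pattern = re.compile(r"^(minimap2_options_cns\s*=\s*).*$", re.MULTILINE)
    updated, count = pattern.subn(lambda m: m.group(1) + new_options, cfg_text)
    if count != 1:
        raise ValueError("expected exactly one minimap2_options_cns line")
    return updated

cfg = """[assemble_option]
minimap2_options_cns = -t 38
nextgraph_options = -a 1
"""

# -k 19 and -w 19 are illustrative values only; no specific values are
# recommended anywhere in this thread.
new_cfg = set_cns_options(cfg, "-t 38 -k 19 -w 19 --mode 1")
print(new_cfg)
```

Since only the assembly-stage overlap options change, it may be possible to keep the existing corrected reads rather than repeating the correction step, but whether a rerun reuses them depends on the rewrite and task settings, which should be checked against the NextDenovo documentation.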

I apologize for providing an incorrect config file. Here is the correct config file:
[General]
job_type = local # local, slurm, sge, pbs, lsf
job_prefix = nextDenovo_hagfish_jirou
task = all # all, correct, assemble
rewrite = yes # yes/no
deltmp = yes
parallel_jobs = 32 # number of tasks used to run in parallel
input_type = raw # raw, corrected
read_type = ont # clr, ont, hifi
input_fofn = input.fofn
workdir = 01_rundir

[correct_option]
read_cutoff = 1k
genome_size = 2.4g # estimated genome size
sort_options = -m 20g -t 30
minimap2_options_raw = -t 38
pa_correction = 33 # number of corrected tasks used to run in parallel, each corrected task requires ~TOTAL_INPUT_BASES/4 bytes of memory usage.
correction_options = -p 30

[assemble_option]
minimap2_options_cns = -t 38
nextgraph_options = -a 1

see https://nextdenovo.readthedocs.io/en/latest/OPTION.html for a detailed introduction about all the parameters
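
The comment on pa_correction gives a rule of thumb for memory use, which can be applied to this dataset as a quick sanity check. The sketch below assumes that "TOTAL_INPUT_BASES" means the raw base count reported in the log (79,290,734,691 bp) and uses 1 GB = 1e9 bytes; the function name is hypothetical.

```python
def correction_memory_gb(total_input_bases: int, pa_correction: int) -> tuple[float, float]:
    """Estimate memory from the config comment's rule of thumb.

    Each correction task needs roughly TOTAL_INPUT_BASES / 4 bytes.
    Returns (GB per task, GB for all parallel tasks), with 1 GB = 1e9 bytes.
    """
    per_task_bytes = total_input_bases / 4
    return per_task_bytes / 1e9, per_task_bytes * pa_correction / 1e9

raw_bases = 79_290_734_691  # "Raw" row of the read statistics in the log
per_task, total = correction_memory_gb(raw_bases, pa_correction=33)
print(f"~{per_task:.1f} GB per task, ~{total:.0f} GB for 33 parallel tasks")
```

Under these assumptions, each correction task needs about 20 GB, so running 33 of them at once would call for several hundred GB of RAM.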

Here are more log files in /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/02.cns_align.sh.work/cns_align0/cns_align296:
hostname
+ hostname
cd /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/02.cns_align.sh.work/cns_align0/cns_align296
+ cd /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/02.cns_align.sh.work/cns_align0/cns_align296
( time /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jing/NextDenovo/bin/minimap2-nd -I 6G --step 2 --dual=yes -t 38 -x ava-ont -k 17 -w 17 --minlen 1000 --maxhan1 5000 /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns11/cns.fasta /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns21/cns.fasta -o cns.filt.dovt.ovl; )
+ /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jing/NextDenovo/bin/minimap2-nd -I 6G --step 2 --dual=yes -t 38 -x ava-ont -k 17 -w 17 --minlen 1000 --maxhan1 5000 /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns11/cns.fasta /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns21/cns.fasta -o cns.filt.dovt.ovl
[M::mm_idx_gen::66.003*0.37] collected minimizers
[M::mm_idx_gen::81.384*0.43] sorted minimizers
[M::main::81.384*0.43] loaded/built the index for 10477 target sequence(s)
[M::mm_mapopt_update::84.031*0.43] mid_occ = 789
[M::mm_idx_stat] kmer size: 17; skip: 17; is_hpc: 0; #seq: 10477
[M::mm_idx_stat::86.239*0.42] distinct minimizers: 26885347 (73.51% are singletons); average occurrences: 2.986; average spacing: 7.707
[M::worker_pipeline::102578.686*7.87] mapped 7330 sequences
[M::worker_pipeline::157887.521*7.71] mapped 3117 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jing/NextDenovo/bin/minimap2-nd -I 6G --step 2 --dual=yes -t 38 -x ava-ont -k 17 -w 17 --minlen 1000 --maxhan1 5000 -o cns.filt.dovt.ovl /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns11/cns.fasta /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns21/cns.fasta
[M::main] Real time: 157888.311 sec; CPU: 1217345.089 sec; Peak RSS: 23.071 GB

real 2631m30.862s
user 20283m58.920s
sys 5m6.807s
touch /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/02.cns_align.sh.work/cns_align0/cns_align296/nextDenovo_hagfish_jirou.sh.done
+ touch /disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/02.cns_align.sh.work/cns_align0/cns_align296/nextDenovo_hagfish_jirou.sh.done

I found that most of the time was spent on this step (/disk2/dz/08.hagish_lamprey/02.nextdenovo/hagfish_jirou/01_rundir/02.cns_align/02.cns_align.sh.work), and it's still not completed after approximately three weeks.