Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads

segmentation fault after ctg_graph was done

HippoYI opened this issue · comments

Describe the bug
I am running an assembly of a genome of about 300 Mb (0.6% heterozygosity rate) on a 512 GB machine. The ultra-long read data set is about 27X.

Error message
The program ran well and produced nd.asm.p.fasta after running ctg_graph, but then it stopped and reported a segmentation fault (core dumped). This means the program never ran "02.ctg_align" and "03.ctg_cns". I have tried many parameters in run.cfg and even switched to a machine with 2 TB of memory, but the error still occurred at the same point.

Input data
Total base count=8358015912bp, sequencing depth=27X, average/N50 read length=100709
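
As a quick sanity check on the figures above, depth can be recomputed as total bases divided by genome size. This is only a sketch: the function name sequencing_depth is made up for illustration, and the genome size is taken from the genome_size = 300m setting in the config.

```python
def sequencing_depth(total_bases, genome_size):
    """Average depth = total sequenced bases / estimated genome size."""
    return total_bases / genome_size


# Total base count from the report; genome size assumed to be 300 Mb.
depth = sequencing_depth(8_358_015_912, 300_000_000)
print(f"{depth:.1f}X")
```

The result falls between 27X and 28X, consistent with the reported depth.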

Config file
[General]
job_type = local
job_prefix = nextDenovo
task = all
rewrite = yes
deltmp = yes
parallel_jobs = 2
input_type = raw
read_type = ont
input_fofn = input.fofn
workdir = 01_rundir

[correct_option]
read_cutoff = 1k
genome_size = 300m
sort_options = -m 40g -t 5
minimap2_options_raw = -t 5
pa_correction = 5
correction_options = -p 4

[assemble_option]
minimap2_options_cns = -t 5
nextgraph_options = -a 1 -q 10

Operating system
CentOS Linux release 7.9.2009

GCC

Python
Python 2.7.5 and Python 3.6.2

NextDenovo
2.5.0

The FAQ mentions that nd.asm.p.fasta contains more structural and base errors than nd.asm.fasta, so I really want to solve this. Any ideas or suggestions on how to fix this problem?

Thank you!

Could you share the failed subtask log here?

I posted the running log and the *.e file from the "ctg_graph1" directory, which points to the last and failed subtask. I am not sure that's what you need. If not, please let me know.
nextDenovo.sh.e.txt
pid6864.log.txt

See the instructions below:
Error message
Paste the complete log message, including the main task log and the failed subtask log.
The main task log is usually located in your working directory and is named pidXXX.log.info. Its last few lines will tell you where the failed subtask log is, such as:

[ERROR] 2020-07-01 11:06:57,184 cns_align failed: please check the following logs:
[ERROR] 2020-07-01 11:06:57,185 ~/NextDenovo/test_data/01_rundir/02.cns_align/02.cns_align.sh.work/cns_align0/nextDenovo.sh.e
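
If the main log is long, the failed subtask logs can be pulled out with a few lines of Python. This is a rough sketch, not part of NextDenovo: the helper name failed_subtask_logs is hypothetical, and it assumes the [ERROR] lines follow the "[ERROR] date time message" layout shown in the example above (other NextDenovo versions may format log lines differently).

```python
import re

# Assumed layout: "[ERROR] <date> <time> <message>", as in the example above.
ERROR_LINE = re.compile(r"^\[ERROR\]\s+\S+\s+\S+\s+(.*)$")


def failed_subtask_logs(log_text):
    """Return the messages on [ERROR] lines that look like subtask .sh.e paths."""
    paths = []
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            message = match.group(1).strip()
            if message.endswith(".sh.e"):
                paths.append(message)
    return paths


sample = (
    "[ERROR] 2020-07-01 11:06:57,184 cns_align failed: please check the following logs:\n"
    "[ERROR] 2020-07-01 11:06:57,185 ~/NextDenovo/test_data/01_rundir/02.cns_align/"
    "02.cns_align.sh.work/cns_align0/nextDenovo.sh.e\n"
)
for path in failed_subtask_logs(sample):
    print(path)
```

To use it on a real run, read the pidXXX.log.info file into a string and pass it to the function.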

As I didn't save the screen output last time, I reran the program over the last 2 days. As you can see in "snapshot.jpg", the subtask did not give any error message, just "Segmentation fault (core dumped)" after ctg_graph was done.

[Attached image: snapshot.jpg]

Hi,
Actually, you don't have to rerun the whole process; just see here to continue running unfinished tasks.

For the segmentation fault, I guess this is caused by the calgs function in the file lib/kit.py, so you can replace this function with the following Python code:

def calgs(infile):
	from Bio import SeqIO
	gs = 0
	for seq_record in SeqIO.parse(infile, "fasta"):
		gs += len(seq_record.seq)
	return gs
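
The function above needs Biopython. If it is not installed, a dependency-free variant along these lines may work instead. This is only a sketch under the assumption that calgs just needs the total sequence length of an uncompressed FASTA file; the name calgs_plain is hypothetical, and it would have to be renamed to calgs to act as a drop-in replacement in lib/kit.py.

```python
def calgs_plain(infile):
    """Sum the lengths of all sequence lines in an uncompressed FASTA file."""
    gs = 0
    with open(infile) as handle:
        for line in handle:
            # Header lines start with ">" and carry no sequence.
            if not line.startswith(">"):
                gs += len(line.strip())
    return gs
```

Like the Biopython version, it reads the file as plain text, so a gzip-compressed input would need to be decompressed first.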

Hi, I replaced the calgs function in kit.py and got this output:

[56473 INFO] 2022-09-07 15:27:58 skip step: db_split
[56473 INFO] 2022-09-07 15:27:58 skip step: raw_align
[56473 INFO] 2022-09-07 15:27:58 skip step: sort_align
[56473 INFO] 2022-09-07 15:27:58 skip step: seed_cns
[56473 INFO] 2022-09-07 15:27:58 seed_cns finished, and final corrected reads file:
[56473 INFO] 2022-09-07 15:27:58 /data/yixin/projects/JH_genome_analysis/New_genome_assembly_related/NextD-assembly/./01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns*/cns.fasta
[56473 INFO] 2022-09-07 15:27:58 skip step: cns_align
[56473 INFO] 2022-09-07 15:27:58 skip step: ctg_graph
Segmentation fault (core dumped)

OK. Next, try changing these two lines in the file nextDenovo. Change

total_seed_len = cal_total_seed_len(get_seed_files(idx=True))

to

total_seed_len = 1000

and change

minlen = cal_minlen_from_idx(part_idx_files, len(part_idx_files), gs * mindepth - total_seed_len)

to

minlen = 2000

Wow, great! It worked after changing those two lines, and now I can finally get nd.asm.fasta. I am just curious about the changes: will fixing the total seed length at 1000 affect the final contig correction?

For your data, it should not.

Thanks so much. I really appreciate your help in resolving this!