Nextomics / NextPolish

Fast and accurate polishing of genomes generated from long reads.

nextpolish2.py never stops

yangfangyuan0102 opened this issue

Describe the bug
Hi, Dear Author,
After all ONT reads were mapped, I ran nextpolish2.py, but the run never finishes, even after all CPUs and memory have gone idle.
It does output some polished contigs, but the output is incomplete; it looks like it gets stuck on one contig. This is probably not a problem with my genome assembly, as I have tried several genomes.
I'm sure my steps are fine, because the same workflow ran without problems on my previous machine (AMD 5900X CPU).
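
For reference, my steps roughly follow the documented long-read polishing workflow; this is just a sketch, with placeholder file names standing in for my actual inputs:

    # map raw ONT reads to the assembly, then sort and index the alignments
    minimap2 -ax map-ont genome.fa ont_reads.fq.gz | samtools sort -m 2g --threads 20 -o lgs.sort.bam
    samtools index lgs.sort.bam
    # list the absolute BAM path in a file-of-filenames and run the polishing script
    ls `pwd`/lgs.sort.bam > lgs.sort.bam.fofn
    nextpolish2.py -g genome.fa -l lgs.sort.bam.fofn -r ont -p 20 -o genome.nextpolish.fa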

Operating system
Ubuntu 22.10
CPU: Intel 13900K. This CPU's cores are heterogeneous (performance and efficiency cores); does that matter?

GCC
I reinstalled NextPolish 1.4.1 using conda.
Python 3.10

Thanks very much

Try to kill the main task, and rerun.

Hi, Dr. Hu,
Thanks for your quick reply. Rerunning nextpolish2.py skips the already corrected contigs, but it gets stuck at the same place as before; the task goes to "sleep" again. The modification time of the output file is refreshed, but the file size does not change.

Maybe you have encountered a bug. Could you extract the unfinished sequence and its corresponding BAM file and send them to me, so I can reproduce this bug and fix it?
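
For example, something like this should work (the contig name here is a placeholder for whichever contig gets stuck):

    # pull the stuck contig out of the assembly
    samtools faidx genome.fa ctg000001 > stuck_ctg.fa
    # keep only the alignments overlapping that contig, then index the subset
    samtools view -b lgs.sort.bam ctg000001 > stuck_ctg.bam
    samtools index stuck_ctg.bam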

The ONT reads were first corrected with Illumina reads; I used these corrected reads as input for NextDenovo and for polishing. Could that be the cause?

Please download these files within 7 days:
Link: https://pan.baidu.com/s/1xKnUyq-2tmtRskofy-4u3w

Hi,

I may have the same problem with nextpolish2.py. I made my own custom alignment using minimap2 and used samtools to filter and index the BAM. I'm working with 32 cores, trying to polish a huge 26 Gb genome using ONT reads, with a window size of 100M:

    # write the absolute BAM path into a file-of-filenames
    ls `pwd`/${bam_file} > ${bam_file}.fofn
    # polish with the ONT alignments; CPU count and window size are templated in
    nextpolish2.py -g ${genome_file} -l ${bam_file}.fofn -r ont \\
        -p ${task.cpus} -w ${window_size}M -sp False -o ${file_name}.nextpolish_polished.no-splitting.fa

The process outputs a 20 Gb FASTA file after 4 hours, but keeps running for up to 12 hours, and the size of the output never changes.

[70014 INFO] 2023-06-28 11:35:06 Corrected step options:
[70014 INFO] 2023-06-28 11:35:06 
split:                        0
auto:                         True
block:                        None
process:                      32
read_type:                    1
block_index:                  all
uppercase:                    False
window:                       100000000
alignment_score_ratio:        0.8
alignment_identity_ratio:     0.8
bam_list:                     purgedClipped.ntLink_round1_k56.w1000.a1.gap_fill.sorted.merged.bam.fofn
genome:                       Amex6.0-purgedClipped_contigs.k56.w1000.z1000.ntLink_round1_k56.w1000.a1.gap_fill.fa
out:                          Amex6.0-purgedClipped_contigs.k56.w1000.z1000.ntLink_round1_k56.w1000.a1.gap_fill.nextpolish_polished.no-splitting.fa

Additionally, the log has this warning repeated many times:

[70014 WARNING] 2023-06-28 11:37:14 Adjust -p from 32 to 32, -w from 100000000 to 5000000, logical CPUs:152, available RAM:~1416G, use -a to disable automatic adjustment.
[109706 INFO] 2023-06-28 11:41:34 Start a corrected worker in 109706 from parent 70014
python: ctg_cns.c:2787: find_sup_alns: Assertion `i != sup_aln->i' failed.
python: ctg_cns.c:2787: find_sup_alns: Assertion `i != sup_aln->i' failed.
[W::hts_idx_load2] The index file is older than the data file: /scratch-cbe/users/salvador.gonzales/1_AmexGenomeAnnotation/0_AmexGenomeUpgrade/2_HiC_Scaffolding/97/ffdd9a7e1b0ee1aa080c55b2a6353c/purgedClipped.ntLink_round1_k56.w1000.a1.gap_fill.sorted.merged.bam.csi
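
In case the stale-index warning matters, I suppose I can regenerate the CSI index so it is newer than the BAM (the path is abbreviated here):

    # rebuild the CSI index for the merged BAM; -c writes a .csi instead of a .bai
    samtools index -c purgedClipped.ntLink_round1_k56.w1000.a1.gap_fill.sorted.merged.bam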

@yangfangyuan0102 were you able to solve the issue?

Kind regards,
Salvador

@SalvadorGJ

Hi, I still don't know how this happened. My tips: make sure the genome was assembled from the same raw ONT reads you use for polishing; that is, neither the genome nor the ONT reads should have extra modifications before polishing. You also need to ensure the raw ONT reads are correctly mapped to the genome, e.g., using minimap2 -x map-ont. I'm guessing that extra processing of the BAM files can also cause problems, since you mentioned that you "filter the BAM".
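
For example, a minimal mapping without any extra filtering could look like this (file names are placeholders):

    # map raw ONT reads with the map-ont preset, then sort and index, with no filtering in between
    minimap2 -ax map-ont genome.fa raw_ont_reads.fq.gz | samtools sort --threads 20 -o ont.sort.bam
    samtools index ont.sort.bam
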
Best wishes.