smarco / WFA2-lib

WFA-lib: Wavefront alignment algorithm library v2

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Segmentation Fault with memory_mode != high

h-2 opened this issue · comments

I get a segmentation fault on the current development branch with this code:

extern "C" {
#include <wavefront/wavefront_align.h>
}

#include <string>
#include <fstream>

void do_align(std::string_view const pattern, std::string_view const ref)
{
  // Configure alignment attributes
  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.alignment_scope = compute_alignment;
  attributes.distance_metric = gap_affine;
  attributes.affine_penalties.mismatch = 3;
  attributes.affine_penalties.gap_opening = 5;
  attributes.affine_penalties.gap_extension = 1;
  attributes.alignment_form.span = alignment_endsfree;
  attributes.alignment_form.pattern_begin_free = 0;
  attributes.alignment_form.pattern_end_free = 0;
  attributes.alignment_form.text_begin_free = 1;
  attributes.alignment_form.text_end_free = 1;
  attributes.heuristic.strategy = wf_heuristic_wfadaptive;
  attributes.heuristic.min_wavefront_length = 10;
  attributes.heuristic.max_distance_threshold = 50;
  attributes.heuristic.steps_between_cutoffs = 1;
  attributes.memory_mode = wavefront_memory_high; // <---- change me to low or med

  wavefront_aligner_t * const wf_aligner = wavefront_aligner_new(&attributes);

  int res = wavefront_align(wf_aligner, pattern.data(), pattern.size(), ref.data(), ref.size());

  cigar_print_pretty(stderr, pattern.data(), pattern.size(), ref.data(), ref.size(),
                     wf_aligner->cigar, wf_aligner->mm_allocator);
  fprintf(stderr,"Alignment Score %d\nResult:%d\n", wf_aligner->cigar->score, res);

  assert(res != -1);
  wavefront_aligner_delete(wf_aligner);
}

int main()
{
    std::ifstream qrystr{"qry.txt"};
    std::string qry{std::istreambuf_iterator<char>(qrystr),
                    std::istreambuf_iterator<char>()};
    std::ifstream refstr{"ref.txt"};
    std::string ref{std::istreambuf_iterator<char>(refstr),
                    std::istreambuf_iterator<char>()};

    do_align(qry, ref);
}

Files:
https://hauswedell.net/tmp/ref.txt
https://hauswedell.net/tmp/qry.txt

edit: also on the main branch

Hi,

You want to align 2 sequences of ~250Kbp each (alignment-score=-384231). This is a challenging alignment even using heuristics (due to memory constraints, hence the segmentation fault). You must use the BiWFA algorithm.

  attributes.alignment_form.span = alignment_end2end;
  attributes.alignment_form.pattern_begin_free = 0;
  attributes.alignment_form.pattern_end_free = 0;
  attributes.alignment_form.text_begin_free = 0;
  attributes.alignment_form.text_end_free = 0;
  attributes.memory_mode = wavefront_memory_ultralow; // <- IMPORTANT

Note that I still haven't implemented the endsfree mode on the BiWFA. I am halfway done with that.

Thanks for your quick reply!

You want to align 2 sequences of ~250Kbp each (alignment-score=-384231). This is a challenging alignment even using heuristics

My thought was that since it is an "extend-alignment", it will probably not go that far anyway..?

(due to memory constraints, hence the segmentation fault)

When I run out of memory in memory_mode::high, the application either throws an alloc_error (if it couldn't allocate), or it is killed by the operating system shortly after the alloc. But this does look different based on the error.

Also, I am a bit confused: to my understanding memory_mode::high means something like "you can use more memory if it means being faster". And memory_mode::low (or med) mean "use less memory (or as little as possible)". Doesn't that mean that anything that succeeds in memory_mode::high should also succeed (albeit slower) in lower memory modes?

edit: I know for certain that some alignments fail/crash on my computer in memory_mode::high but succeed in memory_mode::med . If there are other alignments (like the one above) that succeed in ::high but fail in ::med... how shall I decide on which to use? 😆

I see your point. But if I have two sequences of 250Kbp, even if you let me stop whenever I reach an end (i.e., ends-free), I need to reach a 250K end (horizontally or vertically; query or pattern). Otherwise, I need to know what condition tells me to stop. If you can narrow down what you have in mind, I'm sure we can ping-point what you want.

Also, I am a bit confused: to my understanding memory_mode::high means something like "you can use more memory if it means being faster". And memory_mode::low (or med) mean "use less memory (or as little as possible)". Doesn't that mean that anything that succeeds in memory_mode::high should also succeed (albeit slower) in lower memory modes?

Ok, some background on the WFA. The first WFA (high) was thought to align relatively short sequences (up to 1K bases). Then people started to use it with longer sequences (up to 10K-20K), and the med and low strategies were implemented to save some memory at the cost of going slightly slower. Then, people wanted to align 250K or even 2MB sequences, so we developed the BiWFA (ultralow) that actually runs in linear memory O(s) and has the same time complexity as the original WFA O(ns). The algorithm is very cool and, in practice, is almost as fast as the original WFA (sometimes even faster).

So, BiWFA uses very little memory and is very fast. Actually, med and low strategies have little purpose now that the BiWFA exist and the original WFA (high) it is only useful for short sequences (up to ~1K bases).

TL;DR Use BiWFA if you are aligning large sequences.