smarco / WFA2-lib

WFA-lib: Wavefront alignment algorithm library v2

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

EXC_BAD_ACCESS

armintoepfer opened this issue · comments

I'm running into an issue that I can't produce if I just give it one sequence pair...

wfa::WFAlignerGapAffine2Pieces aligner(4, 4, 2, 24, 1, wfa::WFAligner::Alignment, wfa::WFAligner::MemoryHigh);
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x000000010026c28b libwfa.2.1.0.dylib`wavefronts_backtrace_del2_ext(wf_aligner=0x0000000128808010, score=22455, k=-34) at wavefront_backtrace.c:184:20
   181 	  if (score < 0) return WAVEFRONT_OFFSET_NULL;
   182 	  wavefront_t* const d2wavefront = wf_aligner->wf_components.d2wavefronts[score];
   183 	  if (d2wavefront != NULL &&
-> 184 	      d2wavefront->lo <= k+1 &&
   185 	      k+1 <= d2wavefront->hi) {
   186 	    return BACKTRACE_PIGGYBACK_SET(d2wavefront->offsets[k+1],backtrace_D2_ext);
   187 	  } else {

ASAN/UBSAN gives something else...

../subprojects/wfa/wavefront/wavefront_extend.c:116:20: runtime error: load of misaligned address 0x00010ce54c33 for type 'uint64_t', which requires 8 byte alignment
0x00010ce54c33: note: pointer points here
 3f  3f 3f 3f 54 47 43 43 54  47 54 43 41 47 47 47 54  43 43 54 47 54 54 47 47  41 41 47 47 47 43 54
              ^
../subprojects/wfa/wavefront/wavefront_extend.c:116:38: runtime error: load of misaligned address 0x00010ce5784a for type 'uint64_t', which requires 8 byte alignment
0x00010ce5784a: note: pointer points here
 21 21  21 21 54 54 47 43 43 54  47 54 43 41 47 47 47 54  43 43 54 47 54 47 47 41  41 47 47 47 43 41
              ^
../subprojects/wfa/wavefront/wavefront_extend.c:124:13: runtime error: load of misaligned address 0x00010d82ff8a for type 'uint64_t', which requires 8 byte alignment
0x00010d82ff8a: note: pointer points here
 47 54  43 41 47 47 47 54 43 43  54 47 54 47 47 41 41 47  47 47 43 54 47 54 41 41  54 41 47 41 47 47
              ^
../subprojects/wfa/wavefront/wavefront_extend.c:124:31: runtime error: load of misaligned address 0x00010d8305e4 for type 'uint64_t', which requires 8 byte alignment
0x00010d8305e4: note: pointer points here
  47 54 43 41 47 47 47 54  43 43 54 47 54 47 47 41  41 47 47 47 43 41 54 54  54 43 41 54 41 47 47 47

You can try to reproduce with https://github.com/armintoepfer/clr-align-challenge and then

lldb -- ./cas ../data/long.txt

It's likely UB. Another phenotype is aborting with

[WFA::Backtrace] Wrong type trace.2

Another UB hit

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../subprojects/wfa/wavefront/wavefront_extend.c:116:38 in
../subprojects/wfa/system/mm_allocator.c:400:66: runtime error: addition of unsigned offset to 0x632000000800 overflowed to 0x6320000007f8

@ekg do you have a code snippet to reproduce dual affine-gap in C?

Okay, I added it https://github.com/armintoepfer/aligner-testbed/blob/main/src/main.cpp#L168-L215

Anything obviously broken during my copy/paste?

Add attributes.heuristic.strategy = wf_heuristic_none; if you want to compute the optimal/exact alignment (no heuristics).

I can see you already set

wavefront_aligner_set_heuristic_none(wf_aligner);

We have successfully executed the code on a 2017-MacBook Air (Intel i5-5350U) running a Monterey 12.3.1.

$> ./at ../data/long.txt --miniwfa=false --wfa2=true --ksw2=false
| 20220420 16:53:37.514 | INFO | Number of sequence pairs : 2301
| 20220420 16:53:38.368 | INFO | WFA2 time 370us 648ns 
$> ./at ../data/long.txt --miniwfa=false --wfa2=false --ksw2=true 
| 20220420 16:57:01.806 | INFO | Number of sequence pairs : 2301
| 20220420 16:57:12.184 | INFO | KSW2 time 4ms 509us 

The problem you are experiencing seems to be related to unaligned memory accesses during the extend()/LCP() computation. We optimize this function by comparing input-sequence blocks of 64-bits (8 characters) at a time. This optimization requires unaligned memory access. To be able to help you better, Can you let us know the machine/core you are using to run the benchmark?

In any case, make sure that compiling native you are executing the binaries in the same platform they were compiled for. If that is not the case, you will have to compile WFA2-lib forbidding ARM unaligned memory access -mno-unaligned-access, but it will have a penalty on performance.

Lastly, note that short executions might not be representative. See a flame-graph on the WFA2-lib execution for short-sequences where most of the time is invested in the initial allocation and final deallocation.

pb_challenge_wfa2

Also, note that WFA is running in exact mode here. You could even obtain better performance using adaptive mode.

Let us know.
Cheers,

First of all, great to hear that you could run it.

I'm using a standard x86 i7 in my iMac with the latest apple clang and gcc11. The issue is independent of march. Can you try running with multiple rounds? Maybe under a debugger?

It does not happen with the C API directly.

The C API call is also slower than the C++ version. Any idea?

The way we map and align is similar to the minimap2 approach. Alignment of very short sequences has been working great so far. Do you think alignment of full 20kb vs 20kb CLR with WFA will be faster than first mapping, cutting into small regions, and then alignment? I can obviously try, but maybe you have done that study already.

Food for thought, I've added data/clr1.txt that contains one pair of two full-length subreads

$ ./at ../data/clr1.txt
| 20220420 19:05:25.732 | INFO | Number of sequence pairs : 1
| 20220420 19:05:26.902 | INFO | miniwfa time  : 1s 169ms
| 20220420 19:05:30.368 | INFO | WFA2 C time   : 3s 465ms
| 20220420 19:05:30.505 | INFO | WFA2 C++ time : 137ms 217us
| 20220420 19:05:30.528 | INFO | KSW2 time     : 22ms 675us

Ok, I've tried on an Intel i7-6500U (Ubuntu 18.04) and I couldn't reproduce the unaligned memory problem.
But I can elaborate on the use-cases (starting with short-seqs):

short alg

These are, indeed, short sequences and both KSW2 and WFA perform pretty fast:

=> KSW2
| 20220420 19:50:51.040 | INFO | Number of sequence pairs : 24670
| 20220420 19:50:52.766 | INFO | KSW2 time : 69us 969ns

=> Exact WFA
| 20220420 19:48:48.163 | INFO | Number of sequence pairs : 24670
| 20220420 19:48:58.666 | INFO | WFA2 C time : 425us 734ns

In all cases, the measurements are really small. I have profiled the case of WFA and we spend a substantial amount doing bookkeeping (e.g., reaping internal buffers). I guess we could do better if we focus on these cases. But, for the time being, for these short sequences, KSW2 has the upper hand against the exact-WFA (being the execution times so small).

Note that, comparing CIGARs (using the penalties you provided), for 77.1% of the pairs, WFA returns a better score/CIGAR. I'm not aware of the band size used for KSW2. But this aspect might be interesting to explore (and how suboptimal alignments might affect the results of the downstream analyses). Perhaps, it's not relevant to get the exact optimal in these cases.

For the long:

long alg

I refer to the previous results.

$> ./at ../data/long.txt --miniwfa=false --wfa2=true --ksw2=false
| 20220420 16:53:37.514 | INFO | Number of sequence pairs : 2301
| 20220420 16:53:38.368 | INFO | WFA2 time 370us 648ns 
$> ./at ../data/long.txt --miniwfa=false --wfa2=false --ksw2=true 
| 20220420 16:57:01.806 | INFO | Number of sequence pairs : 2301
| 20220420 16:57:12.184 | INFO | KSW2 time 4ms 509us 

I believe that the newest biWFA could do even better. We could also check how close to the optimal KSW2 cigars are.

Then, for the clr1:

We have 2 sequences of length 18779 and 18956, aligning at edit distance 3645 (e~19%). Seems that there are no big indels, but the error is distributed along with the sequences.

| 20220420 22:19:21.555 | INFO | Number of sequence pairs : 1
| 20220420 22:19:21.590 | INFO | KSW2 time : 52ms 313us

| 20220420 22:20:18.136 | INFO | Number of sequence pairs : 1
| 20220420 22:20:18.490 | INFO | WFA2 C time : 354ms 531us

Compared to the exact-WFA, KSW2 does a pretty good job and returns the correct/optimal alignment. Considering this case in particular, the exact-WFA is forced to explore a lot of the DP-matrix:

image

Meanwhile, using the adaptive mode:

wavefront_aligner_set_heuristic_wfadaptive(wf_aligner,10,50,1);

./at ../data/clr1.txt --miniwfa=false --wfa2-c=true --wfa2-cpp=false --ksw2=false --rounds 20
| 20220420 22:47:13.780 | INFO | Number of sequence pairs : 1
| 20220420 22:47:14.762 | INFO | WFA2 C time : 49ms 92us

image

This is a good example of a sequence that is not particularly favourable to the WFA. In any case, comparable time using heuristics (I guess that in the playground of heuristics we could tune it and do better, as KSW2 could too) and 6x slower calculating the optimal CIGAR.

I think we can take it from here and optimize those cases of your interest.