EXC_BAD_ACCESS

Question

armintoepfer opened this issue 2 years ago · comments

I'm running into an issue that I can't reproduce if I just give it one sequence pair...

wfa::WFAlignerGapAffine2Pieces aligner(4, 4, 2, 24, 1, wfa::WFAligner::Alignment, wfa::WFAligner::MemoryHigh);

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x000000010026c28b libwfa.2.1.0.dylib`wavefronts_backtrace_del2_ext(wf_aligner=0x0000000128808010, score=22455, k=-34) at wavefront_backtrace.c:184:20
   181 	  if (score < 0) return WAVEFRONT_OFFSET_NULL;
   182 	  wavefront_t* const d2wavefront = wf_aligner->wf_components.d2wavefronts[score];
   183 	  if (d2wavefront != NULL &&
-> 184 	      d2wavefront->lo <= k+1 &&
   185 	      k+1 <= d2wavefront->hi) {
   186 	    return BACKTRACE_PIGGYBACK_SET(d2wavefront->offsets[k+1],backtrace_D2_ext);
   187 	  } else {

ASAN/UBSAN gives something else...

../subprojects/wfa/wavefront/wavefront_extend.c:116:20: runtime error: load of misaligned address 0x00010ce54c33 for type 'uint64_t', which requires 8 byte alignment
0x00010ce54c33: note: pointer points here
 3f  3f 3f 3f 54 47 43 43 54  47 54 43 41 47 47 47 54  43 43 54 47 54 54 47 47  41 41 47 47 47 43 54
              ^
../subprojects/wfa/wavefront/wavefront_extend.c:116:38: runtime error: load of misaligned address 0x00010ce5784a for type 'uint64_t', which requires 8 byte alignment
0x00010ce5784a: note: pointer points here
 21 21  21 21 54 54 47 43 43 54  47 54 43 41 47 47 47 54  43 43 54 47 54 47 47 41  41 47 47 47 43 41
              ^
../subprojects/wfa/wavefront/wavefront_extend.c:124:13: runtime error: load of misaligned address 0x00010d82ff8a for type 'uint64_t', which requires 8 byte alignment
0x00010d82ff8a: note: pointer points here
 47 54  43 41 47 47 47 54 43 43  54 47 54 47 47 41 41 47  47 47 43 54 47 54 41 41  54 41 47 41 47 47
              ^
../subprojects/wfa/wavefront/wavefront_extend.c:124:31: runtime error: load of misaligned address 0x00010d8305e4 for type 'uint64_t', which requires 8 byte alignment
0x00010d8305e4: note: pointer points here
  47 54 43 41 47 47 47 54  43 43 54 47 54 47 47 41  41 47 47 47 43 41 54 54  54 43 41 54 41 47 47 47

You can try to reproduce with https://github.com/armintoepfer/clr-align-challenge and then

lldb -- ./cas ../data/long.txt

Armin Töpfer · Answer 1 · Wed Apr 20 2022 19:56:31 GMT+0800 (China Standard Time)

It's likely undefined behavior (UB). Another phenotype is aborting with

[WFA::Backtrace] Wrong type trace.2

Erik Garrison · Answer 2 · Wed Apr 20 2022 20:00:24 GMT+0800 (China Standard Time)

I had some difficulty with the C++ API. In my case the key issue seemed to be the initialization of the reduction parameter. For the time being I'm using the C API and setting the reduction explicitly.


Armin Töpfer · Answer 3 · Wed Apr 20 2022 20:07:42 GMT+0800 (China Standard Time)

Another UB hit

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../subprojects/wfa/wavefront/wavefront_extend.c:116:38 in
../subprojects/wfa/system/mm_allocator.c:400:66: runtime error: addition of unsigned offset to 0x632000000800 overflowed to 0x6320000007f8

@ekg do you have a code snippet to reproduce dual affine-gap in C?

Erik Garrison · Answer 4 · Wed Apr 20 2022 21:14:26 GMT+0800 (China Standard Time)

From here: https://github.com/vcflib/vcflib/blob/master/src/Variant.cpp#L2158

Let us know if this fixes what you're seeing.


Armin Töpfer · Answer 5 · Wed Apr 20 2022 23:37:09 GMT+0800 (China Standard Time)

Okay, I added it https://github.com/armintoepfer/aligner-testbed/blob/main/src/main.cpp#L168-L215

Anything obviously broken during my copy/paste?

Santiago Marco-Sola · Answer 6 · Wed Apr 20 2022 23:57:30 GMT+0800 (China Standard Time)

~~Add attributes.heuristic.strategy = wf_heuristic_none; if you want to compute the optimal/exact alignment (no heuristics).~~

I can see you already set

wavefront_aligner_set_heuristic_none(wf_aligner);

Santiago Marco-Sola · Answer 7 · Thu Apr 21 2022 01:06:53 GMT+0800 (China Standard Time)

We have successfully executed the code on a 2017 MacBook Air (Intel i5-5350U) running macOS Monterey 12.3.1.

$> ./at ../data/long.txt --miniwfa=false --wfa2=true --ksw2=false
| 20220420 16:53:37.514 | INFO | Number of sequence pairs : 2301
| 20220420 16:53:38.368 | INFO | WFA2 time 370us 648ns 
$> ./at ../data/long.txt --miniwfa=false --wfa2=false --ksw2=true 
| 20220420 16:57:01.806 | INFO | Number of sequence pairs : 2301
| 20220420 16:57:12.184 | INFO | KSW2 time 4ms 509us

The problem you are experiencing seems to be related to unaligned memory accesses during the extend()/LCP() computation. We optimize this function by comparing blocks of 64 bits (8 characters) of the input sequences at a time. This optimization requires unaligned memory access. To help you better, can you let us know the machine/core you are using to run the benchmark?
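
The block-wise comparison described above can be sketched as follows. This is a minimal illustration, not WFA2-lib's actual extend code: the names `load_u64` and `common_prefix_len` are hypothetical. Each 8-character block is read through `memcpy`, which has no alignment requirement, so there is no misaligned `uint64_t` load for the sanitizer to report. Compilers commonly turn such a fixed-size `memcpy` into a single load on x86-64, but that is an optimization and not a guarantee.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 8 bytes from any address, aligned or not. */
uint64_t load_u64(const char *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: length of the common prefix of a[0..n) and b[0..n). */
size_t common_prefix_len(const char *a, const char *b, size_t n) {
    size_t i = 0;
    /* Compare 8 characters per iteration while a full block remains. */
    while (i + 8 <= n && load_u64(a + i) == load_u64(b + i)) {
        i += 8;
    }
    /* Finish byte by byte: this locates the mismatch inside the first
     * differing block and covers a tail shorter than 8 characters. */
    while (i < n && a[i] == b[i]) {
        i++;
    }
    return i;
}
```

For example, `common_prefix_len("ACGTACGTACGT", "ACGTACGTACGA", 12)` returns 11, wherever the two buffers start in memory.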

In any case, if you compile with native optimizations, make sure you run the binaries on the same platform they were compiled for. If that is not the case, you will have to compile WFA2-lib with ARM unaligned memory access disabled (-mno-unaligned-access), but this comes with a performance penalty.

Lastly, note that short executions might not be representative. A flame graph of the WFA2-lib execution on short sequences shows that most of the time is spent in the initial allocation and final deallocation.

Also, note that WFA is running in exact mode here. You could even obtain better performance using adaptive mode.

Let us know.
Cheers,

Armin Töpfer · Answer 8 · Thu Apr 21 2022 01:32:35 GMT+0800 (China Standard Time)

First of all, great to hear that you could run it.

I'm using a standard x86 i7 in my iMac with the latest Apple clang and GCC 11. The issue is independent of -march. Can you try running with multiple rounds? Maybe under a debugger?

It does not happen with the C API directly.

The C API call is also slower than the C++ version. Any idea?

The way we map and align is similar to the minimap2 approach. Alignment of very short sequences has been working great so far. Do you think aligning full 20 kb vs 20 kb CLR reads with WFA will be faster than first mapping, cutting into small regions, and then aligning? I can obviously try, but maybe you have done that study already.

Armin Töpfer · Answer 9 · Thu Apr 21 2022 03:06:25 GMT+0800 (China Standard Time)

Food for thought: I've added data/clr1.txt, which contains a single pair of full-length subreads.

$ ./at ../data/clr1.txt
| 20220420 19:05:25.732 | INFO | Number of sequence pairs : 1
| 20220420 19:05:26.902 | INFO | miniwfa time  : 1s 169ms
| 20220420 19:05:30.368 | INFO | WFA2 C time   : 3s 465ms
| 20220420 19:05:30.505 | INFO | WFA2 C++ time : 137ms 217us
| 20220420 19:05:30.528 | INFO | KSW2 time     : 22ms 675us

Santiago Marco-Sola · Answer 10 · Thu Apr 21 2022 05:55:37 GMT+0800 (China Standard Time)

Ok, I've tried on an Intel i7-6500U (Ubuntu 18.04) and I couldn't reproduce the unaligned memory problem.
But I can elaborate on the use-cases (starting with short-seqs):

These are, indeed, short sequences and both KSW2 and WFA perform pretty fast:

=> KSW2
| 20220420 19:50:51.040 | INFO | Number of sequence pairs : 24670
| 20220420 19:50:52.766 | INFO | KSW2 time : 69us 969ns

In all cases, the measured times are very small. I have profiled WFA in this case, and we spend a substantial amount of time on bookkeeping (e.g., reaping internal buffers). I guess we could do better if we focused on these cases. But, for the time being, KSW2 has the upper hand over exact WFA on these short sequences, bearing in mind that the execution times are so small.

Note that, comparing CIGARs (using the penalties you provided), WFA returns a better score/CIGAR for 77.1% of the pairs. I don't know the band size used for KSW2, but this aspect might be interesting to explore, along with how suboptimal alignments might affect the results of downstream analyses. Perhaps getting the exact optimum is not relevant in these cases.
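
To check which of two CIGARs is better under the dual affine-gap penalties, each one can be rescored independently of the aligner that produced it. The following is a minimal sketch under stated assumptions: the constructor arguments in the original report are taken to mean mismatch 4, gap_opening1 4, gap_extension1 2, gap_opening2 24, gap_extension2 1; a gap of length n is taken to cost the cheaper of the two affine pieces; and the CIGAR is given as one operation character per alignment column (M, X, I, D). The type `penalties2p` and both function names are hypothetical and not part of WFA2-lib or KSW2.

```c
#include <stddef.h>

/* Hypothetical container for gap-affine 2-piece penalties (match costs 0). */
typedef struct {
    int mismatch;
    int gap_opening1, gap_extension1;
    int gap_opening2, gap_extension2;
} penalties2p;

/* Cost of one gap of length len: the cheaper of the two affine pieces. */
int gap_cost(const penalties2p *p, int len) {
    int c1 = p->gap_opening1 + len * p->gap_extension1;
    int c2 = p->gap_opening2 + len * p->gap_extension2;
    return c1 < c2 ? c1 : c2;
}

/* Total penalty of an expanded CIGAR; lower is better.
 * Returns -1 if the string contains an unknown operation. */
int cigar_penalty(const penalties2p *p, const char *ops) {
    int total = 0;
    for (size_t i = 0; ops[i] != '\0'; ) {
        char op = ops[i];
        size_t run = 1;
        while (ops[i + run] == op) run++;   /* length of the current run */
        if (op == 'X') {
            total += (int)run * p->mismatch;
        } else if (op == 'I' || op == 'D') {
            total += gap_cost(p, (int)run); /* one gap per run of I or D */
        } else if (op != 'M') {
            return -1;
        }
        i += run;
    }
    return total;
}
```

With these penalties the second piece becomes cheaper for gaps longer than 20 characters (4 + 2n > 24 + n), so a 30-character gap costs 54 instead of 64. Rescoring both aligners' CIGARs this way puts them on the same scale before counting which one wins.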

Santiago Marco-Sola · Answer 11 · Thu Apr 21 2022 06:16:20 GMT+0800 (China Standard Time)

For the long sequences, I refer to the previous results:

$> ./at ../data/long.txt --miniwfa=false --wfa2=true --ksw2=false
| 20220420 16:53:37.514 | INFO | Number of sequence pairs : 2301
| 20220420 16:53:38.368 | INFO | WFA2 time 370us 648ns 
$> ./at ../data/long.txt --miniwfa=false --wfa2=false --ksw2=true 
| 20220420 16:57:01.806 | INFO | Number of sequence pairs : 2301
| 20220420 16:57:12.184 | INFO | KSW2 time 4ms 509us

I believe that the newest biWFA could do even better. We could also check how close to optimal the KSW2 CIGARs are.

Santiago Marco-Sola · Answer 12 · Thu Apr 21 2022 06:58:01 GMT+0800 (China Standard Time)

Then, for the clr1:

We have 2 sequences of length 18779 and 18956, aligning at edit distance 3645 (e ≈ 19%, since 3645/18956 ≈ 0.192). There seem to be no big indels; instead, the errors are distributed along the sequences.

Compared to exact WFA, KSW2 does a pretty good job and returns the correct/optimal alignment. In this particular case, exact WFA is forced to explore a large part of the DP matrix.

Meanwhile, using the adaptive mode:

wavefront_aligner_set_heuristic_wfadaptive(wf_aligner,10,50,1);

This is a good example of a sequence pair that is not particularly favourable to WFA. In any case, WFA takes comparable time when using heuristics (I guess we could tune the heuristics and do better, as KSW2 could too) and is 6x slower when calculating the optimal CIGAR.

I think we can take it from here and optimize those cases of your interest.