EXC_BAD_ACCESS
armintoepfer opened this issue · comments
I'm running into an issue that I can't reproduce if I give it just one sequence pair...
wfa::WFAlignerGapAffine2Pieces aligner(4, 4, 2, 24, 1, wfa::WFAligner::Alignment, wfa::WFAligner::MemoryHigh);
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
frame #0: 0x000000010026c28b libwfa.2.1.0.dylib`wavefronts_backtrace_del2_ext(wf_aligner=0x0000000128808010, score=22455, k=-34) at wavefront_backtrace.c:184:20
181 if (score < 0) return WAVEFRONT_OFFSET_NULL;
182 wavefront_t* const d2wavefront = wf_aligner->wf_components.d2wavefronts[score];
183 if (d2wavefront != NULL &&
-> 184 d2wavefront->lo <= k+1 &&
185 k+1 <= d2wavefront->hi) {
186 return BACKTRACE_PIGGYBACK_SET(d2wavefront->offsets[k+1],backtrace_D2_ext);
187 } else {
ASAN/UBSAN gives something else...
../subprojects/wfa/wavefront/wavefront_extend.c:116:20: runtime error: load of misaligned address 0x00010ce54c33 for type 'uint64_t', which requires 8 byte alignment
0x00010ce54c33: note: pointer points here
3f 3f 3f 3f 54 47 43 43 54 47 54 43 41 47 47 47 54 43 43 54 47 54 54 47 47 41 41 47 47 47 43 54
^
../subprojects/wfa/wavefront/wavefront_extend.c:116:38: runtime error: load of misaligned address 0x00010ce5784a for type 'uint64_t', which requires 8 byte alignment
0x00010ce5784a: note: pointer points here
21 21 21 21 54 54 47 43 43 54 47 54 43 41 47 47 47 54 43 43 54 47 54 47 47 41 41 47 47 47 43 41
^
../subprojects/wfa/wavefront/wavefront_extend.c:124:13: runtime error: load of misaligned address 0x00010d82ff8a for type 'uint64_t', which requires 8 byte alignment
0x00010d82ff8a: note: pointer points here
47 54 43 41 47 47 47 54 43 43 54 47 54 47 47 41 41 47 47 47 43 54 47 54 41 41 54 41 47 41 47 47
^
../subprojects/wfa/wavefront/wavefront_extend.c:124:31: runtime error: load of misaligned address 0x00010d8305e4 for type 'uint64_t', which requires 8 byte alignment
0x00010d8305e4: note: pointer points here
47 54 43 41 47 47 47 54 43 43 54 47 54 47 47 41 41 47 47 47 43 41 54 54 54 43 41 54 41 47 47 47
You can try to reproduce with https://github.com/armintoepfer/clr-align-challenge and then
lldb -- ./cas ../data/long.txt
It's likely UB. Another phenotype is aborting with
[WFA::Backtrace] Wrong type trace.2
Another UB hit
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../subprojects/wfa/wavefront/wavefront_extend.c:116:38 in
../subprojects/wfa/system/mm_allocator.c:400:66: runtime error: addition of unsigned offset to 0x632000000800 overflowed to 0x6320000007f8
@ekg do you have a code snippet to reproduce dual affine-gap in C?
Okay, I added it https://github.com/armintoepfer/aligner-testbed/blob/main/src/main.cpp#L168-L215
Anything obviously broken during my copy/paste?
Add attributes.heuristic.strategy = wf_heuristic_none;
if you want to compute the optimal/exact alignment (no heuristics).
I can see you already set
wavefront_aligner_set_heuristic_none(wf_aligner);
We have successfully executed the code on a 2017 MacBook Air (Intel i5-5350U) running macOS Monterey 12.3.1.
$> ./at ../data/long.txt --miniwfa=false --wfa2=true --ksw2=false
| 20220420 16:53:37.514 | INFO | Number of sequence pairs : 2301
| 20220420 16:53:38.368 | INFO | WFA2 time 370us 648ns
$> ./at ../data/long.txt --miniwfa=false --wfa2=false --ksw2=true
| 20220420 16:57:01.806 | INFO | Number of sequence pairs : 2301
| 20220420 16:57:12.184 | INFO | KSW2 time 4ms 509us
The problem you are experiencing seems to be related to unaligned memory accesses during the extend()/LCP() computation. We optimize this function by comparing input-sequence blocks of 64 bits (8 characters) at a time. This optimization requires unaligned memory access. To help you better, can you let us know the machine/core you are using to run the benchmark?
In any case, if you are compiling native, make sure that you are executing the binaries on the same platform they were compiled for. If that is not the case, you will have to compile WFA2-lib forbidding unaligned memory access (on ARM, -mno-unaligned-access), but it will have a penalty on performance.
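To make the block-wise comparison concrete, here is a minimal sketch in C. It is not the WFA2-lib code: the function name `lcp_blockwise` and its signature are hypothetical. It compares 8 characters at a time and reads each block with `memcpy` rather than through a cast `uint64_t*`, which is one common way to keep the same idea without the misaligned loads that UBSan reports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of WFA2-lib: returns the length of the
 * common prefix of a and b, inspecting at most n bytes of each. */
static size_t lcp_blockwise(const char* a, const char* b, size_t n) {
  assert(a != NULL && b != NULL);
  size_t i = 0;
  /* Compare 8 characters per iteration. memcpy into a local uint64_t is
   * well-defined for any address; compilers typically turn it into a
   * single (possibly unaligned) load on platforms that allow it. */
  while (i + 8 <= n) {
    uint64_t x, y;
    memcpy(&x, a + i, sizeof x);
    memcpy(&y, b + i, sizeof y);
    if (x != y) {
      /* The blocks differ, so a mismatch lies within the next 8 bytes.
       * Locate it byte by byte to stay independent of endianness. */
      size_t j = 0;
      while (a[i + j] == b[i + j]) ++j;
      return i + j;
    }
    i += 8;
  }
  /* Tail: fewer than 8 characters left. */
  while (i < n && a[i] == b[i]) ++i;
  return i;
}
```

For example, `lcp_blockwise("ACGTACGTAC", "ACGTACGTAG", 10)` returns 9. Whether this is as fast as a direct 64-bit load depends on the compiler and target, so it would need to be measured.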
Lastly, note that short executions might not be representative. A flame graph of the WFA2-lib execution for short sequences shows most of the time being spent in the initial allocation and final deallocation.
Also, note that WFA is running in exact mode here. You could obtain even better performance using adaptive mode.
Let us know.
Cheers,
First of all, great to hear that you could run it.
I'm using a standard x86 i7 in my iMac with the latest Apple clang and GCC 11. The issue is independent of march. Can you try running with multiple rounds? Maybe under a debugger?
It does not happen with the C API directly.
The C API call is also slower than the C++ version. Any idea?
The way we map and align is similar to the minimap2 approach. Alignment of very short sequences has been working great so far. Do you think aligning full 20kb vs 20kb CLR reads with WFA will be faster than first mapping, cutting into small regions, and then aligning? I can obviously try, but maybe you have done that study already.
Food for thought: I've added data/clr1.txt, which contains one pair of full-length subreads
$ ./at ../data/clr1.txt
| 20220420 19:05:25.732 | INFO | Number of sequence pairs : 1
| 20220420 19:05:26.902 | INFO | miniwfa time : 1s 169ms
| 20220420 19:05:30.368 | INFO | WFA2 C time : 3s 465ms
| 20220420 19:05:30.505 | INFO | WFA2 C++ time : 137ms 217us
| 20220420 19:05:30.528 | INFO | KSW2 time : 22ms 675us
Ok, I've tried on an Intel i7-6500U (Ubuntu 18.04) and I couldn't reproduce the unaligned memory problem.
But I can elaborate on the use-cases (starting with short-seqs):
These are, indeed, short sequences and both KSW2 and WFA perform pretty fast:
=> KSW2
| 20220420 19:50:51.040 | INFO | Number of sequence pairs : 24670
| 20220420 19:50:52.766 | INFO | KSW2 time : 69us 969ns
=> Exact WFA
| 20220420 19:48:48.163 | INFO | Number of sequence pairs : 24670
| 20220420 19:48:58.666 | INFO | WFA2 C time : 425us 734ns
In all cases, the measurements are really small. I have profiled WFA for this case, and we spend a substantial amount of time on bookkeeping (e.g., reaping internal buffers). I guess we could do better if we focused on these cases. But, for the time being, KSW2 has the upper hand over exact WFA for these short sequences (although the execution times are very small).
Note that, comparing CIGARs (using the penalties you provided), WFA returns a better score/CIGAR for 77.1% of the pairs. I don't know the band size used for KSW2. But this aspect might be interesting to explore (along with how suboptimal alignments might affect the results of the downstream analyses). Perhaps getting the exact optimum is not relevant in these cases.
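One way to compare two CIGARs under the same penalties is to score each one directly. The sketch below is a hypothetical helper, not WFA2-lib or KSW2 code. It assumes an uncompressed operation string (`M`, `X`, `I`, `D`), a match cost of 0, and the usual two-piece gap-affine cost where a gap of length l costs min(o1 + l*e1, o2 + l*e2). The names `penalties_2p_t`, `gap_cost_2p`, and `cigar_score_2p` are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical penalty set for two-piece gap-affine scoring (lower is better). */
typedef struct {
  int mismatch;
  int gap_open1, gap_ext1;  /* first affine piece  */
  int gap_open2, gap_ext2;  /* second affine piece */
} penalties_2p_t;

/* Cost of one gap of length len: the cheaper of the two affine pieces. */
static int gap_cost_2p(const penalties_2p_t* p, int len) {
  const int c1 = p->gap_open1 + len * p->gap_ext1;
  const int c2 = p->gap_open2 + len * p->gap_ext2;
  return c1 < c2 ? c1 : c2;
}

/* Score an uncompressed CIGAR such as "MMXMIIDM". Matches cost 0; each
 * maximal run of I or D is charged as a single gap. Other characters
 * are ignored. */
static int cigar_score_2p(const char* ops, const penalties_2p_t* p) {
  int score = 0;
  size_t i = 0;
  while (ops[i] != '\0') {
    const char op = ops[i];
    if (op == 'I' || op == 'D') {
      int len = 0;
      while (ops[i] == op) { ++len; ++i; }  /* consume the whole run */
      score += gap_cost_2p(p, len);
    } else {
      if (op == 'X') score += p->mismatch;
      ++i;
    }
  }
  return score;
}
```

Assuming the arguments 4, 4, 2, 24, 1 in the constructor call at the top of this thread mean mismatch, gap-open 1, gap-extend 1, gap-open 2, gap-extend 2, the set `{4, 4, 2, 24, 1}` makes the second piece cheaper for gaps longer than 20 bases. The CIGAR with the lower score under the same penalties is the better alignment.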
For the long:
I refer to the previous results.
$> ./at ../data/long.txt --miniwfa=false --wfa2=true --ksw2=false
| 20220420 16:53:37.514 | INFO | Number of sequence pairs : 2301
| 20220420 16:53:38.368 | INFO | WFA2 time 370us 648ns
$> ./at ../data/long.txt --miniwfa=false --wfa2=false --ksw2=true
| 20220420 16:57:01.806 | INFO | Number of sequence pairs : 2301
| 20220420 16:57:12.184 | INFO | KSW2 time 4ms 509us
I believe that the newest biWFA could do even better. We could also check how close to optimal the KSW2 CIGARs are.
Then, for the clr1:
We have 2 sequences of length 18779 and 18956, aligning at edit distance 3645 (e~19%). It seems that there are no big indels; instead, the errors are distributed along the sequences.
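For reference, the ~19% figure is consistent with dividing the edit distance by the sequence length. Which of the two lengths is the denominator is an assumption here, but both give about the same value:

```latex
\[
e \approx \frac{3645}{18956} \approx 0.192
\qquad\text{or}\qquad
e \approx \frac{3645}{18779} \approx 0.194
\]
```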
| 20220420 22:19:21.555 | INFO | Number of sequence pairs : 1
| 20220420 22:19:21.590 | INFO | KSW2 time : 52ms 313us
| 20220420 22:20:18.136 | INFO | Number of sequence pairs : 1
| 20220420 22:20:18.490 | INFO | WFA2 C time : 354ms 531us
Compared to the exact-WFA, KSW2 does a pretty good job and returns the correct/optimal alignment. Considering this case in particular, the exact-WFA is forced to explore a lot of the DP-matrix.
Meanwhile, using the adaptive mode:
wavefront_aligner_set_heuristic_wfadaptive(wf_aligner,10,50,1);
./at ../data/clr1.txt --miniwfa=false --wfa2-c=true --wfa2-cpp=false --ksw2=false --rounds 20
| 20220420 22:47:13.780 | INFO | Number of sequence pairs : 1
| 20220420 22:47:14.762 | INFO | WFA2 C time : 49ms 92us
This is a good example of a sequence pair that is not particularly favourable to the WFA. In any case, WFA takes a comparable time when using heuristics (I guess we could tune the heuristics and do better, as KSW2 could too) and is 6x slower when calculating the optimal CIGAR.
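As a rough check, taking the clr1 timings reported above at face value (the runs used different settings, e.g. `--rounds 20` for the adaptive run, so these ratios are only indicative):

```latex
\[
\frac{t_{\text{WFA2 exact}}}{t_{\text{KSW2}}} = \frac{354.531\ \text{ms}}{52.313\ \text{ms}} \approx 6.8,
\qquad
\frac{t_{\text{WFA2 adaptive}}}{t_{\text{KSW2}}} = \frac{49.092\ \text{ms}}{52.313\ \text{ms}} \approx 0.94
\]
```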
I think we can take it from here and optimize those cases of your interest.