iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

Home Page: http://iree.dev/

Intermittent failure of iree/task/queue_test on arm64

stellaraccident opened this issue

Seeing this intermittently on arm64 linux runners.

The following tests FAILED:
97 - iree/task/queue_test (Failed)

[ RUN      ] QueueTest.TryStealAll
/work/runtime/src/iree/task/queue_test.cc:310: Failure
Expected equality of these values:
  &task_c
    Which is: 0xffffc6ae6260
  iree_task_queue_try_steal(&source_queue, &target_queue, 1000)
    Which is: NULL

/work/runtime/src/iree/task/queue_test.cc:311: Failure
Expected equality of these values:
  &task_d
    Which is: 0xffffc6ae6220
  iree_task_queue_pop_front(&target_queue)
    Which is: NULL

/work/runtime/src/iree/task/queue_test.cc:316: Failure
Value of: iree_task_queue_is_empty(&source_queue)
  Actual: false
Expected: true

Looks like some memory model, atomic, mumble-mumble thing.

cc @freddan80

Some discussion about this on Discord

From @bjacob:

Basically I am going to find out how to trigger that arm64 CI job, and then I'll do it on a PR that causes it to run build_and_test_tsan.sh instead of the normal test script.
If anyone knows, please share here how to trigger that CI job.

https://iree.dev/developers/general/contributing/#ci-behavior-manipulation

ci-exactly: build_test_all_arm64 should do that

weird - task_queue is just using a mutex IIRC

Trying at #15491

Results are in at #15491. All runtime tests pass with TSan. All tests under iree/task were rerun 32 times - no failure.

As a side note: I have run tests with TSan locally on macOS/arm64 and there, we get a bunch of TSan reports of data races in these tests. But it's a different OS and we have different code paths in places for macOS vs Linux (e.g. on futex usage), so that finding could be just false positives or could be about things not relevant to the present issue.

Definitely macOS specific so not what we're observing here, but here's a fix: #15499

Also trying ASan at #15501

100% of runtime tests also pass with ASan...

Summary: at this point, the intermittent failure here is not reproducing with either ASan or TSan (and the latter with 32 reruns) on the same arm64 CI hosts. I'm going to leave it there, having reached the end of my basic playbook... if we really care, we're going to have to reproduce it locally, e.g. by installing a Linux partition on a Mac, or getting shell access into the CI host...

Still no repro with TSan at 100 repetitions (requested 256 but apparently CTest caps at 100).

@bjacob thx for the quick analysis.

I haven't observed this issue yet on our machines (AWS - Graviton 2/3), but I'll try to run it a bunch of times and see if I can see it...

@stellaraccident how often does it occur, roughly? Is there a way for me to access those stats without having to manually click my way through the CI workflow runs?

I managed to reproduce this. It usually happens after a few thousand runs. Command:

IREE_CTEST_TESTS_REGEX=queue IREE_CTEST_REPEAT_UNTIL_FAIL_COUNT=100000 ./build_tools/cmake/ctest_all.sh ./iree-build-all/

on bc98b9a04b, on a 64-core Graviton 2. I haven't gotten to debugging it yet, but I think I'll have some time for that tomorrow.

Awesome!

Sanitizers might help here - you can see what I did in #15491 (for TSan) and #15501 (for ASan). Since this is in the runtime and does not depend on the compiler, you can configure like I do there - for example for TSan, these are my CMake flags: https://github.com/openxla/iree/blob/a1448a33f029b8dc8d1755142ba5c4ad2b2dd58a/build_tools/cmake/build_and_test_tsan.sh#L25-L58

Specifically:

cmake \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DIREE_ENABLE_ASSERTIONS=ON \
  -DIREE_BUILD_COMPILER=OFF \
  -DIREE_ENABLE_LLD=ON \
  -DIREE_ENABLE_TSAN=ON \

The BYTECODE_MODULE_* settings are irrelevant when IREE_BUILD_COMPILER=OFF.

Also note re

IREE_CTEST_TESTS_REGEX=queue IREE_CTEST_REPEAT_UNTIL_FAIL_COUNT=100000

I sometimes find that when trying to reproduce a TSan failure on a specific test with many repetitions, it pays to filter in more tests than just the one I care about, to introduce more non-determinism in the scheduling of threads. It is a problem with CTest that if you filter a single test and set many repetitions, they will run sequentially instead of in parallel -- the parallelization dimension is only "across filtered tests" and not "across repetitions". So my usual workarounds have been either (1) filter in more tests, e.g. filter iree/task not just queue, or (2) write my own testing script launching many parallel processes (in that case, since the goal is to create some scheduling chaos, it's not a bad idea to schedule more threads than the system's hardware concurrency).
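To make workaround (2) concrete, here is a minimal sketch (not an IREE script; the test binary path and process count are placeholders) of a runner that launches more concurrent copies of a standalone test binary than the machine has hardware threads:

// stress_runner.cc - sketch: oversubscribe the machine with parallel copies
// of a test binary to create scheduling chaos, since CTest won't parallelize
// repetitions of a single filtered test.
#include <sys/wait.h>
#include <unistd.h>

#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

int main(int argc, char** argv) {
  if (argc < 2) {
    std::fprintf(stderr, "usage: %s <test-binary> [num-processes]\n", argv[0]);
    return 2;
  }
  // Oversubscribe, e.g. 4x the hardware concurrency.
  int num_processes = argc > 2 ? std::atoi(argv[2])
                               : 4 * (int)std::thread::hardware_concurrency();
  std::vector<pid_t> children;
  for (int i = 0; i < num_processes; ++i) {
    pid_t pid = fork();
    if (pid == 0) {
      execl(argv[1], argv[1], (char*)nullptr);
      _exit(127);  // exec failed
    }
    if (pid > 0) children.push_back(pid);
  }
  int failures = 0;
  for (pid_t pid : children) {
    int status = 0;
    waitpid(pid, &status, 0);
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) ++failures;
  }
  std::printf("%d/%d runs failed\n", failures, (int)children.size());
  return failures ? 1 : 0;
}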

👍 I'll give this a try. I'm not familiar with the tool yet but I'll read up on it

TL;DR - Seems to be related to building in the docker container. I can't reproduce the issue natively (building on my Graviton 2, Ubuntu 22.04), only by building in the docker container. I'm a bit confused...

Some observations: I can run 100k iterations natively on the Graviton 2 (Ubuntu 22.04) without failure. The issue happens when I build in the container based on:

gcr.io/iree-oss/base-arm64@sha256:942d01a396f81bff06c5ef4643ba3c9f4500a09010cd30d9ed9be691ddaf1353

Trying to build with TSan in the docker container doesn't work for me, which is weird since I use the same docker image and your patch (strangely, building with TSan natively works). When I try to build with TSan in the docker container, I get CMake complaints:

-- Performing Test HAVE_POSIX_REGEX -- compiled but failed to run
CMake Error at third_party/benchmark/CMakeLists.txt:316 (message):
  Failed to determine the source files for the regular expression backend

I get around that by adding the cmake args:

  "-DHAVE_STD_REGEX=ON"
  "-DRUN_HAVE_STD_REGEX=1"

But then I get a bunch of errors like this:

FATAL: ThreadSanitizer CHECK failed: /build/llvm-toolchain-9-sL57p3/llvm-toolchain-9-9.0.1/compiler-rt/lib/tsan/rtl/tsan_platform_linux.cc:297 "((personality(old_personality | ADDR_NO_RANDOMIZE))) != ((-1))" (0xffffffffffffffff, 0xffffffffffffffff)
    #0 __tsan::TsanCheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) <null> (generate_embed_data+0x2c3f48)
    #1 __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) <null> (generate_embed_data+0x2d9d30)
    #2 __tsan::InitializePlatform() <null> (generate_embed_data+0x2cb15c)
    #3 __tsan::Initialize(__tsan::ThreadState*) <null> (generate_embed_data+0x2b4c70)
    #4 <null> <null> (ld-linux-aarch64.so.1+0xea38)
    #5 <null> <null> (ld-linux-aarch64.so.1+0x1180) 

I'm probably missing something here...

Anyways, it made me think that it has something to do with the LLVM version (the container uses v9 by default). So I built using clang-14 instead of clang-9, but the issue is still there.

The only way for me to reproduce the issue is by building in the docker container and running ctest on that build natively.

This is getting interesting! I googled the failed CHECK condition in your error log, and found this: golang/go#35547 (comment). It gives a plausible reason why this would only happen in a container and only on ppc64 and arm64 but not x86_64, and provides a possible workaround. Can you try that?

Works! So now I can build using the thread sanitizer inside the docker container. Let's see if it catches the issue. Feels like it should have been caught by now, but it's still running... (10x slower with TSan?)

I have to go pick up kids now though. I'll check in later

For reference, the workaround (interactive):

docker run --rm -it --mount="type=bind,src=./,dst=/iree" --security-opt seccomp=unconfined gcr.io/iree-oss/base-arm64@sha256:942d01a396f81bff06c5ef4643ba3c9f4500a09010cd30d9ed9be691ddaf1353

Fantastic! Yes, 10x slower sounds plausible for TSan (it varies depending on the workload).

Unfortunately, I don't see the issue - it's still running 🤷

Two things I can think of doing next: 1) look at the disassembly of the working / non-working queue test, 2) rule out the Ubuntu version as a factor. Will continue on Monday.

Fredrik sent me the test binary where IREE was built inside docker vs. the one built on the host, and I noticed when comparing them that the host version uses outline atomics whereas the docker one doesn't. I would still expect it to work, but since it's related to atomics and this failure is intermittent, it seemed potentially significant.

While at it, I was looking at the iree_task_queue_try_steal() function used in the TryStealAll test and noticed that no list splitting happens if the mutex is already held (the try-lock fails). Is that expected? Finally, I've noticed that the stolen_tasks list is initialized both in iree_task_queue_try_steal and in iree_task_list_split. Perhaps the latter should be guaranteed by the API, and then the former can be removed?

Hopefully I'll make more progress tomorrow.

Humm out-of-line atomics... I wonder what would happen if we linked together armv8.0 code (using LDXR/STXR loops) and armv8.1 code (using LDADD etc single-instruction atomics). Are these 100% compatible or is that a potential cause of data races?

iree_task_queue_try_steal could be reworked to early exit if the try fails, but I wouldn't change the API requirements of split (as that's used elsewhere) - I believe iree_task_queue_try_steal did more in the past when the lock would fail (try stealing in a variety of ways, one of which was via the lock). I don't know if I'd do any of that before figuring out the issue, though, as we have a reproducer of something that should work but isn't and we don't want to lose that.

Humm out-of-line atomics... I wonder what would happen if we linked together armv8.0 code (using LDXR/STXR loops) and armv8.1 code (using LDADD etc single-instruction atomics). Are these 100% compatible or is that a potential cause of data races?

That's where my mind was at, but wouldn't that require different parts of the IREE runtime to use different atomic instructions? I'm gonna keep digging.

iree_task_queue_try_steal could be reworked to early exit if the try fails, but I wouldn't change the API requirements of split (as that's used elsewhere) - I believe iree_task_queue_try_steal did more in the past when the lock would fail (try stealing in a variety of ways, one of which was via the lock). I don't know if I'd do any of that before figuring out the issue, though, as we have a reproducer of something that should work but isn't and we don't want to lose that.

I wasn't suggesting a change to solve this problem, just flagging a potential improvement for later on. Also, regarding the double initialization, I was suggesting documenting it as part of split, thus removing the need for the initialization in the caller.

No worries!
In C it's always best to not rely on documentation for things like default initialization - that iree_task_list_split memset(0)'s the list does not mean we want to also not memset(0) it in the callers - compilers are better at eliding redundant memset(0)s than humans are at reading documentation and knowing when they need to do it themselves or not :)
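As a toy illustration of that point (a sketch, not IREE code): both the callee and the caller zero the output structure, and a compiler that can see both will typically elide the redundant memset.

#include <cstring>

struct task_list { void* head; void* tail; };

// Callee guarantees a zeroed output list before filling it in.
void list_split(task_list* out_list) {
  std::memset(out_list, 0, sizeof(*out_list));
  // ... move some tasks into *out_list ...
}

// Caller zeroes it too, so it never depends on reading the callee's docs;
// the redundant memset is usually optimized away once both are visible.
void caller() {
  task_list stolen;
  std::memset(&stolen, 0, sizeof(stolen));
  list_split(&stolen);
}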

I can reproduce the failure with the test built by @freddan80 in docker by running just queue_test in a loop. From what I could find online, googletest does not execute tests in parallel, so that suggests more of an uninitialized-memory bug or a use-after-free. However, running the test under valgrind's memcheck with --leak-check=full didn't show anything, even for runs where the test failed.

Short update. I managed to reproduce the issue by building the runtime outside of docker (on Ubuntu 22.04), coincidentally by using clang 12.0.1... Hence, this:

TL;DR - Seems to be related to building in the docker container. I can't reproduce the issue natively (building on my Graviton 2, Ubuntu 22.04), only by building in the docker container. I'm a bit confused...

is debunked. It makes me believe the issue is generic and just happens to appear randomly depending on who-knows-what. I've run a few marathon ASan and TSan tests with various permutations, but they never trigger the issue. I'll give this some more attention later today.

Great! Focus on TSan. Run the test in a script that runs it in parallel in many threads, not relying on ctest to parallelize, due to the above explained ctest limitation. Set a really large number of threads to make scheduling chaotic.

Also maybe edit the test to make it loop more iterations.

@RoboTux and I had a joint look at this today. It seems the lock fails, which is eventually caused by this:

https://github.com/openxla/iree/blame/f031ce8ea3050c9a87afde41df253472b362c241/runtime/src/iree/base/internal/synchronization.c#L456-L473

which in turn calls this:

https://github.com/openxla/iree/blob/1012586173e74509841f004e46ec5cfa3925285b/runtime/src/iree/base/internal/atomics_clang.h#L62

which returns false. By printing, we can see the mutex is not taken. IIUC __c11_atomic_compare_exchange_weak is allowed to fail spuriously (ref). Now, if I replace that call with __c11_atomic_compare_exchange_strong, the test passes 100% of the time.

The asm of __c11_atomic_compare_exchange_weak could look something like this depending on compiler, version etc.:

0000000000242250 <iree_slim_mutex_try_lock>:
  242250:       885ffc08        ldaxr   w8, [x0]
  242254:       34000068        cbz     w8, 242260 <iree_slim_mutex_try_lock+0x10>
  242258:       d5033f5f        clrex
  24225c:       14000004        b       24226c <iree_slim_mutex_try_lock+0x1c>
  242260:       320107e8        mov     w8, #0x80000001                 // #-2147483647
  242264:       88097c08        stxr    w9, w8, [x0]
  242268:       34000069        cbz     w9, 242274 <iree_slim_mutex_try_lock+0x24>
  24226c:       2a1f03e0        mov     w0, wzr
  242270:       d65f03c0        ret
  242274:       52800020        mov     w0, #0x1                        // #1
  242278:       d65f03c0        ret

Looking at the Arm64 docs, I read this:

The CLREX instruction clears the monitors, but unlike in ARMv7, exception entry or return also clears the monitor. The monitor might also be cleared spuriously, for example by cache evictions or other reasons not directly related to the application. Software must avoid having any explicit memory accesses, system control register updates, or cache maintenance instructions between paired LDXR and STXR instructions.

Exception entry, cache eviction ... etc. I read it as: things can go wrong when using __c11_atomic_compare_exchange_weak without handling spurious failures. I guess the test needs to handle this type of intermittent failure? Or is using __c11_atomic_compare_exchange_strong preferred (performance penalty?).
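For illustration, here's a minimal sketch of the distinction (using std::atomic rather than the C11 builtins IREE uses; this is not the actual iree_slim_mutex code):

#include <atomic>

struct TinyLock {
  std::atomic<int> state{0};  // 0 == unlocked, 1 == locked

  // May return false even when the lock is free: the weak compare-exchange
  // is allowed to fail spuriously (e.g. the LDXR/STXR exclusive monitor
  // getting cleared by an interrupt, context switch, or cache eviction).
  bool try_lock_weak() {
    int expected = 0;
    return state.compare_exchange_weak(expected, 1, std::memory_order_acquire);
  }

  // Fails only if the lock is genuinely held: the strong form retries
  // internally until the failure is "real" (the observed value differs).
  bool try_lock_strong() {
    int expected = 0;
    return state.compare_exchange_strong(expected, 1,
                                         std::memory_order_acquire);
  }

  void unlock() { state.store(0, std::memory_order_release); }
};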

Let me know what you think.

Cheers!

Wow, thanks for the debugging!

This has been a compare_exchange_weak ever since this file was first checked-in in 2020:
https://github.com/openxla/iree/blob/24afcead3511e144bf4b55654a4c5b6930b75669/iree/base/synchronization.c

I added the comments much later as an explanation of this code. I thought, then, that the compare_exchange_weak was correct here. I will read your explanation and think about it -- give me some time to catch up!

It all depends on what the intent behind iree_task_queue_try_steal and iree_slim_mutex_try_lock really is. The weak locking is not guaranteed to lock even if the lock is free. It seems natural to me for iree_slim_mutex_try_lock to have the same semantics. I'm not sure about iree_task_queue_try_steal, but 91af04c seems to suggest that it was intentional to avoid too much contention. If that's the case, then perhaps queue_test.cc itself should be changed:

  • either to expect NULL or a task upon calling iree_task_queue_try_steal
  • or to loop around iree_task_queue_try_steal until it returns non-NULL.

By the way, when outline atomics end up being used, the sequence uses a single CASA instruction if LSE instructions are available, which guarantees that a free lock gets locked. FYI, whether outline atomics are used depends on:

  • the llvm version (must contain 4d7df43ffdb460dddb2877a886f75f45c3fee188)
  • platform (must be using compiler-rt or libgcc >= 9.3.1 if llvm was built after commit c5e7e649d537067dec7111f3de1430d0fc8a4d11)
  • target ISA (if LSE is available, e.g. Armv8.1-A and later, it will use that instead of outline atomics)

Sorry about the delay, diving into this now...

  • platform (must be using compiler-rt or libgcc >= 9.3.1 if llvm was built after commit c5e7e649d537067dec7111f3de1430d0fc8a4d11)

We'd have to check, but I think the old EL Linux we build deployment artifacts on may have baseline libraries older than this.

Now that I've come back up to speed on this, I think that @RoboTux hit the nail on the head with this:

It all depends on what the intent behind iree_task_queue_try_steal and iree_slim_mutex_try_lock really is. The weak locking is not guaranteed to lock even if the lock is free. It seems natural to me for iree_slim_mutex_try_lock to have the same semantics. I'm not sure about iree_task_queue_try_steal, but 91af04c seems to suggest that it was intentional to avoid too much contention. If that's the case, then perhaps queue_test.cc itself should be changed:

* either to expect NULL or a task upon calling iree_task_queue_try_steal
* or to loop around iree_task_queue_try_steal until it returns non-NULL.

Indeed, it is intentional that functions that have _try_ in their name can fail, for whatever reason --- trying to specify anything more than "for whatever reason" would be a losing game anyway.

"Whatever reason" may include any combination of thread contention, and implementation details of atomics. So for instance, the above-quoted Arm docs explaining various possible causes of spurious failures are just part of these "whatever reasons". They don't compromise the correctness of iree_slim_mutex_try_lock.

So like @RoboTux suggests, it really is just a buggy test here. Callers of _try_ functions can't require success, yet this test does exactly that: https://github.com/openxla/iree/blob/7963ca781b6f3782ad21d20ac24b7adca150e36f/runtime/src/iree/task/queue_test.cc#L309-L310 (this is just one of 6 occurrences in this file).

Outside of this test, iree_task_queue_try_steal is called only in one location by iree_task_worker_try_steal_task:
https://github.com/openxla/iree/blob/7963ca781b6f3782ad21d20ac24b7adca150e36f/runtime/src/iree/task/worker.c#L156

In turn, iree_task_worker_try_steal_task is called only in one location by iree_task_executor_try_steal_task_from_affinity_set:
https://github.com/openxla/iree/blob/7963ca781b6f3782ad21d20ac24b7adca150e36f/runtime/src/iree/task/executor.c#L555

This call is inside a for loop of "max_theft_attempts" iterations. So this caller is quite clearly OK with spurious failures.

So I think the resolution here is:

  1. Treat this as a test bug, fix queue_test.cc.
  2. Clarify in the doc comment that _try_ functions may fail spuriously. For example, the comment on iree_task_worker_try_steal_task could use clearer wording --- it currently says // Returns NULL if no tasks are available and otherwise up to |max_tasks| tasks, that maybe doesn't make it clear that it could also return NULL spuriously even if tasks are available.

@benvanik WDYT? The alternative is to try to specify _try_ functions as never failing spuriously, then switching to compare_exchange_strong internally, allowing the existing test to pass as-is as noted by @RoboTux @freddan80. However, since the only non-test caller of iree_task_queue_try_steal seems to be happy with spurious failures, it seems unnecessary to force it to pay the higher cost of non-spuriously-failing compare_exchange_strong.

Note - there is another reason besides performance optimization to avoid compare_exchange_strong if we can. On some targets, compare_exchange_strong has to compile to a loop retrying compare_exchange_weak. These hidden loops mean a potentially unbounded number of iterations in contended cases, in turn causing potentially unbounded contention. By contrast, compare_exchange_weak is an easier-to-think-about, mostly-bounded primitive (although even it can compile down to a loop on targets that don't have a single-instruction compare_exchange_weak, such as Arm without LSE).
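For reference, the textbook construction of a strong compare-exchange out of a weak one makes that hidden loop explicit (a sketch, not any particular compiler's actual lowering):

#include <atomic>

// On failure, compare_exchange_weak writes the observed value back into
// |expected|, so a spurious failure is one where the observed value still
// equals what we asked for.
bool cas_strong_via_weak(std::atomic<int>& a, int& expected, int desired) {
  const int wanted = expected;
  for (;;) {
    if (a.compare_exchange_weak(expected, desired)) return true;  // swapped
    if (expected != wanted) return false;  // genuine mismatch: give up
    // expected == wanted: the failure was spurious, retry.
  }
}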

Even if there is not a code bug here, I am really grateful for the analysis: these kinds of sync/atomic things are only as good as the eyes that have been put on them. Thank you.

I wonder if it is reasonable to run the test code in a loop with some high number of max checks. IIUC, the condition should eventually be met, but the platform is free to be non-deterministic about it.

so weird!

try functions should be allowed to fail - our iree_slim_mutex_try_lock is just pthread_mutex_trylock/std::mutex::try_lock/etc and should have the same behavior. the queue try_steal is the same (as it's based on those). so sounds like a comment improvement and fixing the test is the way to go - I think the test was assuming it couldn't fail in these cases as it really shouldn't be possible, but apparently it is here and that's unfortunate on arm.

we don't want to lose test coverage because of this quirk, so like stella says maybe just loop a bunch? should be fast (compared to our other tests that compile/run entire models).

(really great investigation all - I learned a bunch!)

Yes, let's make the test loop. If the impl was switched to compare_exchange_strong, that would amount to bringing that loop inside the impl. Instead we're just adding the loop to the test. I think we can just let them loop indefinitely, just like compare_exchange_strong would. Worst case, there are test timeouts --- but that would only happen if there's a bug. Contention and implementation details would only result in a small number of loop iterations.
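For example (a fragment-level sketch of the idea, not necessarily what #15636 does), the failing assertion in queue_test.cc could become:

// Loop until the try-steal succeeds; spurious failures just cost extra
// iterations, and a real bug would surface as a test timeout.
iree_task_t* stolen = NULL;
while (!stolen) {
  stolen = iree_task_queue_try_steal(&source_queue, &target_queue, 1000);
}
EXPECT_EQ(&task_c, stolen);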

@RoboTux @freddan80 #15636 is out; it's something we probably want to merge anyway, since we now have a pretty clear understanding that it's needed, and merging it will make it even easier to retry. But it will be very interesting to hear a confirmation that it actually fixes the problem - feel free to reopen if it doesn't.

Yes, let's make the test loop. If the impl was switched to compare_exchange_strong, that would amount to bringing that loop inside the impl. Instead we're just adding the loop to the test. I think we can just let them loop indefinitely, just like compare_exchange_strong would. Worst case, there are test timeouts --- but that would only happen if there's a bug. Contention and implementation details would only result in a small number of loop iterations.

That's what I would have done myself: loop and rely on the CTest timeout for the worst case where, for some weird reason, the lock can never be acquired in a reasonable time. I'd argue that if no timeouts happen for months and we suddenly start having some, it indicates the machine might have suddenly gotten very busy - or something else worth investigating.

@bjacob @benvanik Fix seems to work 👍 Nice team effort getting this fixed. I learned a lot debugging this issue!