iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

Home Page: http://iree.dev/

Intermittent failure of iree/task/queue_test on arm64

stellaraccident opened this issue

Seeing this intermittently on arm64 linux runners.

The following tests FAILED:
97 - iree/task/queue_test (Failed)

[ RUN      ] QueueTest.TryStealAll
/work/runtime/src/iree/task/queue_test.cc:310: Failure
Expected equality of these values:
  &task_c
    Which is: 0xffffc6ae6260
  iree_task_queue_try_steal(&source_queue, &target_queue, 1000)
    Which is: NULL

/work/runtime/src/iree/task/queue_test.cc:311: Failure
Expected equality of these values:
  &task_d
    Which is: 0xffffc6ae6220
  iree_task_queue_pop_front(&target_queue)
    Which is: NULL

/work/runtime/src/iree/task/queue_test.cc:316: Failure
Value of: iree_task_queue_is_empty(&source_queue)
  Actual: false
Expected: true

Looks like some memory model, atomic, mumble-mumble thing.

cc @freddan80

Some discussion about this on Discord

From @bjacob:

Basically I am going to find out how to trigger that arm64 CI job, and then I'll do it on a PR that causes it to run build_and_test_tsan.sh instead of the normal test script.
If anyone knows, please share here how to trigger that CI job.

https://iree.dev/developers/general/contributing/#ci-behavior-manipulation

ci-exactly: build_test_all_arm64 should do that

weird - task_queue is just using a mutex IIRC

Trying at #15491

Results are in at #15491. All runtime tests pass with TSan. All tests under iree/task were rerun 32 times - no failure.

As a side note: I have run tests with TSan locally on macOS/arm64 and there, we get a bunch of TSan reports of data races in these tests. But it's a different OS and we have different code paths in places for macOS vs Linux (e.g. on futex usage), so that finding could be just false positives or could be about things not relevant to the present issue.

Definitely macOS specific so not what we're observing here, but here's a fix: #15499

Also trying ASan at #15501

100% of runtime tests also pass with ASan...

Summary: at this point, the intermittent failure here is not reproducing with either ASan or TSan (and the latter with 32 reruns) on the same arm64 CI hosts. I'm going to leave it there, having reached the end of my basic playbook... if we really care, we're going to have to reproduce it locally, e.g. by installing a Linux partition on a Mac, or getting shell access into the CI host...

Still no repro with TSan at 100 repetitions (requested 256 but apparently CTest caps at 100).

@bjacob thx for the quick analysis.

I haven't observed this issue yet on our machines (AWS - Graviton 2/3), but I'll try to run it a bunch of times and see if I can see it...

@stellaraccident how often does it occur, roughly? Is there a way for me to access those stats without having to manually click my way through the CI workflow runs?

I managed to reproduce this. It usually happens after a few thousand runs. Command:

IREE_CTEST_TESTS_REGEX=queue IREE_CTEST_REPEAT_UNTIL_FAIL_COUNT=100000 ./build_tools/cmake/ctest_all.sh ./iree-build-all/

on bc98b9a04b, on a 64-core Graviton 2. I haven't gotten to debugging it yet, but I think I'll have some time for that tomorrow.

Awesome!

Sanitizers might help here - you can see what I did in #15491 (for TSan) and #15501 (for ASan). Since this is in the runtime and does not depend on the compiler, you can configure like I do there - for example for TSan, these are my CMake flags: https://github.com/openxla/iree/blob/a1448a33f029b8dc8d1755142ba5c4ad2b2dd58a/build_tools/cmake/build_and_test_tsan.sh#L25-L58

Specifically:

cmake \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DIREE_ENABLE_ASSERTIONS=ON \
  -DIREE_BUILD_COMPILER=OFF \
  -DIREE_ENABLE_LLD=ON \
  -DIREE_ENABLE_TSAN=ON \

The BYTECODE_MODULE_* settings are irrelevant when IREE_BUILD_COMPILER=OFF.

Also note re

IREE_CTEST_TESTS_REGEX=queue IREE_CTEST_REPEAT_UNTIL_FAIL_COUNT=100000

I sometimes find that when trying to reproduce a TSan failure on a specific test with many repetitions, it pays to filter in more tests than just the one I care about, to introduce more non-determinism in the scheduling of threads. It is a problem with CTest that if you filter a single test and set many repetitions, they will run sequentially instead of in parallel -- the parallelization dimension is only "across filtered tests" and not "across repetitions". So my usual workarounds have been either (1) filter in more tests, e.g. filter iree/task not just queue, or (2) write my own testing script launching many parallel processes (in that case, since the goal is to create some scheduling chaos, it's not a bad idea to schedule more threads than the system's hardware concurrency).
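To make workaround (2) concrete, here is a minimal sketch (not an IREE script; the test binary path and process count are placeholders) of a runner that launches more concurrent copies of a standalone test binary than the machine has hardware threads:

// stress_runner.cc - sketch: oversubscribe the machine with parallel copies
// of a test binary to create scheduling chaos, since CTest won't parallelize
// repetitions of a single filtered test.
#include <sys/wait.h>
#include <unistd.h>

#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

int main(int argc, char** argv) {
  if (argc < 2) {
    std::fprintf(stderr, "usage: %s <test-binary> [num-processes]\n", argv[0]);
    return 2;
  }
  // Oversubscribe, e.g. 4x the hardware concurrency.
  int num_processes = argc > 2 ? std::atoi(argv[2])
                               : 4 * (int)std::thread::hardware_concurrency();
  std::vector<pid_t> children;
  for (int i = 0; i < num_processes; ++i) {
    pid_t pid = fork();
    if (pid == 0) {
      execl(argv[1], argv[1], (char*)nullptr);
      _exit(127);  // exec failed
    }
    if (pid > 0) children.push_back(pid);
  }
  int failures = 0;
  for (pid_t pid : children) {
    int status = 0;
    waitpid(pid, &status, 0);
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) ++failures;
  }
  std::printf("%d/%d runs failed\n", failures, (int)children.size());
  return failures ? 1 : 0;
}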

👍 I'll give this a try. I'm not familiar with the tool yet but I'll read up on it

TL;DR - Seems to be related to building in the docker container. I can't reproduce the issue natively (building on my Graviton 2, Ubuntu 22.04), only by building in the docker container. I'm a bit confused...

Some observations: I can run 100k iterations natively on the Graviton 2 (Ubuntu 22.04) without failure. The issue happens when I build in the container based on:

gcr.io/iree-oss/base-arm64@sha256:942d01a396f81bff06c5ef4643ba3c9f4500a09010cd30d9ed9be691ddaf1353

Trying to build with TSan in the docker container doesn't work for me, which is weird since I use the same docker image and your patch (strangely, building with TSan natively works). When I try to build with TSan in the docker container, I get CMake complaints:

-- Performing Test HAVE_POSIX_REGEX -- compiled but failed to run
CMake Error at third_party/benchmark/CMakeLists.txt:316 (message):
  Failed to determine the source files for the regular expression backend

I get around that by adding the cmake args:

  "-DHAVE_STD_REGEX=ON"
  "-DRUN_HAVE_STD_REGEX=1"

But then I get a bunch of errors like this:

FATAL: ThreadSanitizer CHECK failed: /build/llvm-toolchain-9-sL57p3/llvm-toolchain-9-9.0.1/compiler-rt/lib/tsan/rtl/tsan_platform_linux.cc:297 "((personality(old_personality | ADDR_NO_RANDOMIZE))) != ((-1))" (0xffffffffffffffff, 0xffffffffffffffff)
    #0 __tsan::TsanCheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) <null> (generate_embed_data+0x2c3f48)
    #1 __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) <null> (generate_embed_data+0x2d9d30)
    #2 __tsan::InitializePlatform() <null> (generate_embed_data+0x2cb15c)
    #3 __tsan::Initialize(__tsan::ThreadState*) <null> (generate_embed_data+0x2b4c70)
    #4 <null> <null> (ld-linux-aarch64.so.1+0xea38)
    #5 <null> <null> (ld-linux-aarch64.so.1+0x1180) 

I'm probably missing something here...

Anyways, it made me think that it has something to do with the LLVM version (the container uses v9 by default). So I built using clang-14 instead of clang-9, but the issue is still there.

The only way for me to reproduce the issue is by building in the docker container and running ctest on that build natively.

This is getting interesting! I googled the failed CHECK condition in your error log, and found this: golang/go#35547 (comment). It gives a plausible reason why this would only happen in a container and only on ppc64 and arm64 but not x86_64, and provides a possible workaround. Can you try that?

Works! So now I can build using the thread sanitizer inside the docker container. Let's see if it catches the issue. Feels like it should have been caught by now, but it's still running... (10x slower with TSan?)

I have to go pick up kids now though. I'll check in later

For reference, the workaround (interactive):

docker run --rm -it --mount="type=bind,src=./,dst=/iree" --security-opt seccomp=unconfined gcr.io/iree-oss/base-arm64@sha256:942d01a396f81bff06c5ef4643ba3c9f4500a09010cd30d9ed9be691ddaf1353

Fantastic! Yes, 10x slower sounds plausible for TSan (it varies depending on the workload).

Unfortunately, I don't see the issue - it's still running 🤷

Two things I can think of doing next: 1) look at the disassembly of the working / non-working queue test, 2) rule out the Ubuntu version as a factor. Will continue on Monday.

Fredrik sent me the test binary where IREE was built inside docker vs. the one built on the host, and I noticed when comparing them that the host version uses outline atomics whereas the docker one doesn't. I would still expect it to work, but since it's related to atomics and this failure is intermittent, it seemed potentially significant.

While at it, I was looking at the iree_task_queue_try_steal() function used in the TryStealAll test and noticed that no list splitting happens if the mutex is already held (the try-lock fails). Is that expected? Finally, I've noticed that the stolen_tasks list is initialized both in iree_task_queue_try_steal and in iree_task_list_split. Perhaps the latter should be guaranteed by the API, and then the former can be removed?

Hopefully I'll make more progress tomorrow.

Humm out-of-line atomics... I wonder what would happen if we linked together armv8.0 code (using LDXR/STXR loops) and armv8.1 code (using LDADD etc single-instruction atomics). Are these 100% compatible or is that a potential cause of data races?

iree_task_queue_try_steal could be reworked to early exit if the try fails, but I wouldn't change the API requirements of split (as that's used elsewhere) - I believe iree_task_queue_try_steal did more in the past when the lock would fail (try stealing in a variety of ways, one of which was via the lock). I don't know if I'd do any of that before figuring out the issue, though, as we have a reproducer of something that should work but isn't and we don't want to lose that.

Humm out-of-line atomics... I wonder what would happen if we linked together armv8.0 code (using LDXR/STXR loops) and armv8.1 code (using LDADD etc single-instruction atomics). Are these 100% compatible or is that a potential cause of data races?

That's where my mind was at, but wouldn't that require different parts of the IREE runtime to use different atomic instructions? I'm gonna keep digging.

iree_task_queue_try_steal could be reworked to early exit if the try fails, but I wouldn't change the API requirements of split (as that's used elsewhere) - I believe iree_task_queue_try_steal did more in the past when the lock would fail (try stealing in a variety of ways, one of which was via the lock). I don't know if I'd do any of that before figuring out the issue, though, as we have a reproducer of something that should work but isn't and we don't want to lose that.

I wasn't suggesting a change to solve this problem, just flagging a potential improvement for later on. Also, regarding the double initialization, I was suggesting documenting it as part of split, thus removing the need for the initialization in the caller.

No worries!
In C it's always best to not rely on documentation for things like default initialization - that iree_task_list_split memset(0)'s the list does not mean we want to also not memset(0) it in the callers - compilers are better at eliding redundant memset(0)s than humans are at reading documentation and knowing when they need to do it themselves or not :)
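As a toy illustration of that point (a sketch, not IREE code): both the callee and the caller zero the output structure, and a compiler that can see both will typically elide the redundant memset.

#include <cstring>

struct task_list { void* head; void* tail; };

// Callee guarantees a zeroed output list before filling it in.
void list_split(task_list* out_list) {
  std::memset(out_list, 0, sizeof(*out_list));
  // ... move some tasks into *out_list ...
}

// Caller zeroes it too, so it never depends on reading the callee's docs;
// the redundant memset is usually optimized away once both are visible.
void caller() {
  task_list stolen;
  std::memset(&stolen, 0, sizeof(stolen));
  list_split(&stolen);
}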

I can reproduce the failure with the test built by @freddan80 in docker by running just queue_test in a loop. From what I could find online, googletest does not execute tests in parallel, so that suggests more of an uninitialized-memory bug or a use-after-free. However, running the test under valgrind's memcheck with --leak-check=full didn't show anything, even for runs where the test failed.

Short update. I managed to reproduce the issue by building the runtime outside of docker (on Ubuntu 22.04), coincidentally by using clang 12.0.1... Hence, this:

TL;DR - Seems to be related to building in the docker container. I can't reproduce the issue natively (building on my Graviton 2, Ubuntu 22.04), only by building in the docker container. I'm a bit confused...

is debunked. It makes me believe the issue is generic and just happens to appear randomly depending on who-knows-what. I've run a few marathon ASan and TSan tests with various permutations, but they never trigger the issue. I'll give this some more attention later today.

Great! Focus on TSan. Run the test in a script that runs it in parallel in many threads, not relying on ctest to parallelize, due to the above explained ctest limitation. Set a really large number of threads to make scheduling chaotic.

Also maybe edit the test to make it loop more iterations.

@RoboTux and I had a joint look at this today. It seems the lock fails, which is eventually caused by this:

https://github.com/openxla/iree/blame/f031ce8ea3050c9a87afde41df253472b362c241/runtime/src/iree/base/internal/synchronization.c#L456-L473

which in turn calls this:

https://github.com/openxla/iree/blob/1012586173e74509841f004e46ec5cfa3925285b/runtime/src/iree/base/internal/atomics_clang.h#L62

which returns false. By printing, we can see the mutex is not taken. IIUC __c11_atomic_compare_exchange_weak is allowed to fail spuriously (ref). Now, if I replace that call with __c11_atomic_compare_exchange_strong, the test passes 100% of the time.

The asm of __c11_atomic_compare_exchange_weak could look something like this depending on compiler, version etc.:

0000000000242250 <iree_slim_mutex_try_lock>:
  242250:       885ffc08        ldaxr   w8, [x0]
  242254:       34000068        cbz     w8, 242260 <iree_slim_mutex_try_lock+0x10>
  242258:       d5033f5f        clrex
  24225c:       14000004        b       24226c <iree_slim_mutex_try_lock+0x1c>
  242260:       320107e8        mov     w8, #0x80000001                 // #-2147483647
  242264:       88097c08        stxr    w9, w8, [x0]
  242268:       34000069        cbz     w9, 242274 <iree_slim_mutex_try_lock+0x24>
  24226c:       2a1f03e0        mov     w0, wzr
  242270:       d65f03c0        ret
  242274:       52800020        mov     w0, #0x1                        // #1
  242278:       d65f03c0        ret

Looking at the Arm64 docs, I read this:

The CLREX instruction clears the monitors, but unlike in ARMv7, exception entry or return also clears the monitor. The monitor might also be cleared spuriously, for example by cache evictions or other reasons not directly related to the application. Software must avoid having any explicit memory accesses, system control register updates, or cache maintenance instructions between paired LDXR and STXR instructions.

Exception entry, cache eviction ... etc. I read it as: things can go wrong when using __c11_atomic_compare_exchange_weak without handling spurious failures. I guess the test needs to handle this type of intermittent failure? Or is using __c11_atomic_compare_exchange_strong preferred (performance penalty?).
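For illustration, here's a minimal sketch of the distinction (using std::atomic rather than the C11 builtins IREE uses; this is not the actual iree_slim_mutex code):

#include <atomic>

struct TinyLock {
  std::atomic<int> state{0};  // 0 == unlocked, 1 == locked

  // May return false even when the lock is free: the weak compare-exchange
  // is allowed to fail spuriously (e.g. the LDXR/STXR exclusive monitor
  // getting cleared by an interrupt, context switch, or cache eviction).
  bool try_lock_weak() {
    int expected = 0;
    return state.compare_exchange_weak(expected, 1, std::memory_order_acquire);
  }

  // Fails only if the lock is genuinely held: the strong form retries
  // internally until the failure is "real" (the observed value differs).
  bool try_lock_strong() {
    int expected = 0;
    return state.compare_exchange_strong(expected, 1,
                                         std::memory_order_acquire);
  }

  void unlock() { state.store(0, std::memory_order_release); }
};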

Let me know what you think.

Cheers!

Wow, thanks for the debugging!

This has been a compare_exchange_weak ever since this file was first checked-in in 2020:
https://github.com/openxla/iree/blob/24afcead3511e144bf4b55654a4c5b6930b75669/iree/base/synchronization.c

I added the comments much later as an explanation of this code. I thought, then, that the compare_exchange_weak was correct here. I will read your explanation and think about it -- give me some time to catch up!

It all depends on what the intent behind iree_task_queue_try_steal and iree_slim_mutex_try_lock really is. The weak locking is not guaranteed to lock even if the lock is free. It seems natural to me for iree_slim_mutex_try_lock to have the same semantics. I'm not sure about iree_task_queue_try_steal, but 91af04c seems to suggest that it was intentional to avoid too much contention. If that's the case, then perhaps queue_test.cc itself should be changed:

  • either to expect NULL or a task upon calling iree_task_queue_try_steal
  • or to loop around iree_task_queue_try_steal until it returns non-NULL.

By the way, when outline atomics end up being used, the sequence uses a single CASA instruction if LSE instructions are available, which guarantees that a free lock gets locked. FYI, whether outline atomics are used depends on:

  • the llvm version (must contain 4d7df43ffdb460dddb2877a886f75f45c3fee188)
  • platform (must be using compiler-rt or libgcc >= 9.3.1 if llvm was built after commit c5e7e649d537067dec7111f3de1430d0fc8a4d11)
  • target ISA (if LSE is available, e.g. Armv8.1-A and later, it will use that instead of outline atomics)

Sorry about the delay, diving into this now...

  • platform (must be using compiler-rt or libgcc >= 9.3.1 if llvm was built after commit c5e7e649d537067dec7111f3de1430d0fc8a4d11)

We'd have to check, but I think the old EL Linux we build deployment artifacts on may have baseline libraries older than this.

Now that I've come back up to speed on this, I think that @RoboTux hit the nail on the head with this:

It all depends on what the intent behind iree_task_queue_try_steal and iree_slim_mutex_try_lock really is. The weak locking is not guaranteed to lock even if the lock is free. It seems natural to me for iree_slim_mutex_try_lock to have the same semantics. I'm not sure about iree_task_queue_try_steal, but 91af04c seems to suggest that it was intentional to avoid too much contention. If that's the case, then perhaps queue_test.cc itself should be changed:

* either to expect NULL or a task upon calling iree_task_queue_try_steal
* or to loop around iree_task_queue_try_steal until it returns non-NULL.

Indeed, it is intentional that functions that have _try_ in their name can fail, for whatever reason --- trying to specify anything more than "for whatever reason" would be a losing game anyway.

"Whatever reason" may include any combination of thread contention, and implementation details of atomics. So for instance, the above-quoted Arm docs explaining various possible causes of spurious failures are just part of these "whatever reasons". They don't compromise the correctness of iree_slim_mutex_try_lock.

So like @RoboTux suggests, it really is just a buggy test here. Callers of _try_ functions can't require success, yet this test does exactly that: https://github.com/openxla/iree/blob/7963ca781b6f3782ad21d20ac24b7adca150e36f/runtime/src/iree/task/queue_test.cc#L309-L310 (this is just one of 6 occurrences in this file).

Outside of this test, iree_task_queue_try_steal is called only in one location by iree_task_worker_try_steal_task:
https://github.com/openxla/iree/blob/7963ca781b6f3782ad21d20ac24b7adca150e36f/runtime/src/iree/task/worker.c#L156

In turn, iree_task_worker_try_steal_task is called only in one location by iree_task_executor_try_steal_task_from_affinity_set:
https://github.com/openxla/iree/blob/7963ca781b6f3782ad21d20ac24b7adca150e36f/runtime/src/iree/task/executor.c#L555

This call is inside a for loop of "max_theft_attempts" iterations. So this caller is quite clearly OK with spurious failures.

So I think the resolution here is:

  1. Treat this as a test bug, fix queue_test.cc.
  2. Clarify in the doc comment that _try_ functions may fail spuriously. For example, the comment on iree_task_worker_try_steal_task could use clearer wording --- it currently says // Returns NULL if no tasks are available and otherwise up to |max_tasks| tasks, that maybe doesn't make it clear that it could also return NULL spuriously even if tasks are available.

@benvanik WDYT? The alternative is to try to specify _try_ functions as never failing spuriously, then switching to compare_exchange_strong internally, allowing the existing test to pass as-is as noted by @RoboTux @freddan80. However, since the only non-test caller of iree_task_queue_try_steal seems to be happy with spurious failures, it seems unnecessary to force it to pay the higher cost of non-spuriously-failing compare_exchange_strong.

Note - there is another reason besides performance optimization to avoid compare_exchange_strong if we can. On some targets, compare_exchange_strong has to compile to a loop retrying compare_exchange_weak. These hidden loops mean a potentially unbounded number of iterations in contended cases, in turn causing potentially unbounded contention. By contrast, compare_exchange_weak is an easier-to-think-about, mostly-bounded primitive (although even it can compile down to a loop on targets that don't have a single-instruction compare_exchange_weak, such as Arm without LSE).
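For reference, the textbook construction of a strong compare-exchange out of a weak one makes that hidden loop explicit (a sketch, not any particular compiler's actual lowering):

#include <atomic>

// On failure, compare_exchange_weak writes the observed value back into
// |expected|, so a spurious failure is one where the observed value still
// equals what we asked for.
bool cas_strong_via_weak(std::atomic<int>& a, int& expected, int desired) {
  const int wanted = expected;
  for (;;) {
    if (a.compare_exchange_weak(expected, desired)) return true;  // swapped
    if (expected != wanted) return false;  // genuine mismatch: give up
    // expected == wanted: the failure was spurious, retry.
  }
}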

Even if there is not a code bug here, I am really grateful for the analysis: these kinds of sync/atomic things are only as good as the eyes that have been put on them. Thank you.

I wonder if it is reasonable to run the test code in a loop with some high number of max checks. IIUC, the condition should eventually be met, but the platform is free to be non-deterministic about it.

so weird!

try functions should be allowed to fail - our iree_slim_mutex_try_lock is just pthread_mutex_trylock/std::mutex::try_lock/etc and should have the same behavior. the queue try_steal is the same (as it's based on those). so sounds like a comment improvement and fixing the test is the way to go - I think the test was assuming it couldn't fail in these cases as it really shouldn't be possible, but apparently it is here and that's unfortunate on arm.

we don't want to lose test coverage because of this quirk, so like stella says maybe just loop a bunch? should be fast (compared to our other tests that compile/run entire models).

(really great investigation all - I learned a bunch!)

Yes, let's make the test loop. If the impl was switched to compare_exchange_strong, that would amount to bringing that loop inside the impl. Instead we're just adding the loop to the test. I think we can just let them loop indefinitely, just like compare_exchange_strong would. Worst case, there are test timeouts --- but that would only happen if there's a bug. Contention and implementation details would only result in a small number of loop iterations.
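For example (a fragment-level sketch of the idea, not necessarily what #15636 does), the failing assertion in queue_test.cc could become:

// Loop until the try-steal succeeds; spurious failures just cost extra
// iterations, and a real bug would surface as a test timeout.
iree_task_t* stolen = NULL;
while (!stolen) {
  stolen = iree_task_queue_try_steal(&source_queue, &target_queue, 1000);
}
EXPECT_EQ(&task_c, stolen);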

@RoboTux @freddan80 #15636 is out; it's something we probably want to merge anyway, since we now have a pretty clear understanding that it's needed, and merging it will make it even easier to retry. But it will be very interesting to hear a confirmation that it actually fixes the problem - feel free to reopen if it doesn't.

Yes, let's make the test loop. If the impl was switched to compare_exchange_strong, that would amount to bringing that loop inside the impl. Instead we're just adding the loop to the test. I think we can just let them loop indefinitely, just like compare_exchange_strong would. Worst case, there are test timeouts --- but that would only happen if there's a bug. Contention and implementation details would only result in a small number of loop iterations.

That's what I would have done myself: loop and rely on the CTest timeout for the worst case where, for some weird reason, the lock can never be acquired in a reasonable time. I'd argue that if no timeouts happen for months and we suddenly start having some, it indicates the machine might have suddenly gotten very busy - or something else worth investigating.

@bjacob @benvanik Fix seems to work 👍 Nice team effort getting this fixed. I learned a lot debugging this issue!