flame / blis

BLAS-like Library Instantiation Software Framework


[aarch64] possible issue with atomic barrier and generic implementation (lack of good atomic support in the generic kernel?)

egaudry opened this issue · comments

I've been observing strange behavior when launching jobs on an Ampere Altra processor (an Arm Neoverse-N1 based design) with multiple threads. The symptoms are the same with both pthread and OpenMP threading: when the computation runs with multiple threads, it is likely to deadlock when a dgemm_ call is made.

Looking at the stack traces when the dgemm_ call is made, I can see a large number of threads stuck there (see the extract below, with thread numbers). In this example, I was running BLIS with 64 threads on an 80-core processor.

Thread 1103 (Thread 0xffe2b1fcc5f0 (LWP 3186538) "actranpy"):
#0  0x0000ffffb3f9a018 in bli_thrcomm_barrier_atomic () from /bin/../lib/libblis.so.4
#1  0x0000ffffb3f99f8c in bli_thrcomm_bcast () from /bin/../lib/libblis.so.4
#2  0x0000ffffb3eee56c in bli_packm_alloc () from /bin/../lib/libblis.so.4
#3  0x0000ffffb3eefe2c in bli_packm_init () from /bin/../lib/libblis.so.4
#4  0x0000ffffb3eee638 in bli_packm_blk_var1 () from /bin/../lib/libblis.so.4
#5  0x0000ffffb3eefecc in bli_packm_int () from /bin/../lib/libblis.so.4
#6  0x0000ffffb3f1b784 in bli_l3_packb () from /bin/../lib/libblis.so.4
#7  0x0000ffffb3f1a398 in bli_l3_int () from /bin/../lib/libblis.so.4
#8  0x0000ffffb3f2d850 in bli_gemm_blk_var3 () from /bin/../lib/libblis.so.4
#9  0x0000ffffb3f1a398 in bli_l3_int () from /bin/../lib/libblis.so.4
#10 0x0000ffffb3f2d6fc in bli_gemm_blk_var2 () from /bin/../lib/libblis.so.4
#11 0x0000ffffb3f1a398 in bli_l3_int () from /bin/../lib/libblis.so.4
#12 0x0000ffffb3f99a30 in bli_l3_thread_entry () from /bin/../lib/libblis.so.4
#13 0x0000ffffb3627898 in start_thread () from /lib64/libpthread.so.0
#14 0x0000ffffb3571ddc in thread_start () from /lib64/libc.so.6
[...]
Thread 667 (Thread 0xffe44e2ec5f0 (LWP 3220206) "actranpy"):
#0  0x0000ffffb3f9a018 in bli_thrcomm_barrier_atomic () from /bin/../lib/libblis.so.4
#1  0x0000ffffb3f99f8c in bli_thrcomm_bcast () from /bin/../lib/libblis.so.4
#2  0x0000ffffb3f9b94c in bli_thrinfo_create_for_cntl () from /bin/../lib/libblis.so.4
#3  0x0000ffffb3f9ba5c in bli_thrinfo_rgrow () from /bin/../lib/libblis.so.4
#4  0x0000ffffb3f9bcb0 in bli_thrinfo_grow () from /bin/../lib/libblis.so.4
#5  0x0000ffffb3f1a374 in bli_l3_int () from /bin/../lib/libblis.so.4
#6  0x0000ffffb3f2d6fc in bli_gemm_blk_var2 () from /bin/../lib/libblis.so.4
#7  0x0000ffffb3f1a398 in bli_l3_int () from /bin/../lib/libblis.so.4
#8  0x0000ffffb3f99a30 in bli_l3_thread_entry () from /bin/../lib/libblis.so.4
#9  0x0000ffffb3627898 in start_thread () from /lib64/libpthread.so.0
#10 0x0000ffffb3571ddc in thread_start () from /lib64/libc.so.6

The original call is:

Thread 1 (Thread 0xffffb1fa3850 (LWP 3137984) "actranpy"):
#0  0x0000ffffb3f9a018 in bli_thrcomm_barrier_atomic () from /bin/../lib/libblis.so.4
#1  0x0000ffffb3f9b998 in bli_thrinfo_create_for_cntl () from /bin/../lib/libblis.so.4
#2  0x0000ffffb3f9bcb0 in bli_thrinfo_grow () from /bin/../lib/libblis.so.4
#3  0x0000ffffb3f1a374 in bli_l3_int () from /bin/../lib/libblis.so.4
#4  0x0000ffffb3f99a30 in bli_l3_thread_entry () from /bin/../lib/libblis.so.4
#5  0x0000ffffb3f99bf0 in bli_l3_thread_decorator () from /bin/../lib/libblis.so.4
#6  0x0000ffffb3f2e0b4 in bli_gemm_front () from /bin/../lib/libblis.so.4
#7  0x0000ffffb3f1a6d8 in bli_gemm_ex () from /bin/../lib/libblis.so.4
#8  0x0000ffffb3f62834 in dgemm_ () from /bin/../lib/libblis.so.4

How can I help identify the underlying problem?

Can you write a stand-alone test for bli_thrcomm_barrier_atomic?

Hi Jeff, I know I should, but I will have a hard time doing it right now.
Are you asking because you believe the issue is on the caller side?

I'm indeed able to run the tests with BLIS_NUM_THREADS set to 32 without encountering any issue.

$ BLIS_NUM_THREADS=32 make test
Running test_libblis.x with output redirected to 'output.testsuite'
check-blistest.sh: All BLIS tests passed!
Running cblat1.x > 'out.cblat1'
Running cblat2.x < './blastest/input/cblat2.in' (output to 'out.cblat2')
Running cblat3.x < './blastest/input/cblat3.in' (output to 'out.cblat3')
Running dblat1.x > 'out.dblat1'
Running dblat2.x < './blastest/input/dblat2.in' (output to 'out.dblat2')
Running dblat3.x < './blastest/input/dblat3.in' (output to 'out.dblat3')
Running sblat1.x > 'out.sblat1'
Running sblat2.x < './blastest/input/sblat2.in' (output to 'out.sblat2')
Running sblat3.x < './blastest/input/sblat3.in' (output to 'out.sblat3')
Running zblat1.x > 'out.zblat1'
Running zblat2.x < './blastest/input/zblat2.in' (output to 'out.zblat2')
Running zblat3.x < './blastest/input/zblat3.in' (output to 'out.zblat3')
check-blastest.sh: All BLAS tests passed!

I'm also wondering about the right options for building BLIS on aarch64 (ARMv8.2+ architectures), as it seems -moutline-atomics would be needed whenever the -mcpu option is not used. I've been using the arm64 configuration family from master, to get the proper kernel detection at runtime, but I may have misunderstood its internal workings.

It was indeed linked to the choice made at build time: building arm64 means cortexa57, cortexa53, firestorm, etc., plus generic.
I believe the "generic" implementation was chosen at runtime (the core is a Neoverse, which might not be detected "properly"), without proper atomics support (via -moutline-atomics, for instance).
When building the thunderx2 flavor of BLIS directly, I no longer have this issue.

I think this issue should remain open until the build system handles it properly.

Also, a friend with expertise told me:

compile with -mcpu=neoverse-n1 to get LSE atomics. no need for the outline atomics

Jeff, when building the different kernels for arm64 (i.e., all architectures with dynamic dispatch), the build system doesn't use the proper flags for every object it builds. If you added -mcpu=neoverse-n1 as a flag for all of them, there would be a conflict, since a specific CPU target is already set for each kernel.

@jeffhammond, @egaudry is correct: when compiling for arm64, one is implicitly saying the library should run on any arm64 hardware, some of which, I guess, does not have hardware atomics, and certainly not all of which will work with Neoverse-specific flags. Am I correct that the right alternative is to add -moutline-atomics for generic arm64 code?

@egaudry what compiler is this? gcc docs say -moutline-atomics is the default.

@devinamatthews I'm using gcc-10.3.0 (built from source). I'm not sure -moutline-atomics is the default behavior, judging from https://gcc.gnu.org/gcc-10/changes.html.

Regarding your question about adding -moutline-atomics for generic arm64 code (if the flag is not enabled by default), I believe every part of the library that is not built with a specific -mcpu would need it.

This is my understanding (please note that I'm not an ARM expert), based on reading the build system code used to build OpenBLAS, where different -mcpu and build options are used for various objects/source files/kernels.

I am now wondering if there is something else going on as well.
I've just rebuilt the master branch using GCC 11.2 and the thunderx2 variant, and I'm encountering the same issue (in bli_thrcomm_barrier_atomic).

The current code relies on befaee6 I believe.

ARMv8 introduced 64-bit. ARMv8-A is the variant relevant to us, since I suspect real-time or microcontroller implementations break other aspects of BLIS (e.g. the C runtime library, especially heap allocation).

ARMv8-A has atomics (https://developer.arm.com/documentation/den0024/a/Memory-Ordering/Barriers/One-way-barriers) just not the better LSE ones added in ARMv8.1.

As far as I know, there isn't a mainstream multicore processor in use today that doesn't have atomic operations.

Thanks Jeff.
I've rebuilt BLIS master (thunderx2) with gcc-11 on Debian 11 with kernel 5.10, and I'm still seeing the issue in the atomic barrier mentioned when I created this issue.
Any pointers on what to check or how to debug this?

In any case, the code uses the GCC __atomic intrinsics, which usually map to hardware, but could map to some awful emulated implementation if there existed a CPU that didn't have the necessary hardware. For example, on PowerPC, an operation like __atomic_fetch_xor will not compile to a single instruction but to a loop around LL/SC.
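As a minimal illustration (not BLIS code, just the kind of builtin in question): with GCC on aarch64 this compiles to a single LSE instruction when LSE is available (e.g. via -mcpu=neoverse-n1 or through the outline-atomics helpers), and to an LL/SC retry loop otherwise.

#include <stdint.h>

// Atomically toggle the low bit of a shared flag. Depending on the
// target, GCC lowers this to one atomic instruction (LSE, ARMv8.1+)
// or to a retry loop around load-linked/store-conditional.
uint64_t toggle_flag( uint64_t* flag )
{
    return __atomic_fetch_xor( flag, 1, __ATOMIC_ACQ_REL );
}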

I think I have a TX2 somewhere. Let me see what I can do.

Just to clarify @jeffhammond, I'm not using a thunderx2 processor, but a neoverse-n1 based one. I'm using the thunderx2 flavor of BLIS as it is the most recent one available that matches this core. I'm also relying on the build system available in BLIS, and I covered the basics using various build options (for atomics).

And thanks for your help :)

Just out of curiosity, does the BLIS build that comes with the OS package manager work?

No, it doesn't, at least with my current setup:

$uname -a
Linux begui-armts21 5.10.0-10-arm64 #1 SMP Debian 5.10.84-1 (2021-12-08) aarch64 GNU/Linux
$apt-file list libblis3-openmp
libblis3-openmp: /usr/lib/aarch64-linux-gnu/blis-openmp/libblas.so
libblis3-openmp: /usr/lib/aarch64-linux-gnu/blis-openmp/libblas.so.3
libblis3-openmp: /usr/lib/aarch64-linux-gnu/blis-openmp/libblis.so.3
libblis3-openmp: /usr/share/doc/libblis3-openmp/changelog.Debian.gz
libblis3-openmp: /usr/share/doc/libblis3-openmp/changelog.gz
libblis3-openmp: /usr/share/doc/libblis3-openmp/copyright
libblis3-openmp: /usr/share/lintian/overrides/libblis3-openmp
$apt show libblis3-openmp
Package: libblis3-openmp
Version: 0.8.0-1
Priority: optional
Section: libs
Source: blis
Maintainer: Debian Science Maintainers <debian-science-maintainers@lists.alioth.debian.org>
Installed-Size: 4,750 kB
Provides: libblas.so.3, libblis.so.2
Depends: libc6 (>= 2.29), libgomp1 (>= 4.9)
Breaks: libblis2-openmp
Replaces: libblis2-openmp
Homepage: https://github.com/flame/blis
Tag: role::shared-lib
Download-Size: 838 kB
APT-Manual-Installed: yes
APT-Sources: http://deb.debian.org/debian bullseye/main arm64 Packages
Description: BLAS-like Library Instantiation Software Framework (32bit,openmp)
 BLIS is a portable software framework for instantiating high-performance
 BLAS-like dense linear algebra libraries. The framework was designed to
 isolate essential kernels of computation that, when optimized, immediately
 enable optimized implementations of most of its commonly used and
 computationally intensive operations. BLIS is written in ISO C99 and available
 under a new/modified/3-clause BSD license. While BLIS exports a new BLAS-like
 API, it also includes a BLAS compatibility layer which gives application
 developers access to BLIS implementations via traditional BLAS routine calls.
 An object-based API is also available for more experienced users.
 .
 The package contains the (-t openmp) version of shared library.

So far, I see no failures.

# Compiler
gcc version 11.1.0 (Ubuntu 11.1.0-1ubuntu1~20.04) 
# OS
Linux 5.4.0-88-generic #99-Ubuntu SMP Thu Sep 23 17:30:35 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
# Configure
CC=gcc-11 CXX=g++-11 ./configure -t pthread thunderx2
# Test
BLIS_NUM_THREADS=64 make run

What test are you running that fails?

I cannot replicate the issue using the embedded test suite.
I can replicate it using the test suite of a large HPC numerical simulation package.

FWIW, I tested with openblas-0.3.13 (using the deb package) and the issue is gone, at least I don't see it anymore.

I can't do much without a reproducer. If it's something you can share privately, send me an email.

You might try this, which will use sequentially consistent atomics; they are likely slower, but might reveal a defect in the implementation.

diff --git a/frame/thread/bli_thrcomm.c b/frame/thread/bli_thrcomm.c
index ef46a7ad..0c5727ff 100644
--- a/frame/thread/bli_thrcomm.c
+++ b/frame/thread/bli_thrcomm.c
@@ -54,7 +54,7 @@ void* bli_thrcomm_bcast
 }
 
 // Use __sync_* builtins (assumed available) if __atomic_* ones are not present.
-#ifndef __ATOMIC_RELAXED
+#if 1 //ndef __ATOMIC_RELAXED
 
 #define __ATOMIC_RELAXED
 #define __ATOMIC_ACQUIRE

Doesn't seem to help, sorry.
Is there a (hidden) way to avoid relying on GNU atomics?

If you use Pthreads and set BLIS_USE_PTHREAD_BARRIER then that should use the slow and dumb Pthread barrier, which goes into the kernel. That should not have a race in it.
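For reference, a minimal standalone sketch of what the pthread-barrier path amounts to (illustrative only, not the BLIS implementation; build with -pthread):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_barrier_t barrier;

static void* worker( void* arg )
{
    // Blocks in the kernel (a futex on Linux) until all NTHREADS
    // threads have arrived; no spin-waiting, but much slower than
    // an atomic spin barrier.
    pthread_barrier_wait( &barrier );
    return NULL;
}

int main( void )
{
    pthread_t threads[ NTHREADS ];
    pthread_barrier_init( &barrier, NULL, NTHREADS );
    for ( int i = 0; i < NTHREADS; i++ )
        pthread_create( &threads[ i ], NULL, worker, NULL );
    for ( int i = 0; i < NTHREADS; i++ )
        pthread_join( threads[ i ], NULL );
    pthread_barrier_destroy( &barrier );
    printf( "all %d threads passed the barrier\n", NTHREADS );
    return 0;
}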

Have you tried other GCC versions or Clang (or even NVHPC C/C++) to see if it's a compiler issue? I can't remember if the ARM compiler requires a license or not.

I tried GCC 10.2 and GCC 11.2 (custom builds or OS-packaged versions).

I've just tried building master with clang-11.0.1, but it failed:

Compiling obj/thunderx2/kernels/armv8a/3/sup/bli_gemmsup_rv_armv8a_asm_d4x8m.o ('thunderx2' CFLAGS for kernels)
kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c:171:49: fatal error: invalid symbol redefinition
        "                                            \n\t"
                                                       ^
<inline asm>:90:5: note: instantiated into assembly here
           .SLOOPKITER:
           ^
1 error generated.

BLIS_USE_PTHREAD_BARRIER

bli_thrcomm_pthreads.c needs an update to be able to build (missing ifdef):

frame/thread/bli_thrcomm.c: In function ‘bli_thrcomm_barrier_atomic’:
frame/thread/bli_thrcomm.c:89:44: error: ‘thrcomm_t’ {aka ‘struct thrcomm_s’} has no member named ‘barrier_sense’
   89 |  gint_t orig_sense = __atomic_load_n( &comm->barrier_sense, __ATOMIC_RELAXED );

Patch:

--- a/frame/thread/bli_thrcomm.c  2022-01-11 01:32:25.000000000 +0100
+++ b/frame/thread/bli_thrcomm.c       2022-01-13 15:44:06.722052670 +0100
@@ -53,6 +53,7 @@
        return object;
 }

+#ifndef BLIS_USE_PTHREAD_BARRIER
 // Use __sync_* builtins (assumed available) if __atomic_* ones are not present.
 #ifndef __ATOMIC_RELAXED

@@ -116,3 +117,4 @@
        }
 }

+#endif

With BLIS master built using BLIS_USE_PTHREAD_BARRIER, the issue is indeed gone; the downside is that operations are now very slow, even with 2 threads.

OK. The other thing I might try is to set BLIS_NUM_STATIC_COMMS to zero, so that all thrcomm_t objects are allocated thread-safely on the heap instead of statically. I don't know your usage, but if you have multiple threads entering BLIS and multiple threads hitting a barrier at the same time, then this might be an issue.

This didn't work either.
I'm sorry I'm not able to share a reproducer, Jeff.

I believe I'm entering some OpenMP-locked part of the code where BLIS is asked to work multithreaded, then leaving for some other OpenMP parallel part of the code afterwards.

I fixed the Clang assembly issue. Will PR shortly.

Is the reproducer proprietary to some company that might have sufficient legal agreements with my employer to move forward on this?

Or even Ampere, in which case I can connect you to people there.

It's proprietary, and there's no NDA signed with your current employer.
You can connect me with Ampere, sure!

I fixed the Clang assembly issue. Will PR shortly.

When building with clang-11.0.1 with the missing ifdef from your PR, an OpenMP build of BLIS (without any other change) does work as expected.

Sorry, I spoke too soon; the issue now occurs randomly.

(gdb) where
#0  0x0000ffff9309b08c in bli_thrcomm_barrier_atomic () from /bin/../lib/libblis.so.4
#1  0x0000ffff9309d650 in bli_thrinfo_create_for_cntl () from /bin/../lib/libblis.so.4
#2  0x0000ffff9309d4fc in bli_thrinfo_rgrow () from /bin/../lib/libblis.so.4
#3  0x0000ffff9309d928 in bli_thrinfo_grow () from /bin/../lib/libblis.so.4
#4  0x0000ffff92fdf7c4 in bli_l3_int () from /bin/../lib/libblis.so.4
#5  0x0000ffff92ff21a4 in bli_gemm_blk_var2 () from /bin/../lib/libblis.so.4
#6  0x0000ffff92fdf7e8 in bli_l3_int () from /bin/../lib/libblis.so.4
/bin/../lib/libblis.so.4
#8  0x0000ffff921becbc in __kmp_invoke_microtask () from /lib/aarch64-linux-gnu/libomp.so.5

I have determined the issue here is the lack of proper backoff in the spin-wait loop of the atomic barrier.

When I stick a sched_yield() in the body, the time drops from 2-8 hours down to 2 minutes, relative to a single-threaded time of less than 10 seconds.

// If the current thread is NOT the last thread to have arrived, then
// it spins on the sense variable until that sense variable changes at
// which time these threads will exit the barrier.
while ( __atomic_load_n( &comm->barrier_sense, __ATOMIC_ACQUIRE ) == orig_sense )
     ; // Empty loop body.
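For reference, a minimal sketch of the yielding variant described above (the same loop with the backoff added; not the final patch):

#include <sched.h>  // for sched_yield()

// Same spin loop as above, but yield the core back to the scheduler
// on each failed check instead of hammering the shared cache line.
while ( __atomic_load_n( &comm->barrier_sense, __ATOMIC_ACQUIRE ) == orig_sense )
    sched_yield();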

Is that the core issue or only an exacerbating factor? BTW I started to address this in #82 but never finished it up.

Also, is -moutline-atomics a red herring or does that also fix it?

Lastly, from the original post it seemed like spawning way too many threads was also part of, or all of, the problem. Is that the case?

Is that the core issue or only an exacerbating factor? BTW I started to address this in #82 but never finished it up.

@devinamatthews I believe there would be value in offering, optionally, the yielding feature you were working on.

I might be wrong, but current OpenBLAS (which didn't show the stall behavior) uses:

#if defined(ARMV7) || defined(ARMV6) || defined(ARMV8) || defined(ARMV5)
#define YIELDING        __asm__ __volatile__ ("nop;nop;nop;nop;nop;nop;nop;nop; \n");
#endif

Merging #82 plus some more work would allow for debugging experiments I guess.

Looking at the CHANGELOG, it seems that befaee6 from @fgvanzee touched the relevant part of the code dealing with the barrier.

Lastly, from the original post it seemed like spawning way too many threads was also part of, or all of, the problem. Is that the case?

This is my limited understanding indeed.

After spending too much time trying to understand why BLIS would stall in my scenarios, I finally revisited our own sources and realized that we were actually oversubscribing. This had gone unnoticed in our x86_64 build linked against another BLAS implementation, which, under the hood, tries to cope with the following common scenario:

#pragma omp parallel num_threads( N )   // N > 1
  #pragma omp for
     gemm( num_threads > 1 )             // each call itself requests multiple threads
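For illustration, the caller-side fix might look like the sketch below (assuming BLIS's bli_thread_set_num_threads() API; the point is simply that the inner gemm calls must run single-threaded when the outer region is already parallel):

#include <omp.h>
#include "blis.h"

void run_jobs( int njobs )
{
    // The outer OpenMP region already occupies the cores, so force
    // each BLIS call made inside it to run on a single thread.
    bli_thread_set_num_threads( 1 );

    #pragma omp parallel
    {
        #pragma omp for
        for ( int i = 0; i < njobs; i++ )
        {
            // dgemm_( ... );  // one thread per call, no oversubscription
        }
    }
}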

I'm to blame, sorry about that @devinamatthews @jeffhammond.

I learned a lot here, and there is a real problem in the barrier that should be fixed. We need a sched_yield() there to improve QoS when threads collide.

I can also try to mitigate the oversubscription issue the way MKL does. That might take longer.

@jeffhammond I'm not 100% convinced sched_yield is always the best solution, but yes, we need SOMETHING there. Can you open a separate issue on this?

Detection of OpenMP parallel regions would also be great. Another issue please?
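For what it's worth, a hypothetical sketch of such detection using the standard OpenMP runtime API (omp_in_parallel() is standard; the helper name is made up for illustration):

#include <omp.h>

// Hypothetical helper: returns nonzero when called from inside an
// active OpenMP parallel region, in which case a library like BLIS
// could fall back to a single thread to avoid oversubscription.
static int inside_active_parallel_region( void )
{
    return omp_in_parallel();
}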