flame / blis

BLAS-like Library Instantiation Software Framework

When could you support AMD Zen4 arch?

ltjsjyyy opened this issue · comments

Zen4 is already supported in AMD's fork of BLIS. We're in contact with AMD on coordinating how best to back-port these changes to BLIS master.

Hi. I've conducted some experiments using scripts from https://github.com/flame/blis/blob/master/docs/Performance.md and AMD's fork of BLIS. I tested only GEMM and only in multithread mode, as https://github.com/amd/blis/tree/master/test/3 output is not compatible with https://github.com/flame/blis/tree/master/test/3 , but this test was enough for initial needs.

My setup:

  • Processor model: AMD Ryzen 9 7950X3D (Zen4)
  • Core topology: one socket, 16 cores per socket, 32 hardware threads total
  • SMT status: enabled, used
  • OS: Gentoo
  • Compiler: Clang 17.0.3 (CC="clang" CXX="clang++" AR="llvm-ar" RANLIB="llvm-ranlib" ./configure -t openmp zen4)
  • Stock BLIS was compiled with zen3 kernels. In general, all libraries were compiled with flags native to zen4.
  • Versions:
    ** AMD/blis master a5a3c8b Mon Aug 7 13:48:54 2023
    ** flame/blis master f7ce54a Fri Nov 3 15:52:57 2023
    ** sci-libs/mkl-2023.0.0.25398
    ** sci-libs/openblas-0.3.23

Commands executed:

BLIS_NUM_THREADS=32     ./test_sgemm_5120_asm_blis_st.x  # amd-blis
BLIS_NUM_THREADS=32     ./test_gemm_blis_mt.x     -d s -c nn   -i native -p "256 5120 128" -r 3 -v
MKL_NUM_THREADS=32      ./test_gemm_vendor_mt.x   -d s -c nn   -i native -p "256 5120 128" -r 3 -v
OPENBLAS_NUM_THREADS=32 ./test_gemm_openblas_mt.x -d s -c nn   -i native -p "256 5120 128" -r 3 -v

BLIS_NUM_THREADS=32     ./test_dgemm_5120_asm_blis_st.x  # amd-blis
BLIS_NUM_THREADS=32     ./test_gemm_blis_mt.x     -d d -c nn   -i native -p "256 5120 128" -r 3 -v
MKL_NUM_THREADS=32      ./test_gemm_vendor_mt.x   -d d -c nn   -i native -p "256 5120 128" -r 3 -v
OPENBLAS_NUM_THREADS=32 ./test_gemm_openblas_mt.x -d d -c nn   -i native -p "256 5120 128" -r 3 -v
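The six flame/blis-style commands above differ only in the datatype, the thread-count variable, and the test binary, so they could be generated with a small loop. The following is a dry-run sketch that only prints the commands; the function name `print_gemm_cmds` is made up for illustration, and the two amd-blis binaries are left out because they take no such arguments in the commands shown.

```shell
#!/bin/sh
# Dry-run sketch: print the flame/blis-style benchmark commands listed above.
# print_gemm_cmds is a hypothetical helper; nothing is executed here.

print_gemm_cmds() {
    nt="$1"                      # number of threads, e.g. 32
    for dt in s d; do            # single and double precision
        # each entry is "thread-count variable:test binary"
        for pair in \
            "BLIS_NUM_THREADS:test_gemm_blis_mt.x" \
            "MKL_NUM_THREADS:test_gemm_vendor_mt.x" \
            "OPENBLAS_NUM_THREADS:test_gemm_openblas_mt.x"
        do
            var="${pair%%:*}"    # text before the colon
            bin="${pair#*:}"     # text after the colon
            printf '%s=%s ./%s -d %s -c nn -i native -p "256 5120 128" -r 3 -v\n' \
                "$var" "$nt" "$bin" "$dt"
        done
    done
}

print_gemm_cmds 32
```

Piping the output to `sh` from the directory containing the test binaries would run the commands in sequence.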

Results:
[image]

Comments: The AMD fork of BLIS, with its Zen4 kernels, significantly outperforms all other libraries on the AMD Ryzen 9 7950X3D (up to 2x). Vanilla BLIS is on par with OpenBLAS, but slower than MKL. There is a performance drop in MKL for some sizes, but it looks like a fluke (it disappears for larger sizes). When checking gemm for larger matrices (like 6000×6000), performance was the same for all 4 libraries (presumably due to a memory bottleneck on my system).

@AngryLoki Thank you for taking the time to gather, visualize, and share these performance results! Don't worry; a proper zen4 subconfiguration will be added to vanilla BLIS in the future. We are just overwhelmed with work these days relative to our resources. Thanks for your patience in the meantime. ❤️

PS: Please feel free to keep up with us in our Discord server, if you haven't already joined! 😄

@AngryLoki thank you for this information.

I am curious, did you also test AMD/blis compiled with AOCC? I've been experimenting with it on my system (Gentoo AMD 7840U) and it's performing well on certain tasks.

@HaukurPall, I checked sgemm (M=N=K) with gcc 13.2.1 (+full lto), clang 17.0.6, AOCC and rocm-llvm-alt. The results are almost the same.
[image: compilers]

I checked the code of AOCC and unfortunately I don't see any specific optimizations... AMD just shipped vanilla precompiled Clang and included some ROCm-related fixes (to make it work, not for optimization). They also added ROCm/llvm-project@0272bec: if you attempt to use -famd-opt, it tries to use the proprietary version of Clang, rocm-llvm-alt, which actually has some interesting optimizations. However, even after installing rocm-llvm-alt I was not able to increase performance for AOCL-BLAS. Anyway, ICX, AOCC and rocm-llvm-alt are basically Clang. With -flto they produce LLVM bitcode, which contains mostly x86-64 assembly of kernels, because Clang can't deconstruct inline asm back to optimizable LLVM representation.

Regarding my previous tests, I checked my approach more carefully and found a few mistakes on my side:

  • Specifying 32 threads on a 16-core (32 vCPU) CPU was a mistake. While performance seemed to be the same, the standard deviation was too large. After setting the thread count to 16, there is almost no variance (see image above).
  • Now tested with trunk OpenBLAS, trunk BLIS, trunk amd/blis, and MKL 2024.0.
  • Checked why MKL is so slow and, guess what, Intel did it again (as they always do): they effectively shipped "if cpu = zen: use slow code", extra megabytes specifically to degrade AMD performance. Following https://documentation.sigma2.no/jobs/mkl.html#forcing-mkl-to-use-best-performing-routines made MKL 2 times faster.
  • Updated results are in the image below. Everything was compiled with Clang and launched with OMP_NUM_THREADS=16 GOMP_CPU_AFFINITY=0-15
    [image: results]
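The thread settings from the list above (one thread per physical core, pinned to CPUs 0-15) can be derived from the core count. This is a minimal sketch, not the exact procedure used for the measurements: `thread_env` is a hypothetical helper, and it assumes that logical CPUs 0..N-1 belong to N distinct physical cores, which is worth checking with `lscpu -e` on a given machine.

```shell
#!/bin/sh
# Sketch: build the OpenMP thread settings from a physical core count.
# thread_env is a hypothetical helper. Assumption: logical CPUs 0..N-1 sit on
# N distinct physical cores (verify the CPU/CORE columns of `lscpu -e`).

thread_env() {
    cores="$1"                   # physical cores, e.g. 16 on a 7950X3D
    last=$((cores - 1))          # highest CPU id to pin to
    printf 'OMP_NUM_THREADS=%s GOMP_CPU_AFFINITY=0-%s\n' "$cores" "$last"
}

thread_env 16
```

The printed assignments can be passed to a benchmark through `env`, for example `env $(thread_env 16) ./test_gemm_blis_mt.x -d s -c nn -i native -p "256 5120 128" -r 3 -v`.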

BLIS is usually pretty insensitive to compiler since most of the work happens in the inline assembly kernels.

> With -flto they produce LLVM bitcode, which contains mostly x86-64 assembly of kernels, because Clang can't deconstruct inline asm back to optimizable LLVM representation.

I consider this a good thing since LLVM (and, to be fair, other compilers too) really makes a hash of C or intrinsics kernels due to a combination of poor register allocation and instruction ordering.

Glad to see that AOCL-BLIS is performing well for you though. As we work with AMD to backport their changes, BLIS will catch up.

@AngryLoki thank you so much for this, this answers a lot of questions!