flame / blis

BLAS-like Library Instantiation Software Framework

Performance testing on AMD EPYC 7502 (Zen2)

bartoldeman opened this issue

Hi,

FYI, I was curious how BLIS fares now against slightly newer versions of MKL/OpenBLAS and also AMD's fork.

Zen2

Zen2 experiment details

  • Location: https://docs.computecanada.ca/wiki/B%C3%A9luga/en
  • Processor model: AMD Epyc 7502 (Zen2 "Rome")
  • Core topology: two sockets, 4 Core Complex Dies (CCDs) per socket, 2 Core Complexes (CCX) per CCD, 4 cores per CCX, 64 cores total
  • SMT status: disabled
  • Max clock rate: 2.5GHz (base, documented); 3.35GHz boost (single-core, documented)
  • Max vector register length: 256 bits (AVX2)
  • Max FMA vector IPC: 2
    • Alternatively, FMA vector IPC is 4 when vectors are limited to 128 bits each.
  • Peak performance:
    • single-core: 53.6 GFLOPS (double-precision), 107.2 GFLOPS (single-precision)
    • multicore (estimated): 40 GFLOPS/core (double-precision), 80 GFLOPS/core (single-precision) (the arithmetic behind these figures is spelled out just after this list)
  • Operating system: CentOS 7.9.2009 + Gentoo Prefix (May 2020)
  • Page size: 4096 bytes
  • Compiler: gcc 10.3.0
  • Results gathered: 23 September 2021
  • Implementations tested:
    • BLIS 52f29f7 (0.8.1+) and AMD BLIS 3.0.1
      • configured with ./configure -t openmp auto (single- and multithreaded)
      • sub-configuration exercised: zen2
      • Single-threaded (1 core) execution requested via no change in environment variables
      • Multithreaded (32 core) execution requested via export BLIS_JC_NT=2 BLIS_IC_NT=4 BLIS_JR_NT=4
      • Multithreaded (64 core) execution requested via export BLIS_JC_NT=4 BLIS_IC_NT=4 BLIS_JR_NT=4 (the same per-loop factors can also be set at runtime; see the sketch after this list)
    • OpenBLAS 0.3.17
      • compiled with make -j 8 libs netlib shared DYNAMIC_ARCH=1 DYNAMIC_LIST="HASWELL ZEN SKYLAKEX" NUM_THREADS=64 BINARY='64' CC='gcc' FC='gfortran' MAKE_NB_JOBS='-1' USE_OPENMP='1' USE_THREAD='1' CFLAGS='-O2 -ftree-vectorize -march=core-avx2 -fno-math-errno'
      • Single-threaded (1 core) execution requested via export OMP_NUM_THREADS=1
      • Multithreaded (32 core) execution requested via export OMP_NUM_THREADS=32
      • Multithreaded (64 core) execution requested via export OMP_NUM_THREADS=64
    • MKL 2021 update 2
      • Single-threaded (1 core) execution requested via export MKL_NUM_THREADS=1
      • Multithreaded (32 core) execution requested via export MKL_NUM_THREADS=32
      • Multithreaded (64 core) execution requested via export MKL_NUM_THREADS=64
  • Affinity:
    • Thread affinity was specified manually via GOMP_CPU_AFFINITY="0-63".
    • Single-threaded and 64-core executables were run through numactl --interleave=all; 32-core (single-socket) runs through numactl --cpubind=0 --membind=0 to force execution on the first socket only.
  • Frequency throttling (via cpupower):
    • Driver: acpi-cpufreq
    • Governor: performance
    • Hardware limits (steps): 1.5GHz, 2.2GHz, 2.5GHz
    • Adjusted minimum: 1.5GHz
  • Comments:
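
(To spell out the peak-performance arithmetic above: one 256-bit FMA covers 4 doubles at 2 flops each, and with a max FMA vector IPC of 2 that gives 16 flops/cycle per core; so 3.35 GHz × 16 = 53.6 GFLOPS single-core at the boost clock, and 2.5 GHz × 16 = 40 GFLOPS/core at the base clock. Single precision doubles both figures.)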

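As an aside on the BLIS threading setup above: the same per-loop parallelization can be requested at runtime rather than through environment variables. A minimal sketch, assuming the bli_thread_set_ways() global runtime API documented in BLIS's Multithreading.md (its jc/pc/ic/jr/ir arguments mirror the BLIS_*_NT variables):

```c
/* Sketch: request BLIS's per-loop parallelization programmatically instead
 * of via BLIS_JC_NT/BLIS_IC_NT/BLIS_JR_NT. Assumes the global runtime API
 * from BLIS's Multithreading.md. */
#include "blis.h"

int main(void)
{
    /* Equivalent of BLIS_JC_NT=2 BLIS_IC_NT=4 BLIS_JR_NT=4 (2*4*4 = 32 threads):
     *                  jc  pc  ic  jr  ir  */
    bli_thread_set_ways( 2,  1,  4,  4,  1 );

    /* ... level-3 calls go here, e.g. bli_dgemm() or the BLAS dgemm_() ... */

    return 0;
}
```
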
Zen2 results

Plots inline (PNG). Legend: black = BLIS, green = AMD BLIS, red = OpenBLAS, blue = MKL.

  • Zen2 single-threaded: [plot: l3_perf_blg8599_nt1]
  • Zen2 multithreaded (32 cores): [plot: l3_perf_blg8599_jc2ic4jr4_nt32]
  • Zen2 multithreaded (64 cores): [plot: l3_perf_blg8599_jc4ic4jr4_nt64]

@bartoldeman thanks for collecting this data, it looks great! For level-3 BLAS operations, "vanilla" BLIS and AMD BLIS should be essentially the same, with perhaps slightly better performance from AMD BLIS for small non-GEMM operations (although I think the only optimization not yet ported back is for GEMMT, which you didn't test). It seems like there might be thermal throttling at play in some of these results, especially the complex operations. IIRC the test driver runs from large problems to small, so the dips in multithreaded complex gemm may be the processor lowering its frequency as it heats up. Thermal effects can also show up in other ways, e.g. if AMD BLIS is always tested after "vanilla" BLIS, it may get throttled more.

@fgvanzee this gives me an idea: what if we modified the test driver to also count cycles (e.g. rdtsc on x86) and print FLOPS/cycle in addition to GFLOPS?

Thanks for the feedback! From that explanation, though, I'm not sure what would affect single-threaded *trsm (m,n,k < 1000) or zherk at every thread count.

I also tested on Skylake-X (Intel Xeon Gold 6148, dual-socket, 2x20 cores); I'll attach the plots now and post the details later.
IMHO there are no big surprises here versus your tests: MKL wins overall but not everywhere, BLIS and AMD BLIS are pretty much the same everywhere, and the differences look quite noisy.

  • 1 thread: [plot: l3_perf_blg_nt1]
  • 1 socket (jc2ic10jr1_nt20): [plot: l3_perf_blg_jc2ic10jr1_nt20]
  • 2 sockets (jc4ic10jr1_nt40): [plot: l3_perf_blg_jc4ic10jr1_nt40]

Yeah our SKX performance is not stellar. I wrote that kernel so it's totally my fault! 😄

> @fgvanzee this gives me an idea: what if we modified the test driver to also count cycles (e.g. rdtsc on x86) and print FLOPS/cycle in addition to GFLOPS?

rdtsc is actually a measurement of time, not clock cycles! (It returns the number of nominal clock cycles elapsed, regardless of turbo boost, throttling, etc.)
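
To make that concrete, here's a minimal sketch (assuming GCC/Clang on x86-64; the 2.5 GHz TSC rate is an assumption based on this machine's base clock): the tick delta divided by the TSC frequency is just wall time, so FLOPS per rdtsc "cycle" would be GFLOPS rescaled by a constant.

```c
/* Sketch: __rdtsc() counts nominal (invariant-TSC) cycles, which tick at a
 * fixed rate regardless of turbo or throttling -- so it is really a clock,
 * not a core-cycle counter. Assumes GCC/Clang on x86-64. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h> /* __rdtsc() */

int main(void)
{
    uint64_t t0 = __rdtsc();

    /* ... kernel under test would go here ... */
    volatile double acc = 0.0;
    for (long i = 0; i < 100000000L; i++) acc += 1.0;

    uint64_t t1 = __rdtsc();

    /* Assumption: the TSC ticks at this machine's 2.5 GHz base clock. */
    const double tsc_hz = 2.5e9;
    printf("%llu nominal cycles = %.4f s of wall time\n",
           (unsigned long long)(t1 - t0), (double)(t1 - t0) / tsc_hz);
    return 0;
}
```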

That is an insane design decision. Isn't there another easy instruction to read the actual number of cycles elapsed? It would be a pain to have to hook into PAPI or something.

Looks like you have to use the PMU, so PAPI or similar is the only portable way. Too bad.
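
On Linux specifically, the PMU cycle counter can be read without pulling in all of PAPI by going through the raw perf_event_open(2) syscall (which is what PAPI wraps underneath). A minimal, Linux-only sketch for counting actual core cycles:

```c
/* Sketch: count *actual* core cycles on Linux via perf_event_open(2).
 * Needs permission to use perf (see /proc/sys/kernel/perf_event_paranoid). */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* Count for this thread, on any CPU it runs on. */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... kernel under test would go here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles;
    if (read(fd, &cycles, sizeof(cycles)) != sizeof(cycles)) {
        perror("read");
        return 1;
    }
    printf("actual core cycles: %llu\n", (unsigned long long)cycles);
    close(fd);
    return 0;
}
```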

Sorry for being the bearer of bad news. I don't know a better way.

gemm, trsm, and gemmt are improved on Zen2/Zen3 and will be released as part of the upcoming AMD BLIS release.