flame / blis

BLAS-like Library Instantiation Software Framework


Poor DGEMM performance for armsve build on Neoverse N2

chrisgoodyer opened this issue · comments


Whilst doing some comparative benchmarking on the Alibaba Cloud g8m instances I've run into some BLIS performance issues. g8m is based on Arm's Neoverse N2 technology and has 2x128-bit SVE vectors.

When I build for the "armsve" target I get a peak of between 5 and 6 GFLOPS on a single core, rather than the 20 GFLOPS I get from the Neon implementation.
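For anyone reproducing these numbers, a minimal sketch of how a DGEMM rate is usually derived from a timing, assuming the conventional count of 2·m·n·k floating-point operations. The function names, problem size and timings below are hypothetical and chosen only to land on the figures quoted above; they are not measurements from this issue.

```python
def dgemm_flops(m: int, n: int, k: int) -> int:
    """Conventional operation count for C := alpha*A*B + beta*C.

    Each of the m*n output elements needs k multiply-add pairs,
    counted as 2 floating-point operations each.
    """
    return 2 * m * n * k


def gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved rate in GFLOPS for one DGEMM call that took `seconds`."""
    return dgemm_flops(m, n, k) / seconds / 1e9


# Hypothetical timings for a 2000 x 2000 x 2000 problem (not measured data).
for label, seconds in [("fast kernel", 0.8), ("slow kernel", 3.2)]:
    print(f"{label}: {gflops(2000, 2000, 2000, seconds):.1f} GFLOPS")
```

With this convention, a fourfold difference in wall-clock time for the same problem size corresponds directly to the 20 versus 5 GFLOPS gap.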

There seems to be an awful lot of time spent in the function "bli_dpackm_mrxk_armsve_ref" which makes me think it is packing incorrectly for the 128-bit vector length. Running on AWS Graviton3 instances (with a 256-bit vector length) does not show these issues.



I think, of the currently-available configs, that ThunderX2 should perform best on N2. The SVE kernels are tuned for 256+ bit so I think you really want a neon kernel. A "real" Neoverse N1 kernel/configuration should be in master shortly.

Good to hear about the N1 kernel coming to master. I also suggest building a 4x128 NEON kernel on the Neoverse V1 (AWS Graviton3). For GEMM, I don't see SVE128 having a significant advantage over NEON128. If you build a kernel that can feed four NEON SIMD units it should run very well on all known Arm server-class CPUs, even if they don't have wide SVE units.

Apologies for this late response.

For Graviton 3, 2xSVE256 does better than 4xNEON by about 2% or so.
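As a rough cross-check, here is a sketch of the nominal double-precision peak per cycle for the pipe layouts mentioned in this thread. It assumes each SIMD pipe can issue one FMA per cycle and that an FMA counts as 2 flops per 64-bit lane; the helper name is hypothetical, and real throughput also depends on latency hiding, packing and memory behaviour.

```python
def peak_dp_flops_per_cycle(pipes: int, vector_bits: int) -> int:
    """Nominal double-precision flops per cycle for one core.

    Assumption: every pipe issues one FMA per cycle, and each FMA
    performs a multiply and an add (2 flops) on every 64-bit lane.
    """
    lanes = vector_bits // 64  # doubles per vector register
    return pipes * lanes * 2


print(peak_dp_flops_per_cycle(2, 128))  # 2 x 128-bit (Neoverse N2 as described above) -> 8
print(peak_dp_flops_per_cycle(4, 128))  # 4 x 128-bit NEON -> 16
print(peak_dp_flops_per_cycle(2, 256))  # 2 x 256-bit SVE -> 16
```

Under these assumptions 4xNEON128 and 2xSVE256 have the same nominal peak, which is consistent with the two differing by only a few percent in practice.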

armsve is not well suited to 128-bit vectors: its kernels do not use indexed FMA, which reduces their capacity to hide instruction latency. Even so, 5~6 GFLOPS is unexpected (it should be around 15). A possible reason is that your Neoverse N2 core does not implement hardware prefetching, which kernels/armsve presumes. I do not know how Alibaba Cloud differs from, say, Amazon C7g or Oracle Ampere, but using the NEON kernels should work well on your machine.