flame / blis

BLAS-like Library Instantiation Software Framework

Poor DGEMM performance for armsve build on Neoverse N2

chrisgoodyer opened this issue

Hi.

Whilst doing some comparative benchmarking on the Alibaba Cloud g8m instances I've run into some BLIS performance issues. g8m is based on Arm's Neoverse N2 technology and has 2x128-bit SVE vectors.

When I build for the "armsve" target, I get a peak of between 5 and 6 GFLOPS on a single core, rather than the 20 GFLOPS I get from the NEON implementation.
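For anyone wanting to reproduce figures like these, DGEMM rates are conventionally computed from a count of 2*m*n*k floating-point operations divided by wall time. A minimal sketch of that arithmetic follows; the function name, matrix sizes, and timing are illustrative assumptions, not values from the benchmark above.

```python
def dgemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved GFLOPS for one C = A*B + C update of size m x n x k.

    Uses the conventional count of 2*m*n*k floating-point operations
    (one multiply and one add per inner-product term).
    """
    flops = 2 * m * n * k
    return flops / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical timing: a 2000 x 2000 x 2000 DGEMM that took 0.8 s
    # works out to about 20 GFLOPS.
    print(f"{dgemm_gflops(2000, 2000, 2000, 0.8):.1f} GFLOPS")
```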

There seems to be an awful lot of time spent in the function "bli_dpackm_mrxk_armsve_ref" which makes me think it is packing incorrectly for the 128-bit vector length. Running on AWS Graviton3 instances (with a 256-bit vector length) does not show these issues.

Thanks.

Chris

I think that, of the currently available configs, ThunderX2 should perform best on N2. The SVE kernels are tuned for 256-bit and wider vectors, so I think you really want a NEON kernel. A "real" Neoverse N1 kernel/configuration should be in master shortly.
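The advice above amounts to choosing the configuration by SVE vector length. Here is a rough sketch of that check, assuming a Linux arm64 machine where the kernel exposes the default SVE vector length in bytes at `/proc/sys/abi/sve_default_vector_length`. The helper names are hypothetical, the 256-bit threshold simply encodes the comment above, and the default length for new processes can be set lower than the hardware maximum.

```python
from pathlib import Path
from typing import Optional

# Linux arm64 sysctl file; holds the default SVE vector length in bytes.
SVE_VL_SYSCTL = Path("/proc/sys/abi/sve_default_vector_length")


def sve_bits_from_text(text: str) -> int:
    """Convert the sysctl contents (a byte count) to a width in bits."""
    return int(text.strip()) * 8


def read_sve_bits(path: Path = SVE_VL_SYSCTL) -> Optional[int]:
    """Return the default SVE vector length in bits, or None if unavailable."""
    try:
        return sve_bits_from_text(path.read_text())
    except (OSError, ValueError):
        return None


def suggest_blis_config(sve_bits: Optional[int]) -> str:
    """Hypothetical rule of thumb for an Arm server, based on this thread:
    use armsve only for 256-bit or wider SVE, otherwise a NEON config."""
    if sve_bits is not None and sve_bits >= 256:
        return "armsve"
    return "thunderx2"


if __name__ == "__main__":
    bits = read_sve_bits()
    print(f"SVE bits: {bits}, suggested config: {suggest_blis_config(bits)}")
```

The suggested name would then be passed to BLIS's `./configure` as the configuration argument.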

Good to hear about the N1 kernel coming to master. I also suggest building a 4x128 NEON kernel on the Neoverse V1 (AWS Graviton3). For GEMM, I don't see SVE128 having a significant advantage over NEON128. If you build a kernel that can feed four NEON SIMD units it should run very well on all known Arm server-class CPUs, even if they don't have wide SVE units.

Apologies for this late response.

For Graviton 3, 2xSVE256 does better than 4xNEON by about 2% or so.
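A back-of-the-envelope check shows why the two options land so close together. The sketch below assumes each vector pipe can issue one double-precision FMA per cycle and counts an FMA as two flops; the function name is illustrative, and the pipe counts and widths are the ones quoted in this thread.

```python
def peak_dp_flops_per_cycle(pipes: int, vector_bits: int) -> int:
    """Theoretical double-precision flops per cycle for one core.

    Assumes every pipe issues one FMA per cycle on a full vector,
    with each FMA counted as 2 flops per 64-bit lane.
    """
    lanes = vector_bits // 64
    return pipes * lanes * 2


if __name__ == "__main__":
    for label, pipes, bits in [
        ("Graviton3, 2 x SVE256", 2, 256),
        ("Graviton3, 4 x NEON128", 4, 128),
        ("Neoverse N2, 2 x 128-bit", 2, 128),
    ]:
        print(f"{label}: {peak_dp_flops_per_cycle(pipes, bits)} flops/cycle")
    # Prints 16, 16 and 8 flops/cycle respectively.
```

Under these assumptions both Graviton3 options have the same theoretical peak, so a gap of a few percent comes down to kernel details rather than raw width.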

armsve is not well suited to 128-bit vectors: it does not use indexed FMA, which leaves the assembly less room to hide instruction latency. Even so, 5~6 GFLOPS is unexpected (it should be around 15). One possible reason is that your Neoverse N2 core does not implement the hardware prefetching that kernels/armsve presumes. I do not know how Alibaba Cloud differs from, say, Amazon C7g or Oracle Ampere, but the NEON kernels should work well on your machine.