flame / blis

BLAS-like Library Instantiation Software Framework

Reproducing the performance results on A64FX

zhuangbility111 opened this issue · comments

Hi, I have been looking at the performance results on A64FX and trying to reproduce them. According to the results you published, Fujitsu SSL2's sgemm performance is pretty good. Could you please tell me which sgemm subroutine you used in Fujitsu SSL2?

Maybe @xrq-phys would be able to answer this question, since he was the one who gathered the data.

Hi @zhuangbility111 .

I suppose you're asking about the SSL2 part, not BLIS, right?

The API used here is just standard BLAS API for SSL2.

SSL2 is completely proprietary, and I have no access to its internals. But according to my own observations on the Fugaku supercomputer (that machine has the most recent toolchain updates, as everyone'd expect), SSL2 had a performance update this March (toolchain version 1.30, FCC version 4.4 if I remember correctly) that raised the performance of virtually all gemm routines. Perhaps you may want to check the library version.

Another thing could be the clock. A64FX operates in one of 4 clock modes: 1.6, 1.8, 2.0 and 2.2 GHz. Those benchmarks were run at the maximum, 2.2 GHz. We've also observed that performance scales linearly with the clock (our testing platforms all seem to have a sufficient power supply).
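
To compare a measurement taken in one clock mode against the published 2.2 GHz numbers, the linear scaling described above can be applied directly. This is only a minimal sketch: `rescale_gflops` is a hypothetical helper name, and it assumes the proportionality reported in this thread holds on your machine.

```python
# Hypothetical helper: rescale a measured GEMM rate between A64FX clock modes,
# assuming throughput is proportional to clock frequency (as reported above).

CLOCK_MODES_GHZ = (1.6, 1.8, 2.0, 2.2)  # clock modes listed in this thread


def rescale_gflops(measured_gflops, measured_ghz, target_ghz=2.2):
    """Estimate the rate at target_ghz from a measurement taken at measured_ghz."""
    if measured_ghz <= 0 or target_ghz <= 0:
        raise ValueError("clock frequencies must be positive")
    return measured_gflops * (target_ghz / measured_ghz)


# Example call with a made-up measurement (illustration only):
# estimate = rescale_gflops(100.0, measured_ghz=2.0)
```

If a rescaled number still falls well short of the published one, the clock alone probably does not explain the gap, and the library version is the next thing to check.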

Hope this info helps.

Btw, perhaps one more advertisement: there's also an (FP16) shgemm SVE kernel here in BLIS. I can port it to a sandbox if your application would benefit from it :D

Hi @xrq-phys
Thanks for your reply!

> The API used here is just standard BLAS API for SSL2.

Which version of the API did you use? I have read Fugaku's SSL2 documentation and found that there are two versions of gemm in SSL2 (toolchain version 1.33): a Fortran version and a cblas version. After testing both (at the same frequency, 2.0 GHz), I found some performance differences between the Fortran version and the cblas version. So I'm just wondering which version you used when you obtained the performance results.
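
When comparing the two interfaces, it helps to put both timings on the same footing. Here is a minimal sketch using the conventional 2·m·n·k operation count for gemm; the function names are hypothetical, and `run` stands for whatever callable launches one sgemm call in your own harness.

```python
import time


def gemm_flops(m, n, k):
    """Operation count for C := alpha*A*B + beta*C, using the usual 2*m*n*k convention."""
    return 2 * m * n * k


def gflops(m, n, k, seconds):
    """Convert one timed gemm call into GFLOPS."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return gemm_flops(m, n, k) / seconds / 1e9


def best_time(run, repeats=5):
    """Return the fastest wall-clock time (in seconds) over several calls to run()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()  # hypothetical: one sgemm call through the interface under test
        best = min(best, time.perf_counter() - start)
    return best
```

Taking the best of several repeats for each interface, with the same m, n, k, clock mode and thread count, keeps the comparison about the library rather than about run-to-run noise.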

I see.
The test uses Fortran BLAS API.

Did you observe cblas to be better?

Actually, the performance depends on the specific case. In some cases the cblas version is better, but I think the Fortran version performs better in most cases.

It seems the Fortran version doesn't support multiple cores? I used export OMP_NUM_THREADS to adjust the number of threads, but it had no effect... How did you achieve this with the Fortran version?

cf. https://github.com/flame/blis/blob/master/docs/Performance.md#a64fx-experiment-details

There is an environment variable called NPARALLEL that controls the threading behavior of -SSL2BLAMP. OMP_NUM_THREADS should also work. I'm not sure about the specifics.

It's quite surprising that SSL2's cblas isn't a simple wrapper over the Fortran one or vice versa.
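
Since it is unclear which variable the threaded library honors, one option is to set both when launching the benchmark. The sketch below does only that: the helper names and the benchmark command are hypothetical, and whether SSL2 actually reads NPARALLEL or OMP_NUM_THREADS is an assumption taken from this thread, not something verified here.

```python
import os
import subprocess


def threaded_env(num_threads, base_env=None):
    """Return a copy of the environment with both thread-count variables set."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(num_threads)  # standard OpenMP variable
    env["NPARALLEL"] = str(num_threads)        # variable named in this thread (unverified)
    return env


def run_benchmark(command, num_threads):
    """Run a benchmark command (hypothetical binary) with the thread count applied."""
    return subprocess.run(command, env=threaded_env(num_threads), check=True)


# Example call, assuming a benchmark executable of your own:
# run_benchmark(["./sgemm_bench"], num_threads=12)
```

Timing the same problem size at 1 thread and at several threads is a quick way to tell whether either variable has any effect.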

Thanks! I have tried it again. The sgemm API I used is VMGGM, but it didn't work. Here are my compile options:

frt -Kfast,openmp -SSL2BLAMP -KSVE main.f

I guess they were developed by different teams. Maybe the cblas version is based on another open-source library.

I'm afraid I've no idea what VMGGM is.

Btw, the discussion here seems to be SSL2-specific and thus outside the scope of BLIS. @zhuangbility111 I believe we'd better discuss this elsewhere. You can use the institutional email posted on my profile to reach me, and close this issue if that's OK. Thanks.

Thank you very much!