CNugteren / CLBlast

Tuned OpenCL BLAS

HGEMM performance in Adreno(tm) 740 is not faster than SGEMM

cunyangwei opened this issue

I built CLBlast for Android. It runs on the Adreno(tm) 740, but I found that HGEMM does not show a significant speedup over SGEMM. For example, when I use

/clblast_client_xgemm --m 4096 --n 4096 --k 4096 --precision 16 --device 0 --platform 0

the performance is 604.8 GFLOPS.

However, when I use

/clblast_client_xgemm --m 4096 --n 4096 --k 4096 --precision 32 --device 0 --platform 0

the performance is 462.8 GFLOPS.

Is that correct? I expected HGEMM to reach around 1 TFLOPS.

It could well be that your hardware is slower in FP16 compared to FP32, even though there are memory bandwidth savings from using less data. However, it could also be that the CLBlast FP16 code is sub-optimal. I suggest you compile and run the tuners (see the docs), in particular for FP16, and perhaps even for the 4096x4096 matrices you are interested in. That should reveal whether you can achieve 1 TFLOPS on your device.
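
To put the reported numbers in perspective, here is a minimal sketch of the arithmetic behind them. It assumes the usual GEMM convention of counting 2·m·n·k floating-point operations per call (an assumption here, not something confirmed in this thread), and the names `flop` and `seconds_per_gemm` are only illustrative. Under that assumption, the FP16 run is roughly 1.31x faster than the FP32 run, well short of the 2x one might hope for from halving the data size.

```python
# Rough arithmetic on the figures reported in this issue.
# Assumption: GFLOPS is derived as 2*m*n*k operations divided by run time.
m = n = k = 4096
flop = 2 * m * n * k  # one multiply and one add per inner-product term

hgemm_gflops = 604.8    # reported for --precision 16
sgemm_gflops = 462.8    # reported for --precision 32
target_gflops = 1000.0  # the 1 TFLOPS that was hoped for


def seconds_per_gemm(gflops):
    """Time for one 4096x4096x4096 GEMM at the given throughput."""
    return flop / (gflops * 1e9)


speedup = hgemm_gflops / sgemm_gflops
fraction_of_target = hgemm_gflops / target_gflops

print(f"FP16 vs FP32 speedup: {speedup:.2f}x")
print(f"FP16 time per GEMM:   {seconds_per_gemm(hgemm_gflops) * 1e3:.1f} ms")
print(f"FP32 time per GEMM:   {seconds_per_gemm(sgemm_gflops) * 1e3:.1f} ms")
print(f"Time needed at 1 TFLOPS: {seconds_per_gemm(target_gflops) * 1e3:.1f} ms")
print(f"Fraction of 1 TFLOPS reached in FP16: {fraction_of_target:.0%}")
```

Comparing the measured FP16 time against the time needed at 1 TFLOPS gives a concrete target to check the tuner results against.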