intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library


[Question]NPU is slower than CPU when computing some types of matmul.

Septend-fun opened this issue · comments

Hi experts. I tested an int8 matmul with shape (1,4096)×(4096,4096) and measured a latency of 0.985 ms on the NPU. But when I ran the same matmul (implemented with OpenVINO) on the CPU, the latency was 0.75 ms, so the NPU is slower than the CPU. Is this normal?
I didn't test the int8 matmul on the NPU through OpenVINO, because it failed.

Is there a way I can implement a matmul op on the NPU directly myself, without this repo or OpenVINO? I noticed that matmul ops seem to be implemented in the NPU driver.
Also, is there a tool I can use to measure the NPU's peak bandwidth?

Test environment: Intel Core Ultra 7 155H
Test code: https://github.com/intel/intel-npu-acceleration-library/blob/v1.1.0/script/profile_matmul.py
Test cmd: python profile_matmul.py -b 1 -c 4096 -k 4096 -q
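
For reference, the measurement can be reproduced standalone through the library's documented `compile()` API instead of the profiling script (a minimal sketch, assuming `intel_npu_acceleration_library.compile(model, dtype=torch.int8)` works as shown in the project README; the warm-up and iteration counts here are arbitrary choices):

```python
import time
import torch
import intel_npu_acceleration_library

# Single 4096x4096 linear layer, matching the (1,4096)x(4096,4096) matmul
model = torch.nn.Linear(4096, 4096, bias=False).eval()

# Compile for the NPU with int8 weight quantization, per the README usage
npu_model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

x = torch.rand(1, 4096)

with torch.no_grad():
    for _ in range(10):   # warm-up: first runs include graph compilation
        npu_model(x)
    n = 100
    t0 = time.perf_counter()
    for _ in range(n):
        npu_model(x)
    t1 = time.perf_counter()

print(f"mean latency: {(t1 - t0) / n * 1e3:.3f} ms")
```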

At batch size 1, matmuls are bandwidth bound, since there is no weight reuse: every weight is read from memory and used exactly once. If you try bigger batches you'll see the NPU gain the upper hand fairly quickly. Also, what driver version are you using? The latest driver brought a significant speedup to quantized matmul operations.
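
To make the bandwidth argument concrete, compare operations performed to weight bytes moved. For a (B,4096)×(4096,4096) int8 matmul, traffic is dominated by the 16 MiB weight matrix, so arithmetic intensity grows linearly with the batch size:

```python
# Rough arithmetic intensity of a (B,4096)x(4096,4096) int8 matmul.
# Weight traffic dominates at small B: 4096*4096 int8 values = 16 MiB,
# and each weight is used once per batch row.
for B in (1, 16, 128):
    ops = 2 * B * 4096 * 4096    # one multiply + one accumulate per weight per row
    weight_bytes = 4096 * 4096   # int8: one byte per weight
    print(f"B={B:4d}: {ops / weight_bytes:.0f} ops/byte")
```

At B=1 this is only ~2 ops/byte, so the matmul streams weights at memory speed and the compute units idle; by B=128 it is ~256 ops/byte and the kernel becomes compute bound, which is where the NPU pulls ahead.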

Thanks for your reply. I'm using the latest driver. And is there a tool I can use to measure the NPU's peak bandwidth?
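
As a rough sanity check, absent a dedicated bandwidth tool: since the batch-1 run is dominated by streaming the weight matrix once per inference, an effective bandwidth can be backed out from the measured latency. A sketch using the numbers from this thread:

```python
# Effective bandwidth implied by the batch-1 int8 matmul above:
# the 4096x4096 int8 weight matrix must be read once per inference.
weight_bytes = 4096 * 4096   # ~16.8 MB of int8 weights
latency_s = 0.985e-3         # measured NPU latency reported in this thread
print(f"{weight_bytes / latency_s / 1e9:.1f} GB/s effective")  # ~17 GB/s
```

This is a lower bound on achievable bandwidth (it ignores activations and any compute overlap), but it gives a quick feel for how close the batch-1 case sits to the memory limit.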

Thank you a lot. I'll try it.