intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library

NPU computation is not fully utilized while running LLM models

nigue3025 opened this issue · comments

Hi,

I have been benefiting from this library to run several LLM models individually, such as llama2-7b, mistral-7b, and phi3, on the NPU (Intel Core Ultra 5 125H).
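
For context, I load the models roughly like this (a minimal sketch based on the project's documented usage; the exact compile API, dtype argument, and generation parameters may differ between library releases):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import intel_npu_acceleration_library
import torch

model_id = "meta-llama/Llama-2-7b-chat-hf"

# Load the model on the CPU first, then compile it for the NPU.
# The dtype argument follows the README examples and may vary by release.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).eval()
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

tokenizer = AutoTokenizer.from_pretrained(model_id)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

inputs = tokenizer("What is an NPU?", return_tensors="pt")
_ = model.generate(**inputs, max_new_tokens=128, streamer=streamer)
```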

However, no matter which model is in use, NPU utilization sits at about 30%~45% during token generation; it is never fully occupied.

The following questions may not make sense, but I'd like to ask:

Is it possible to push NPU utilization higher to increase LLM inference speed? Or is there a reason why NPU utilization is limited to about 30%~45% while running the aforementioned LLM models?

Any comment or advice is appreciated!

Hi, not a silly question at all.

I wrote this guide on LLM performance that hopefully clarifies some points. In particular, LLM decoding inference (especially the kv-cached part) is very much DRAM bandwidth bound, so NPU utilization cannot reach 100%. This is true for NPUs and GPUs in general; it is not specific to this library. Also consider that we are still working to improve many of these models, so performance is expected to increase over the next releases.
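
To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch (the weight size, bandwidth, and efficiency figures are illustrative assumptions, not measurements): during decoding, every new token has to stream essentially the whole weight set from DRAM, so the achievable token rate, and with it NPU utilization, is capped by memory bandwidth rather than by compute.

```python
# Roofline-style ceiling for the decode phase. All numbers are illustrative
# assumptions, not measured values for any specific model or system.
weight_bytes = 7e9 * 0.5      # ~7B parameters at ~4 bits/weight -> ~3.5 GB
dram_bandwidth = 89.6e9       # dual-channel DDR5-5600 peak, bytes/s
efficiency = 0.6              # fraction of peak bandwidth realistically achieved

# Each decoded token streams (roughly) the whole weight set from DRAM,
# so the token rate is bounded by bandwidth, not by NPU compute.
max_tokens_per_s = efficiency * dram_bandwidth / weight_bytes
print(f"Decode ceiling: ~{max_tokens_per_s:.0f} tokens/s")  # ~15 tokens/s
```

Once decoding runs at that ceiling, the NPU spends much of its time waiting on DRAM, which is consistent with utilization numbers well below 100%.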

Hi @alessandropalla
Really appreciate your clear explanation and your amazing guide.

Since DRAM speed matters, it seems best to ensure that my DRAM matches the maximum spec the device supports (e.g. DDR5-5600 MT/s for the Core Ultra 5 125H).
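
For reference, the peak theoretical bandwidth implied by the memory spec is just channels × channel width × transfer rate; a tiny sketch (the dual-channel configuration is an assumption about the specific system):

```python
# Peak theoretical DRAM bandwidth from the module spec.
channels = 2               # assumed dual-channel configuration
bus_width_bytes = 8        # 64-bit channel width
transfers_per_s = 5600e6   # DDR5-5600 -> 5600 MT/s

peak_bw = channels * bus_width_bytes * transfers_per_s
print(f"Peak bandwidth: {peak_bw / 1e9:.1f} GB/s")  # 89.6 GB/s
```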

Owing to memory swapping, the performance of the SSD also seems important.

As I run an LLM model on the NPU, I noticed that the RAM attributed to the NPU is fairly low (roughly below 10%), while the RAM attributed to the CPU is much higher. Does that mean the memory footprint of the data compiled and compressed for the NPU is relatively small? Can I infer that the memory used by the model is not large enough to trigger kv-cache swapping between RAM and disk? How do I know when it is time to upgrade my DRAM capacity to avoid extensive memory swapping? And if I successfully avoid massive memory swapping, can I also expect a significant gain in token generation speed?
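
One generic way to check whether the system is actually swapping while the model generates (a sketch using psutil, not something provided by this library) is to sample RAM and swap usage during a run:

```python
# Sample system RAM and swap usage once per second during generation.
# Generic sketch using psutil; not part of intel-npu-acceleration-library.
import time
import psutil

for _ in range(30):  # sample for ~30 seconds while the model is generating
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    print(
        f"RAM used: {vm.used / 2**30:.1f} GiB ({vm.percent}%), "
        f"swap used: {sw.used / 2**30:.1f} GiB ({sw.percent}%)"
    )
    time.sleep(1.0)
```

If swap usage stays flat and RAM never approaches 100% during generation, extra DRAM capacity probably would not change the token rate much; per the bandwidth argument above, faster memory matters more than more memory.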

Once again, many thanks for the reply!

Can you visualize or get metrics on NPU throughput on Linux, or only in the Windows Task Manager? Thanks!
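
Windows Task Manager shows an NPU entry under the Performance tab once the driver is installed. On Linux there is no direct Task Manager equivalent, but recent Intel NPU (ivpu) drivers expose a cumulative busy-time counter in sysfs; the attribute name and PCI path below are assumptions about a particular driver/kernel version and may not exist on every system:

```python
# Very rough NPU utilization sampler for Linux.
# The sysfs path and attribute name are assumptions about recent ivpu driver
# versions (hypothetical on older kernels); adjust to the actual device path.
import time

BUSY_FILE = "/sys/devices/pci0000:00/0000:00:0b.0/npu_busy_time_us"  # assumed path

def read_busy_us() -> int:
    with open(BUSY_FILE) as f:
        return int(f.read().strip())

prev = read_busy_us()
for _ in range(30):
    time.sleep(1.0)
    cur = read_busy_us()
    # busy microseconds over a 1-second window -> percent utilization
    print(f"NPU utilization: {(cur - prev) / 1e4:.1f}%")
    prev = cur
```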