intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library

Problem with 'generate_with_static_shape' function

xduzhangjiayu opened this issue · comments

Hi, I have some questions about LLM inference on the NPU.

I first quantized the Qwen-7B model to INT4 with script/export.py, then measured inference performance with script/profile_llm.py. I noticed that the generation function used in that script is 'generate_with_static_shape', and I have the following questions:

  1. What is the difference between 'generate_with_static_shape' and 'model.generate()'? The documentation suggests that 'generate_with_static_shape' speeds up inference on the NPU, but in my tests the generation speed is almost the same for both (see the sketch after this list for my rough understanding of the static-shape idea).
  2. 'generate_with_static_shape' takes a use_past parameter. If I set it to True, it seems to generate wrong tokens; if I set it to False, the output looks correct but generation is extremely slow.
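
My rough understanding of the static-shape idea is the following (an illustrative sketch only, not the library's actual implementation; MAX_SEQ_LEN and the helper name are made up, and a HuggingFace-style causal LM is assumed):

```python
# Illustrative sketch of static-shape generation (not the library's code):
# every forward pass is padded to one fixed length, so the graph compiled
# for the NPU can be reused instead of being recompiled for each new shape.
import torch
import torch.nn.functional as F

MAX_SEQ_LEN = 512  # hypothetical fixed length the graph is compiled for

def static_shape_step(model, input_ids, pad_token_id):
    seq_len = input_ids.shape[1]
    # Pad the real tokens out to MAX_SEQ_LEN; the attention mask hides the padding.
    padded_ids = F.pad(input_ids, (0, MAX_SEQ_LEN - seq_len), value=pad_token_id)
    attention_mask = torch.zeros(1, MAX_SEQ_LEN, dtype=torch.long)
    attention_mask[:, :seq_len] = 1
    position_ids = torch.arange(MAX_SEQ_LEN).unsqueeze(0)
    out = model(input_ids=padded_ids,
                attention_mask=attention_mask,
                position_ids=position_ids)
    # The next token is read from the last *real* position, not the padded tail.
    return out.logits[:, seq_len - 1, :].argmax(dim=-1, keepdim=True)
```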

Any comment or advice is appreciated, thank you!

Hi,
For question 2, I think the problem is that the 'position_ids' passed in the following lines are not correct. I changed the call to:

```python
out = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    position_ids=position_ids,
    past_key_values=past_key_values,
)
```

With that change, it now seems to generate correct answers with use_past=True.
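
For anyone hitting the same issue, the key point is that with a KV cache only the newest token is fed to the model, so its position id must be its absolute position in the sequence rather than a fresh range starting at 0. A minimal sketch of one decoding step, assuming a HuggingFace-style causal LM (variable and function names here are mine, not the library's):

```python
# Sketch of a single KV-cache decoding step with correct position_ids.
import torch

def kv_cache_step(model, next_token, attention_mask, past_key_values):
    # attention_mask covers all tokens generated so far, including this one,
    # so the absolute position of the new token is total_len - 1.
    total_len = attention_mask.shape[1]
    position_ids = torch.tensor([[total_len - 1]], dtype=torch.long)
    out = model(
        input_ids=next_token,          # shape (1, 1): only the newest token
        attention_mask=attention_mask, # shape (1, total_len)
        position_ids=position_ids,     # absolute position, not a 0-based range
        past_key_values=past_key_values,
    )
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    return next_id, out.past_key_values
```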

Thanks, you are right. Feel free to open a PR to fix it.

Hi,
I have opened a pull request, please check it. Thanks!

Merged, thank you very much for your contribution!