Problem with 'generate_with_static_shape' function
xduzhangjiayu opened this issue · comments
Hi, I have some questions about the LLM inference on NPU.
I first quantized the Qwen-7B model into INT4 format using script/export.py, then I measured LLM inference performance with script/profile_llm.py. I noticed that the generation function used in that file is 'generate_with_static_shape', and I have the following questions:
- What is the difference between 'generate_with_static_shape' and 'model.generate()'? From the documentation, 'generate_with_static_shape' is supposed to speed up inference on NPU, but in my tests the generation speed is almost the same between 'generate_with_static_shape' and 'model.generate()'.
- 'generate_with_static_shape' has a parameter use_past. If I set it to True, it generates wrong tokens. If I set it to False, the output looks correct but generation is extremely slow.
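For context on the first question, the idea behind static shapes is that NPU runtimes typically compile one graph per input shape, so padding every call to a fixed sequence length lets a single compiled graph be reused for the whole decode loop. A minimal sketch of that padding step (the names `pad_to_static` and `PAD_ID` are illustrative, not from this repo):

```python
# Hypothetical sketch: pad inputs to a fixed length so an NPU graph
# compiled for that one shape can serve every generation step.
PAD_ID = 0
MAX_SEQ_LEN = 8  # the sequence length the static graph is compiled for

def pad_to_static(input_ids, max_seq_len=MAX_SEQ_LEN):
    """Right-pad token ids to a fixed length; the mask marks real tokens."""
    if len(input_ids) > max_seq_len:
        raise ValueError("prompt longer than the compiled static shape")
    n_pad = max_seq_len - len(input_ids)
    padded = input_ids + [PAD_ID] * n_pad
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    return padded, attention_mask

padded, mask = pad_to_static([101, 7592, 102])
print(padded)  # [101, 7592, 102, 0, 0, 0, 0, 0]
print(mask)    # [1, 1, 1, 0, 0, 0, 0, 0]
```

If the model ends up recomputing attention over the full padded length at every step (i.e. without a working KV cache), the per-step cost can match dynamic-shape generation, which would be consistent with the similar speeds observed above.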
Any comment or advice is appreciated, thank you!
Hi,
For question 2, I think the problem is that the 'position_ids' passed in the following lines isn't correct. I've changed the call to:
```python
out = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    position_ids=position_ids,
    past_key_values=past_key_values,
)
```
Now it generates the correct answer with use_past=True.
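For anyone hitting the same wrong-token symptom: with a KV cache (use_past=True), each step feeds the model only the newest token, so its position_ids must be the token's absolute position after the cached prefix, not restart from 0. A minimal sketch of the correct bookkeeping (helper names here are illustrative, not from this repo):

```python
# Hypothetical sketch of position_ids bookkeeping during cached decoding.
# Prefill sees the whole prompt; every later step sees one token whose
# position is the number of tokens already in the KV cache.

def prefill_position_ids(prompt_len):
    """Positions 0..n-1 for the initial full-prompt forward pass."""
    return [list(range(prompt_len))]

def next_position_id(past_length):
    """Absolute position of the single new token when use_past=True."""
    return [[past_length]]

print(prefill_position_ids(4))  # [[0, 1, 2, 3]]
print(next_position_id(4))      # [[4]]
print(next_position_id(5))      # [[5]]
```

If position_ids resets to 0 on every cached step, the rotary/positional embedding applied to the new token disagrees with the cached keys, which matches the garbled output reported with use_past=True.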
Thanks, you are right. Feel free to open a PR to fix it.
Hi,
I've opened a pull request, please check it, thanks!
Merged, thank you very much for your contribution!