intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library

Problem with 'generate_with_static_shape' function

xduzhangjiayu opened this issue · comments

Hi, I have some questions about LLM inference on the NPU.

I first quantized the Qwen-7B model to INT4 with script/export.py, then measured inference performance with script/profile_llm.py. I noticed that the generation function used in that script is 'generate_with_static_shape', and I have the following questions:

  1. What is the difference between 'generate_with_static_shape' and 'model.generate()'? The documentation suggests that 'generate_with_static_shape' speeds up inference on the NPU, but in my tests the generation speed is almost the same for both (see the sketch after this list for my rough understanding of the static-shape idea).
  2. 'generate_with_static_shape' takes a use_past parameter. If I set it to True, it seems to generate wrong tokens; if I set it to False, the output looks correct but generation is extremely slow.
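
My rough understanding of the static-shape idea is the following (an illustrative sketch only, not the library's actual implementation; MAX_SEQ_LEN and the helper name are made up, and a HuggingFace-style causal LM is assumed):

```python
# Illustrative sketch of static-shape generation (not the library's code):
# every forward pass is padded to one fixed length, so the graph compiled
# for the NPU can be reused instead of being recompiled for each new shape.
import torch
import torch.nn.functional as F

MAX_SEQ_LEN = 512  # hypothetical fixed length the graph is compiled for

def static_shape_step(model, input_ids, pad_token_id):
    seq_len = input_ids.shape[1]
    # Pad the real tokens out to MAX_SEQ_LEN; the attention mask hides the padding.
    padded_ids = F.pad(input_ids, (0, MAX_SEQ_LEN - seq_len), value=pad_token_id)
    attention_mask = torch.zeros(1, MAX_SEQ_LEN, dtype=torch.long)
    attention_mask[:, :seq_len] = 1
    position_ids = torch.arange(MAX_SEQ_LEN).unsqueeze(0)
    out = model(input_ids=padded_ids,
                attention_mask=attention_mask,
                position_ids=position_ids)
    # The next token is read from the last *real* position, not the padded tail.
    return out.logits[:, seq_len - 1, :].argmax(dim=-1, keepdim=True)
```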

Any comment or advice is appreciated, thank you!

Hi,
For question 2, I think the problem is that the 'position_ids' passed in the following lines are not correct. I changed the call to:

```python
out = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    position_ids=position_ids,
    past_key_values=past_key_values,
)
```

With that change, it now seems to generate correct answers with use_past=True.
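
For anyone hitting the same issue, the key point is that with a KV cache only the newest token is fed to the model, so its position id must be its absolute position in the sequence rather than a fresh range starting at 0. A minimal sketch of one decoding step, assuming a HuggingFace-style causal LM (variable and function names here are mine, not the library's):

```python
# Sketch of a single KV-cache decoding step with correct position_ids.
import torch

def kv_cache_step(model, next_token, attention_mask, past_key_values):
    # attention_mask covers all tokens generated so far, including this one,
    # so the absolute position of the new token is total_len - 1.
    total_len = attention_mask.shape[1]
    position_ids = torch.tensor([[total_len - 1]], dtype=torch.long)
    out = model(
        input_ids=next_token,          # shape (1, 1): only the newest token
        attention_mask=attention_mask, # shape (1, total_len)
        position_ids=position_ids,     # absolute position, not a 0-based range
        past_key_values=past_key_values,
    )
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    return next_id, out.past_key_values
```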

Thanks, you are right. Feel free to open a PR to fix it.

Hi,
I have opened a pull request, please check it. Thanks!

Merged, thank you very much for your contribution!