intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

[Windows] Qwen1.5-7B 8K support

juan-OY opened this issue

Qwen1.5-7B runs out of memory (OOM) with 8K input. After commenting out `logits = logits.float()` in qwen1.5\Lib\site-packages\transformers\models\qwen2\modeling_qwen2.py, it runs, and memory usage drops a lot — does removing this line affect the model in any other way?
Also, can the overall memory consumption of this model be optimized?

Update from the user: under the same test conditions with roughly 6k input, memory usage is 9.03GB for Qwen1-7B vs. 13.5GB for Qwen1.5-7B, and 8K input still OOMs; memory usage needs to be optimized.

This memory difference is mainly caused by `logits = logits.float()`, which is only used in qwen2:
https://huggingface.co/Qwen/Qwen-7B-Chat/blob/main/modeling_qwen.py#L1061
https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2/modeling_qwen2.py#L1172
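For a sense of scale, the fp32 copy of the logits alone explains most of the gap at long inputs. A rough back-of-the-envelope calculation (the vocabulary size below is an assumption for Qwen1.5-7B):

```python
# Rough size of the logits tensor that Qwen2ForCausalLM materializes in forward().
# The vocabulary size is an assumption for Qwen1.5-7B.
seq_len = 6 * 1024            # ~6k-token prompt
vocab_size = 151_936          # Qwen1.5 vocabulary (assumed)
fp16_gib = seq_len * vocab_size * 2 / 1024**3
fp32_gib = seq_len * vocab_size * 4 / 1024**3
print(f"fp16 logits: {fp16_gib:.2f} GiB")   # ~1.7 GiB
print(f"fp32 logits: {fp32_gib:.2f} GiB")   # ~3.5 GiB extra peak from logits.float()
```

The extra ~3.5 GiB for the fp32 copy lines up reasonably well with the 13.75G → 10.18G drop observed in the 6k-512 test below.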

Test results:

  • 1k-128, w4a16, with low memory mode, cpu_embedding=False

| Model | Load model | Peak memory | Peak memory (delete logits.float()) |
| --- | --- | --- | --- |
| Qwen | 4.99G | 6.42G | 6.42G |
| Qwen1.5 | 5.64G | 7.67G | 7.08G |

  • 6k-512, w4a16, with low memory mode, cpu_embedding=False

| Model | Load model | Peak memory | Peak memory (delete logits.float()) |
| --- | --- | --- | --- |
| Qwen | 4.99G | 9.37G | 9.37G |
| Qwen1.5 | 5.64G | 13.75G | 10.18G |

After removing logits.float(), there is still some increase in memory usage beyond the model-size difference, which may be caused by cached tensors whose size grows with the input length. We can look into the details if needed.
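If we want to drill into that, a sketch along these lines could record the peak XPU allocation as the prompt length grows (assuming an IPEX/XPU build that mirrors the CUDA memory-stats API, and an already-loaded `model` and `tokenizer`):

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  # registers the "xpu" device

def peak_gib_for_length(model, tokenizer, n_tokens):
    """Run a dummy prompt of n_tokens and return the peak XPU allocation in GiB."""
    torch.xpu.empty_cache()
    torch.xpu.reset_peak_memory_stats()
    dummy_ids = torch.full((1, n_tokens), tokenizer.eos_token_id,
                           dtype=torch.long, device="xpu")
    with torch.inference_mode():
        model.generate(dummy_ids, max_new_tokens=1)
    return torch.xpu.max_memory_allocated() / 1024**3

# for n in (1024, 2048, 4096, 6144):
#     print(n, f"{peak_gib_for_length(model, tokenizer, n):.2f} GiB")
```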

Note that the difference in loaded model size is mainly caused by the (rotary_emb): Qwen2RotaryEmbedding() layer: there are 32 rotary_emb layers in qwen1.5 but just 1 in qwen, which results in a 500-600MB difference in memory usage.
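A quick sanity check (with `model` being an already-loaded Qwen or Qwen1.5 causal LM) is simply to count the rotary-embedding modules:

```python
# Count rotary-embedding modules in the loaded model; the class-name match is meant
# to cover both the qwen and qwen1.5 (Qwen2RotaryEmbedding) implementations.
rotary_layers = [name for name, m in model.named_modules()
                 if "RotaryEmbedding" in type(m).__name__]
print(len(rotary_layers))   # expected: 1 for Qwen-7B, 32 for Qwen1.5-7B
```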

Regarding the concern about deleting logits.float(), we may run some accuracy benchmarks (e.g. ppl/C-Eval) later for further verification.
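As a placeholder until then, a minimal perplexity check (not the project's benchmark harness; the dataset and max length below are placeholders) could be run once with and once without the fp32 cast:

```python
import torch

def perplexity(model, tokenizer, text, max_len=2048):
    """Perplexity of `model` on `text`, using the HF shifted-label LM loss."""
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :max_len].to(model.device)
    with torch.inference_mode():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# print(perplexity(model, tokenizer, open("sample.txt").read()))
```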

According to the user's feedback, they hope that memory usage does not exceed 10GB for an 8k-512 input-output pair. Since users look at the combined memory usage of the CPU and GPU, putting the embedding on the CPU may not be applicable here.
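For reference, the cpu_embedding option mentioned in the test settings above is normally passed at load time roughly as follows (keyword names are assumptions based on the ipex-llm docs; verify against the installed version). It keeps the embedding table in host memory, so it lowers GPU memory but not the combined CPU+GPU footprint the user is tracking:

```python
# Sketch: load Qwen1.5-7B with ipex-llm's low-bit path and the embedding kept on CPU.
# The exact keyword names are assumptions based on the ipex-llm documentation.
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-7B-Chat",
    load_in_4bit=True,        # 4-bit weight quantization
    cpu_embedding=True,       # keep the embedding layer in host memory
    trust_remote_code=True,
).to("xpu")
```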