intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

[Windows] Qwen1.5-7B 8K support

juan-OY opened this issue

Qwen1.5-7B runs out of memory (OOM) with 8K input. After commenting out `logits = logits.float()` in qwen1.5\Lib\site-packages\transformers\models\qwen2\modeling_qwen2.py, it runs, and memory usage drops a lot — does removing this line affect the model in any other way?
Also, can the overall memory consumption of this model be optimized?

Update from the user: under the same test conditions with roughly 6k input, memory usage is 9.03GB for Qwen1-7B vs. 13.5GB for Qwen1.5-7B, and 8K input still OOMs; memory usage needs to be optimized.

This memory difference is mainly caused by `logits = logits.float()`, which is only used in qwen2:
https://huggingface.co/Qwen/Qwen-7B-Chat/blob/main/modeling_qwen.py#L1061
https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2/modeling_qwen2.py#L1172
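For a sense of scale, the fp32 copy of the logits alone explains most of the gap at long inputs. A rough back-of-the-envelope calculation (the vocabulary size below is an assumption for Qwen1.5-7B):

```python
# Rough size of the logits tensor that Qwen2ForCausalLM materializes in forward().
# The vocabulary size is an assumption for Qwen1.5-7B.
seq_len = 6 * 1024            # ~6k-token prompt
vocab_size = 151_936          # Qwen1.5 vocabulary (assumed)
fp16_gib = seq_len * vocab_size * 2 / 1024**3
fp32_gib = seq_len * vocab_size * 4 / 1024**3
print(f"fp16 logits: {fp16_gib:.2f} GiB")   # ~1.7 GiB
print(f"fp32 logits: {fp32_gib:.2f} GiB")   # ~3.5 GiB extra peak from logits.float()
```

The extra ~3.5 GiB for the fp32 copy lines up reasonably well with the 13.75G → 10.18G drop observed in the 6k-512 test below.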

Test results:

  • 1k-128, w4a16, with low memory mode, cpu_embedding=False

| Model | Load model | Peak memory | Peak memory (delete logits.float()) |
| --- | --- | --- | --- |
| Qwen | 4.99G | 6.42G | 6.42G |
| Qwen1.5 | 5.64G | 7.67G | 7.08G |

  • 6k-512, w4a16, with low memory mode, cpu_embedding=False

| Model | Load model | Peak memory | Peak memory (delete logits.float()) |
| --- | --- | --- | --- |
| Qwen | 4.99G | 9.37G | 9.37G |
| Qwen1.5 | 5.64G | 13.75G | 10.18G |

After removing logits.float(), there is still some increase in memory usage beyond the model-size difference, which may be caused by cached tensors whose size grows with the input length. We can look into the details if needed.
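If we want to drill into that, a sketch along these lines could record the peak XPU allocation as the prompt length grows (assuming an IPEX/XPU build that mirrors the CUDA memory-stats API, and an already-loaded `model` and `tokenizer`):

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  # registers the "xpu" device

def peak_gib_for_length(model, tokenizer, n_tokens):
    """Run a dummy prompt of n_tokens and return the peak XPU allocation in GiB."""
    torch.xpu.empty_cache()
    torch.xpu.reset_peak_memory_stats()
    dummy_ids = torch.full((1, n_tokens), tokenizer.eos_token_id,
                           dtype=torch.long, device="xpu")
    with torch.inference_mode():
        model.generate(dummy_ids, max_new_tokens=1)
    return torch.xpu.max_memory_allocated() / 1024**3

# for n in (1024, 2048, 4096, 6144):
#     print(n, f"{peak_gib_for_length(model, tokenizer, n):.2f} GiB")
```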

Note that the difference in loaded model size is mainly caused by the (rotary_emb): Qwen2RotaryEmbedding() layer: there are 32 rotary_emb layers in qwen1.5 but just 1 in qwen, which results in a 500-600MB difference in memory usage.
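A quick sanity check (with `model` being an already-loaded Qwen or Qwen1.5 causal LM) is simply to count the rotary-embedding modules:

```python
# Count rotary-embedding modules in the loaded model; the class-name match is meant
# to cover both the qwen and qwen1.5 (Qwen2RotaryEmbedding) implementations.
rotary_layers = [name for name, m in model.named_modules()
                 if "RotaryEmbedding" in type(m).__name__]
print(len(rotary_layers))   # expected: 1 for Qwen-7B, 32 for Qwen1.5-7B
```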

Regarding the concern about deleting logits.float(), we may run some accuracy benchmarks (e.g. ppl/C-Eval) later for further verification.
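As a placeholder until then, a minimal perplexity check (not the project's benchmark harness; the dataset and max length below are placeholders) could be run once with and once without the fp32 cast:

```python
import torch

def perplexity(model, tokenizer, text, max_len=2048):
    """Perplexity of `model` on `text`, using the HF shifted-label LM loss."""
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :max_len].to(model.device)
    with torch.inference_mode():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# print(perplexity(model, tokenizer, open("sample.txt").read()))
```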

According to the user's feedback, they hope that memory usage does not exceed 10GB for an 8k-512 input-output pair. Since users look at the combined memory usage of the CPU and GPU, putting the embedding on the CPU may not be applicable here.
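For reference, the cpu_embedding option mentioned in the test settings above is normally passed at load time roughly as follows (keyword names are assumptions based on the ipex-llm docs; verify against the installed version). It keeps the embedding table in host memory, so it lowers GPU memory but not the combined CPU+GPU footprint the user is tracking:

```python
# Sketch: load Qwen1.5-7B with ipex-llm's low-bit path and the embedding kept on CPU.
# The exact keyword names are assumptions based on the ipex-llm documentation.
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-7B-Chat",
    load_in_4bit=True,        # 4-bit weight quantization
    cpu_embedding=True,       # keep the embedding layer in host memory
    trust_remote_code=True,
).to("xpu")
```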