[Windows] Qwen1.5-7B 8K support
juan-OY opened this issue · comments
Qwen1.5-7B runs out of memory (OOM) with 8K input. After modifying qwen1.5\Lib\site-packages\transformers\models\qwen2\modeling_qwen2.py to comment out
#logits = logits.float()
it runs, and memory usage drops considerably. Does this change affect the model in any other way?
Can the overall memory consumption of this model be optimized?
Update from the user: under the same test conditions with roughly 6k input, memory usage is 9.03 GB for Qwen1-7B versus 13.5 GB for Qwen1.5-7B, and 8K input OOMs; memory usage needs optimization.
This memory difference is mainly caused by `logits = logits.float()`, which is only used in qwen2:
https://huggingface.co/Qwen/Qwen-7B-Chat/blob/main/modeling_qwen.py#L1061
https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2/modeling_qwen2.py#L1172
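A back-of-the-envelope estimate shows why this one cast is so expensive at long sequence lengths. It assumes Qwen1.5-7B's vocabulary size of 151936 and a model running in fp16 with batch size 1; the numbers are illustrative, not measured:

```python
# Rough estimate of the extra memory allocated by `logits = logits.float()`.
# Assumptions: vocab_size = 151936 (Qwen1.5-7B), fp16 model, batch size 1,
# and logits computed over the full prompt during prefill.
def logits_cast_overhead_gib(seq_len, vocab_size=151936):
    fp16_bytes = seq_len * vocab_size * 2   # logits as produced by lm_head
    fp32_bytes = seq_len * vocab_size * 4   # new tensor created by .float()
    # Both tensors are alive at the moment of the cast.
    return fp16_bytes / 2**30, fp32_bytes / 2**30

fp16_gib, fp32_gib = logits_cast_overhead_gib(8192)
print(f"fp16 logits: {fp16_gib:.2f} GiB, extra fp32 copy: {fp32_gib:.2f} GiB")
```

Under these assumptions an 8K prompt adds a ~4.6 GiB fp32 copy on top of a ~2.3 GiB fp16 logits tensor; at 6k input the fp32 copy is roughly 3.5 GiB, which is roughly in line with the 13.75G → 10.18G drop observed below after deleting the cast.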
Test results:
- 1k-128, w4a16, with low memory mode, cpu_embedding=False
| Model | Load model | Peak memory | Peak memory (delete logits.float()) |
|---|---|---|---|
| Qwen | 4.99G | 6.42G | 6.42G |
| Qwen1.5 | 5.64G | 7.67G | 7.08G |
- 6k-512, w4a16, with low memory mode, cpu_embedding=False
| Model | Load model | Peak memory | Peak memory (delete logits.float()) |
|---|---|---|---|
| Qwen | 4.99G | 9.37G | 9.37G |
| Qwen1.5 | 5.64G | 13.75G | 10.18G |
After removing `logits.float()`, there is still some increase in memory usage beyond the model-size difference, which may be caused by cached tensors whose size grows with the input length. We can look into the details if needed.
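One input-dependent allocation common to both models is the KV cache. A rough sizing sketch, assuming an fp16 cache and Qwen1.5-7B's configuration (32 layers, 32 KV heads, head_dim 128); these are illustrative numbers, not measurements:

```python
# Estimated KV cache size for a 7B model without grouped-query attention.
# Assumptions: fp16 cache, 32 layers, 32 KV heads, head_dim = 128, batch 1.
def kv_cache_gib(seq_len, layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    # 2 tensors (K and V) per layer, each of shape [kv_heads, seq_len, head_dim].
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes / 2**30

print(f"6K context: {kv_cache_gib(6144):.1f} GiB")
print(f"8K context: {kv_cache_gib(8192):.1f} GiB")
```

Since both qwen and qwen1.5 at 7B cache K/V the same way, this alone would not explain the remaining gap between the two models; profiling would still be needed to find the model-specific cached tensor.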
Note that the difference in loaded model size is mainly caused by the `(rotary_emb): Qwen2RotaryEmbedding()` layer: there are 32 `rotary_emb` layers in qwen1.5 while there is just 1 in qwen, which results in a 500-600MB difference in memory usage.
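A quick estimate is consistent with that figure, assuming each `Qwen2RotaryEmbedding` pre-computes fp16 cos and sin tables of shape `[max_position_embeddings, head_dim]`, with max_position_embeddings = 32768 and head_dim = 128 for Qwen1.5-7B. This is a sketch of the bookkeeping, not a measurement:

```python
# Estimated memory held by rotary-embedding cos/sin caches.
# Assumptions: fp16 tables, max_position_embeddings = 32768, head_dim = 128.
def rotary_cache_mib(n_layers, max_pos=32768, head_dim=128, dtype_bytes=2):
    per_layer = 2 * max_pos * head_dim * dtype_bytes  # cos table + sin table
    return n_layers * per_layer / 2**20

print(f"qwen1.5 (32 per-layer copies): {rotary_cache_mib(32):.0f} MiB")
print(f"qwen (1 shared copy):          {rotary_cache_mib(1):.0f} MiB")
```

Because the tables are identical across layers, sharing a single rotary module (as qwen does) would reclaim most of this, if the implementation allows it.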
Regarding the concern about deleting `logits.float()`, we may run some accuracy benchmarks (e.g. ppl/ceval) later for further verification.
According to the users' feedback, they hope memory usage will not exceed 10GB for an 8k-512 input-output pair. Since users observe the combined memory usage of the CPU and GPU, the approach of putting the embedding on the CPU may not be applicable.
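For scale, here is an estimate of the embedding table that `cpu_embedding` would offload, assuming Qwen1.5-7B's vocab_size of 151936 and hidden_size of 4096 in fp16 (an estimate, not a measurement):

```python
# Estimated size of the token-embedding table that cpu_embedding offloads.
# Assumptions: vocab_size = 151936, hidden_size = 4096, fp16 weights.
def embedding_table_gib(vocab=151936, hidden=4096, dtype_bytes=2):
    return vocab * hidden * dtype_bytes / 2**30

print(f"embedding table: {embedding_table_gib():.2f} GiB")
```

Under these assumptions, offloading moves roughly 1.2 GiB from GPU to CPU memory, but the combined total that these users are watching stays the same, which is why the option does not help here.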