intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.


Qwen-7B-Chat fails with a larger 6.7k input on the second or third run

juan-OY opened this issue

Running one task on MTL with -i 6707 -o 160
shows OOM on MTL, while a similar command passed in previous testing.

Traceback (most recent call last):
  File "C:\multi-modality\cvte_qwen\ultra_test_code_and_data\benchmark_test2intel\speed_test_ultra.py", line 241, in <module>
    infer_test(model, tokenizer, input_token_num, output_token_num, total_speed_file)
  File "C:\multi-modality\cvte_qwen\ultra_test_code_and_data\benchmark_test2intel\speed_test_ultra.py", line 108, in infer_test
    prefill_output = model(**model_inputs)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Intel/.cache\huggingface\modules\transformers_modules\Qwen-7B-Chat-sym_int4\modeling_qwen.py", line 1060, in forward
    lm_logits = self.lm_head(hidden_states)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\ipex_llm\transformers\low_bit_linear.py", line 703, in forward
    result = linear_q4_0.forward_new(x_2d, self.weight.data, self.weight.qtype,
RuntimeError: XPU out of memory. Tried to allocate 2.37 GiB (GPU 0; 14.48 GiB total capacity; 6.94 GiB already allocated; 8.04 GiB reserved in total by PyTorch)

To minimize MTL's memory usage, you can place the embedding table in CPU memory by setting cpu_embedding=True when calling from_pretrained or load_low_bit; Qwen's embedding is about 1 GB.
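For reference, a minimal sketch of loading the model this way (the model path and generation settings below are placeholders, not taken from the issue):

  import torch
  from transformers import AutoTokenizer
  from ipex_llm.transformers import AutoModelForCausalLM

  # Placeholder path; point this at your local Qwen-7B-Chat (or saved sym_int4) directory.
  model_path = "Qwen/Qwen-7B-Chat"

  # Load with 4-bit (sym_int4) weights and keep the ~1 GB embedding table in CPU
  # memory, so it does not count against the iGPU's shared memory budget.
  # If you saved a low-bit checkpoint, AutoModelForCausalLM.load_low_bit(...) also
  # accepts cpu_embedding=True.
  model = AutoModelForCausalLM.from_pretrained(
      model_path,
      load_in_4bit=True,
      cpu_embedding=True,
      trust_remote_code=True,
  )
  model = model.to("xpu")

  tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

  with torch.inference_mode():
      inputs = tokenizer("Hello", return_tensors="pt").to("xpu")
      output = model.generate(**inputs, max_new_tokens=32)
      print(tokenizer.decode(output[0], skip_special_tokens=True))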

We can close it; the issue cannot be reproduced anymore.