Qwen-7B-Chat fails with a larger 6.7k input on the second or third run
juan-OY opened this issue · comments
Running one task on MTL with -i 6707 -o 160 hits OOM, while a similar command passed in previous testing.
```
Traceback (most recent call last):
  File "C:\multi-modality\cvte_qwen\ultra_test_code_and_data\benchmark_test2intel\speed_test_ultra.py", line 241, in <module>
    infer_test(model, tokenizer, input_token_num, output_token_num, total_speed_file)
  File "C:\multi-modality\cvte_qwen\ultra_test_code_and_data\benchmark_test2intel\speed_test_ultra.py", line 108, in infer_test
    prefill_output = model(**model_inputs)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Intel/.cache\huggingface\modules\transformers_modules\Qwen-7B-Chat-sym_int4\modeling_qwen.py", line 1060, in forward
    lm_logits = self.lm_head(hidden_states)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\ipex_llm\transformers\low_bit_linear.py", line 703, in forward
    result = linear_q4_0.forward_new(x_2d, self.weight.data, self.weight.qtype,
RuntimeError: XPU out of memory. Tried to allocate 2.37 GiB (GPU 0; 14.48 GiB total capacity; 6.94 GiB already allocated; 8.04 GiB reserved in total by PyTorch)
```
To minimize MTL's memory usage, you can put the embedding in CPU memory by setting `cpu_embedding=True` when calling `from_pretrained` or `load_low_bit`. Qwen's embedding is about 1 GB.
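A minimal sketch of the suggested loading call, assuming an `ipex_llm` install and a local low-bit Qwen checkpoint (the path below is a placeholder, not from the original report):

```python
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Hypothetical local path to the sym_int4 checkpoint from the traceback.
model_path = "Qwen-7B-Chat-sym_int4"

# cpu_embedding=True keeps the ~1 GB embedding table in host memory,
# reducing XPU memory pressure on MTL during long-prompt prefill.
model = AutoModelForCausalLM.load_low_bit(
    model_path,
    trust_remote_code=True,
    cpu_embedding=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model = model.to("xpu")
```

The same `cpu_embedding=True` keyword can be passed to `from_pretrained` when loading and quantizing in one step.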
We can close it; the issue can no longer be reproduced.