intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.


Qwen-7B-Chat fails with a larger 6.7k input on the second or third run

juan-OY opened this issue

Running one task on MTL with -i 6707 -o 160
shows OOM on MTL, while a similar command passed in previous testing.

Traceback (most recent call last):
  File "C:\multi-modality\cvte_qwen\ultra_test_code_and_data\benchmark_test2intel\speed_test_ultra.py", line 241, in <module>
    infer_test(model, tokenizer, input_token_num, output_token_num, total_speed_file)
  File "C:\multi-modality\cvte_qwen\ultra_test_code_and_data\benchmark_test2intel\speed_test_ultra.py", line 108, in infer_test
    prefill_output = model(**model_inputs)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Intel/.cache\huggingface\modules\transformers_modules\Qwen-7B-Chat-sym_int4\modeling_qwen.py", line 1060, in forward
    lm_logits = self.lm_head(hidden_states)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Intel\miniconda3\envs\qwen\lib\site-packages\ipex_llm\transformers\low_bit_linear.py", line 703, in forward
    result = linear_q4_0.forward_new(x_2d, self.weight.data, self.weight.qtype,
RuntimeError: XPU out of memory. Tried to allocate 2.37 GiB (GPU 0; 14.48 GiB total capacity; 6.94 GiB already allocated; 8.04 GiB reserved in total by PyTorch)

To minimize MTL's memory usage, you can place the embedding table in CPU memory by setting cpu_embedding=True when calling from_pretrained or load_low_bit; Qwen's embedding is about 1 GB.
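For reference, a minimal sketch of loading the model this way (the model path and generation settings below are placeholders, not taken from the issue):

  import torch
  from transformers import AutoTokenizer
  from ipex_llm.transformers import AutoModelForCausalLM

  # Placeholder path; point this at your local Qwen-7B-Chat (or saved sym_int4) directory.
  model_path = "Qwen/Qwen-7B-Chat"

  # Load with 4-bit (sym_int4) weights and keep the ~1 GB embedding table in CPU
  # memory, so it does not count against the iGPU's shared memory budget.
  # If you saved a low-bit checkpoint, AutoModelForCausalLM.load_low_bit(...) also
  # accepts cpu_embedding=True.
  model = AutoModelForCausalLM.from_pretrained(
      model_path,
      load_in_4bit=True,
      cpu_embedding=True,
      trust_remote_code=True,
  )
  model = model.to("xpu")

  tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

  with torch.inference_mode():
      inputs = tokenizer("Hello", return_tensors="pt").to("xpu")
      output = model.generate(**inputs, max_new_tokens=32)
      print(tokenizer.decode(output[0], skip_special_tokens=True))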

We can close it; the issue cannot be reproduced anymore.