intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

Qwen-7B-Chat on Xeon+ARC770

jianweimama opened this issue

Qwen-7B-Chat at precision INT4_SYM with input/output tokens of 1024/128 can run on the Arc 770 with the numbers below:
[screenshot: INT4_SYM benchmark results]

However, Qwen-7B-Chat at precision FP16 with input/output tokens of 1024/128 cannot run due to an out-of-memory error. As a comparison, Llama2-7B-Chat at FP16 with 1024/128 can run. Is this expected? What causes the difference in memory usage between these two models?
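For context, the all-in-one benchmark in benchmark/all-in-one/run.py drives the ipex-llm Transformers API; the FP16 case corresponds roughly to the minimal sketch below (the model path and prompt are placeholders, and load_in_low_bit="fp16" is assumed to be how FP16 weights are requested here):

```python
# Rough standalone equivalent of the FP16 benchmark case (a sketch, not the harness itself).
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "Qwen/Qwen-7B-Chat"  # placeholder; the benchmark reads from a local model hub

# load_in_low_bit="fp16" keeps the weights in half precision instead of quantizing to INT4.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="fp16",
    trust_remote_code=True,
).to("xpu")  # XPU backend provided by the installed IPEX / ipex-llm stack
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Roughly a 1024-token prompt generating 128 new tokens, matching the 1024/128 case.
input_ids = tokenizer("hello " * 1024, return_tensors="pt").input_ids[:, :1024].to("xpu")
with torch.inference_mode():
    output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=128)
```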

--------------------log of Qwen-7B-Chat Precision (FP16) + Input/output token (1024/128)-------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/benchmark/all-in-one/run.py", line 55, in run_model_in_thread
    output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=out_len,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/utils/benchmark_util.py", line 1563, in generate
    return self.greedy_search(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/utils/benchmark_util.py", line 2385, in greedy_search
    outputs = self(
              ^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/utils/benchmark_util.py", line 533, in __call__
    return self.model(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 1060, in forward
    lm_logits = self.lm_head(hidden_states)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/low_bit_linear.py", line 822, in forward
    result = torch.ops.torch_ipex.matmul_bias_out(x, self.weight, self.bias)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/_ops.py", line 692, in __call__
    return self._op(*args, **kwargs or {})
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Allocation is out of device memory on current platform.

Reproduced the issue in our environment; this is expected. One reason Qwen-7B uses more memory than Llama2-7B is its much larger vocabulary size, which makes the embedding and lm_head weights correspondingly larger (the traceback above fails inside lm_head). Therefore, if you want to run FP16 precision, you can set cpu_embedding: True in config.yaml when using the transformer_int4_fp16_gpu API. There may be other reasons for the difference in memory usage; you can also try low-memory mode with export IPEX_LLM_LOW_MEM=1 to save more memory.
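Outside the benchmark harness, the same workaround maps to keyword arguments on the ipex-llm loader; a minimal sketch, assuming cpu_embedding is the corresponding from_pretrained argument and that the environment variable must be set before the model is loaded:

```python
import os
# Optional low-memory mode mentioned above; set it before loading the model.
os.environ["IPEX_LLM_LOW_MEM"] = "1"

from ipex_llm.transformers import AutoModelForCausalLM

# cpu_embedding=True keeps the embedding table on the host instead of the Arc GPU.
# Qwen's vocabulary (~152K entries) is far larger than Llama2's 32K, so at FP16 the
# embedding alone is on the order of a gigabyte of device memory.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",   # placeholder path
    load_in_low_bit="fp16",
    cpu_embedding=True,
    trust_remote_code=True,
).to("xpu")
```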

with "cpu_embedding: True", Qwen-7B+fp16+1024/128 can work now.

Thanks for the help!