Qwen-7B-Chat on Xeon+ARC770
jianweimama opened this issue · comments
Qwen-7B-Chat with precision INT4_SYM and input/output tokens 1024/128 runs on ARC with the numbers below.
However, Qwen-7B-Chat with precision FP16 and input/output tokens 1024/128 fails with an out-of-memory error. By comparison,
Llama2-7B-Chat with precision FP16 and input/output tokens 1024/128 runs fine. Is this expected? What causes the difference in memory usage between these two models?
--------------------log of Qwen-7B-Chat Precision (FP16) + Input/output token (1024/128)-------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/usr/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "/benchmark/all-in-one/run.py", line 55, in run_model_in_thread
output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=out_len,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/utils/benchmark_util.py", line 1563, in generate
return self.greedy_search(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/utils/benchmark_util.py", line 2385, in greedy_search
outputs = self(
^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/utils/benchmark_util.py", line 533, in __call__
return self.model(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 1060, in forward
lm_logits = self.lm_head(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/low_bit_linear.py", line 822, in forward
result = torch.ops.torch_ipex.matmul_bias_out(x, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/_ops.py", line 692, in __call__
return self._op(*args, **kwargs or {})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Allocation is out of device memory on current platform.
Reproduced the issue in our environment; this is expected. One reason Qwen-7B uses more memory than Llama2-7B is its larger vocabulary size. If you want to run FP16 precision, you can set cpu_embedding: True in config.yaml while using the transformer_int4_fp16_gpu API. There may be other reasons contributing to the memory-usage difference; you can also try low-memory mode with export IPEX_LLM_LOW_MEM=1 to save more memory.
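The vocabulary-size explanation can be made concrete with back-of-envelope arithmetic. This is a rough sketch, not a full memory accounting: it only sizes the lm_head weight (where the traceback above actually fails) and the logits tensor, using Qwen-7B's ~152K vocabulary versus Llama2-7B's 32K, and hidden size 4096 for both:

```python
# Rough FP16 memory estimate for the output projection (lm_head) and the
# logits it produces. FP16 = 2 bytes per element.
BYTES_FP16 = 2
HIDDEN = 4096  # hidden size of both Qwen-7B and Llama2-7B

def lm_head_bytes(vocab_size: int, hidden: int = HIDDEN) -> int:
    """Size of the vocab_size x hidden output-projection weight matrix."""
    return vocab_size * hidden * BYTES_FP16

def logits_bytes(seq_len: int, vocab_size: int) -> int:
    """Size of the seq_len x vocab_size logits tensor for one forward pass."""
    return seq_len * vocab_size * BYTES_FP16

qwen_vocab, llama_vocab = 151_936, 32_000  # vocab sizes of the two models

print(f"Qwen-7B lm_head:   {lm_head_bytes(qwen_vocab) / 2**30:.2f} GiB")
print(f"Llama2-7B lm_head: {lm_head_bytes(llama_vocab) / 2**30:.2f} GiB")
print(f"Qwen-7B logits for a 1024-token prompt: "
      f"{logits_bytes(1024, qwen_vocab) / 2**20:.0f} MiB")
```

The lm_head alone is roughly 4.7x larger for Qwen-7B, and the embedding table has the same vocab-sized shape, which is why cpu_embedding: True (moving the embedding to host memory) frees enough device memory for FP16 to fit.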
With "cpu_embedding: True" set, Qwen-7B + FP16 + 1024/128 works now.
Thanks for the help!
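For reference, the relevant entries in the all-in-one benchmark's config.yaml would look roughly like the sketch below. The field names are assumed from ipex-llm's all-in-one benchmark layout and should be checked against your copy of the file:

```yaml
# Sketch of the relevant config.yaml fields (assumed names; verify
# against the ipex-llm all-in-one benchmark config you are running).
repo_id:
  - 'Qwen/Qwen-7B-Chat'
in_out_pairs:
  - '1024-128'
test_api:
  - 'transformer_int4_fp16_gpu'   # FP16 GPU benchmark path
cpu_embedding: True               # keep the embedding table in host memory
```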