intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

vllm-cpu bug - 'Qwen2Attention' object has no attribute 'kv_scale'

bratao opened this issue · comments

When running a Qwen1.5 model, it loads, but this error is raised when serving:

handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7099f9f78cc0>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.cpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7099f9480f50>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
                               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/tasks.py", line 484, in wait_for
    return fut.result()
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step
    request_outputs = await self.engine.step_async()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 221, in step_async
    output = await self.model_executor.execute_model_async(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 101, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/worker/cpu_worker.py", line 290, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/worker/cpu_model_runner.py", line 332, in execute_model
    hidden_states = model_executable(**execute_model_kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 316, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 253, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 206, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/cpu/model_convert.py", line 50, in _Attention_forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata, self.kv_scale)
                                                              ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Qwen2Attention' object has no attribute 'kv_scale'

Hi @bratao,
I have reproduced this error; it is caused by a mismatch in our conversion of Qwen2Attention.forward(). We are working on a fix and will update here once it is resolved.
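For readers hitting the same traceback: the failing line is in ipex-llm's patched `_Attention_forward`, which reads `self.kv_scale` on the attention module, while vLLM 0.4.2's `Qwen2Attention` does not define that attribute (so `nn.Module.__getattr__` raises `AttributeError`). The stubs below are a minimal, hypothetical sketch of the mismatch and of one defensive fix (falling back to a default scale via `getattr`); they are not the actual ipex-llm code, and the class/function names are illustrative only.

```python
# Sketch of the failure: a shared patched forward assumes every attention
# module carries `kv_scale`, but some models (e.g. Qwen2 in vLLM 0.4.2)
# never set it.

class Qwen2AttentionStub:
    """Stands in for vLLM's Qwen2Attention: defines no `kv_scale`."""
    pass

class LlamaAttentionStub:
    """Stands in for a model whose attention module does set the attribute."""
    def __init__(self):
        self.kv_scale = 1.0

def patched_forward_buggy(attn_module):
    # Mirrors the failing access in model_convert.py:
    #   attn_output = self.attn(q, k, v, kv_cache, attn_metadata, self.kv_scale)
    # Raises AttributeError when the module lacks `kv_scale`.
    return attn_module.kv_scale

def patched_forward_fixed(attn_module):
    # Defensive lookup: fall back to the neutral scale 1.0 when the module
    # does not define one (the likely shape of the fix, not the real patch).
    return getattr(attn_module, "kv_scale", 1.0)
```

With the guarded lookup, both module variants resolve a scale, while the original access reproduces the `AttributeError` from the traceback.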

Hi @bratao,
We have fixed this bug. Please install the latest ipex-llm (2.1.0b20240610 or newer), or use our updated Docker image intelanalytics/ipex-llm-serving-cpu:2.1.0-SNAPSHOT, and try again.