vllm-cpu bug - 'Qwen2Attention' object has no attribute 'kv_scale'
bratao opened this issue · comments
When running a Qwen1.5 model, it loads but throws this error when serving:
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7099f9f78cc0>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.cpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7099f9480f50>>)>
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
task.result()
File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/asyncio/tasks.py", line 484, in wait_for
return fut.result()
^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step
request_outputs = await self.engine.step_async()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 221, in step_async
output = await self.model_executor.execute_model_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 101, in execute_model_async
output = await make_async(self.driver_worker.execute_model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/worker/cpu_worker.py", line 290, in execute_model
output = self.model_runner.execute_model(seq_group_metadata_list,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/worker/cpu_model_runner.py", line 332, in execute_model
hidden_states = model_executable(**execute_model_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 316, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 253, in forward
hidden_states, residual = layer(
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 206, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/cpu/model_convert.py", line 50, in _Attention_forward
attn_output = self.attn(q, k, v, kv_cache, attn_metadata, self.kv_scale)
^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1709, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Qwen2Attention' object has no attribute 'kv_scale'
Hi @bratao,
I have reproduced this error; it's a mismatch in our conversion of Qwen2Attention.forward(). We are working on a fix and will update here once it is resolved.
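For anyone curious about the failure mode: the patched `_Attention_forward` reads `self.kv_scale`, but the CPU build of vLLM's `Qwen2Attention` never sets that attribute, so `nn.Module.__getattr__` raises. A minimal torch-free sketch (the stub class and attribute names here are illustrative, not the actual ipex-llm code; 1.0 is vLLM's default kv scale):

```python
class Qwen2AttentionStub:
    """Hypothetical stand-in for vLLM's Qwen2Attention; in the
    affected CPU build it never sets a `kv_scale` attribute."""
    pass

attn = Qwen2AttentionStub()

# What the patched forward effectively does -- raises AttributeError,
# because the attribute was never assigned on the module:
try:
    scale = attn.kv_scale
except AttributeError as exc:
    print(exc)

# A defensive lookup that falls back to the default scale of 1.0:
scale = getattr(attn, "kv_scale", 1.0)
print(scale)
```

This is just to show why the traceback ends in `__getattr__`; the real fix lands in ipex-llm's model conversion, not user code.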
Hi @bratao,
We have fixed this bug. Please install the latest ipex-llm (2.1.0b20240610 or newer), or use our updated Docker image intelanalytics/ipex-llm-serving-cpu:2.1.0-SNAPSHOT, and try again.
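For reference, picking up the fix would look something like this (the `--pre --upgrade` pip invocation is an assumption based on the beta version string above; the Docker image name is taken verbatim from the reply):

```shell
# Upgrade to the pre-release ipex-llm wheel containing the fix
pip install --pre --upgrade ipex-llm

# Or pull the updated CPU serving image mentioned above
docker pull intelanalytics/ipex-llm-serving-cpu:2.1.0-SNAPSHOT
```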