running qwq-32b-awq with A770 * 2 is extremely slow, only 6t/s
jzhang-git opened this issue
Hello!
When I use two A770s to run qwq-32b-awq, the performance is extremely low, only 4-6 tokens/s.
I tried multiple versions of the image, but it didn't help. This is my most recent test.
Here is the hardware environment:
CPU: AMD Ryzen 7 5700X3D
GPU: A770 (16 GB) * 2
RAM: DDR4 64 GB
GPU PCIe lanes: PCIe 4.0 x8 per card (two cards)
Software environment:
OS: Ubuntu 22.04-LTS
Kernel: 6.5.0-35-generic
Docker: intelanalytics/ipex-llm-serving-xpu:2.2.0-b12-usm
Docker startup commands:
# Raise the CPU minimum frequency to 3.8 GHz and lock both GPUs to 2400 MHz before starting the container.
sudo cpupower frequency-set -d 3.8GHz
sudo xpu-smi config -d 0 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 1 -t 0 --frequencyrange 2400,2400
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:2.2.0-b12-usm
export CONTAINER_NAME=ipex-vllm-2-b12-usm
sudo docker rm -f $CONTAINER_NAME
sudo docker run -itd \
--privileged \
--net=host \
--device=/dev/dri \
--name=$CONTAINER_NAME \
-v /mnt/d/AI/models/vllm:/models/ \
-v /mnt/d/AI/workspace/vllm:/workspace \
--memory="32g" \
--shm-size="16g" \
$DOCKER_IMAGE
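The environment check below was run inside the container; a shell can be opened in it with, e.g. (container name as set above):
# Attach an interactive shell to the running container.
sudo docker exec -it $CONTAINER_NAME bash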
Environment check inside docker:
-----------------------------------------------------------------
PYTHON_VERSION=3.11.11
-----------------------------------------------------------------
transformers=4.48.3
-----------------------------------------------------------------
torch=2.1.0.post2+cxx11.abi
-----------------------------------------------------------------
ipex-llm
DEPRECATION: Loading egg at /usr/local/lib/python3.11/dist-packages/oneccl_bind_pt-2.1.300+xpu-py3.11-linux-x86_64.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
Version: 2.2.0b20250120
-----------------------------------------------------------------
ipex=2.1.30.post0
-----------------------------------------------------------------
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 7 5700X3D 8-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 2
Frequency boost: disabled
CPU max MHz: 5254.6870
CPU min MHz: 2200.0000
-----------------------------------------------------------------
Total CPU Memory: 62.6881 GB
-----------------------------------------------------------------
Operating System:
Ubuntu 22.04.4 LTS
-----------------------------------------------------------------
Linux jzhang-ubuntu 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
env_check.sh: line 148: xpu-smi: command not found
-----------------------------------------------------------------
env_check.sh: line 154: clinfo: command not found
-----------------------------------------------------------------
Driver related package version:
ii intel-level-zero-gpu 1.6.31294.12 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii level-zero-dev 1.14.0-744~22.04 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is not installed. Please install xpu-smi according to README.md
Command to launch qwq-32b-awq:
#!/bin/bash
model="/models/qwq-32b-awq"
served_model_name="test_model"
# oneCCL / SYCL settings for tensor parallelism across the two Arc GPUs
export CCL_WORKER_COUNT=2
export SYCL_CACHE_PERSISTENT=1
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0
export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0
# Load the oneCCL workspace environment shipped in the image
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--block-size 8 \
--gpu-memory-utilization 0.95 \
--device xpu \
--dtype float16 \
--enforce-eager \
--quantization awq \
--load-in-low-bit asym_int4 \
--max-model-len 2000 \
--max-num-batched-tokens 3000 \
--max-num-seqs 256 \
--tensor-parallel-size 2 \
--disable-async-output-proc \
--distributed-executor-backend ray
Startup logs:
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
2025-04-01 10:23:20,260 - INFO - intel_extension_for_pytorch auto imported
INFO 04-01 10:23:21 api_server.py:529] vLLM API server version 0.6.2+ipexllm
INFO 04-01 10:23:21 api_server.py:530] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, load_in_low_bit='asym_int4', model='/models/qwq-32b-awq', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2000, guided_decoding_backend='outlines', distributed_executor_backend='ray', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=8, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=3000, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization='awq', rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='xpu', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['test_model'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=True, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 04-01 10:23:21 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/fd5d5965-4b58-4bd3-8210-956bd04933f2 for IPC Path.
INFO 04-01 10:23:21 api_server.py:180] Started engine process with PID 5003
INFO 04-01 10:23:21 awq_marlin.py:94] Detected that the model can run with awq_marlin, however you specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin for faster inference
WARNING 04-01 10:23:21 config.py:319] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
2025-04-01 10:23:24,838 - INFO - intel_extension_for_pytorch auto imported
INFO 04-01 10:23:26 awq_marlin.py:94] Detected that the model can run with awq_marlin, however you specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin for faster inference
WARNING 04-01 10:23:26 config.py:319] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2025-04-01 10:23:27,583 INFO worker.py:1841 -- Started a local Ray instance.
INFO 04-01 10:23:28 llm_engine.py:226] Initializing an LLM engine (v0.6.2+ipexllm) with config: model='/models/qwq-32b-awq', speculative_config=None, tokenizer='/models/qwq-32b-awq', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=xpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=test_model, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
INFO 04-01 10:23:28 ray_gpu_executor.py:135] use_ray_spmd_worker: False
(pid=5315) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
(pid=5315) warn(
(pid=5315) 2025-04-01 10:23:31,406 - INFO - intel_extension_for_pytorch auto imported
(pid=5313) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
(pid=5313) warn(
observability_config is ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False)
INFO 04-01 10:23:36 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 04-01 10:23:36 selector.py:138] Using IPEX attention backend.
(WrapperWithLoadBit pid=5313) observability_config is ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False)
(WrapperWithLoadBit pid=5313) INFO 04-01 10:23:36 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
(WrapperWithLoadBit pid=5313) INFO 04-01 10:23:36 selector.py:138] Using IPEX attention backend.
INFO 04-01 10:23:36 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7054b8633810>, local_subscribe_port=32875, remote_subscribe_port=None)
INFO 04-01 10:23:36 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 04-01 10:23:36 selector.py:138] Using IPEX attention backend.
(WrapperWithLoadBit pid=5313) INFO 04-01 10:23:36 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
(WrapperWithLoadBit pid=5313) INFO 04-01 10:23:36 selector.py:138] Using IPEX attention backend.
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:01, 3.98it/s]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:00<00:00, 3.14it/s]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:01<00:00, 2.85it/s]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:01<00:00, 2.72it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:01<00:00, 2.77it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:01<00:00, 2.86it/s]
2025-04-01 10:23:38,895 - INFO - Converting the current model to asym_int4 format......
2025-04-01 10:23:38,895 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(WrapperWithLoadBit pid=5313) 2025-04-01 10:23:48,632 - INFO - Converting the current model to asym_int4 format......
(WrapperWithLoadBit pid=5313) 2025-04-01 10:23:48,632 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(pid=5313) 2025-04-01 10:23:35,477 - INFO - intel_extension_for_pytorch auto imported
2025-04-01 10:24:08,856 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2025-04-01 10:24:12,469 - INFO - Loading model weights took 9.1201 GB
(WrapperWithLoadBit pid=5313) 2025-04-01 10:25:23,664 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(WrapperWithLoadBit pid=5313) 2025-04-01 10:25:29,819 - INFO - Loading model weights took 9.1201 GB
2025:04:01-10:25:31:( 5003) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
2025:04:01-10:25:31:( 5003) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
-----> current rank: 0, world size: 2, byte_count: 30720000
(WrapperWithLoadBit pid=5313) 2025:04:01-10:25:31:( 5313) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
(WrapperWithLoadBit pid=5313) 2025:04:01-10:25:31:( 5313) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
(WrapperWithLoadBit pid=5313) -----> current rank: 1, world size: 2, byte_count: 30720000
WARNING 04-01 10:26:02 utils.py:747] Pin memory is not supported on XPU.
INFO 04-01 10:26:02 distributed_gpu_executor.py:57] # GPU blocks: 3573, # CPU blocks: 4096
(WrapperWithLoadBit pid=5313) WARNING 04-01 10:26:02 utils.py:747] Pin memory is not supported on XPU.
INFO 04-01 10:26:04 api_server.py:233] vLLM to use /tmp/tmpwmym9wxm as PROMETHEUS_MULTIPROC_DIR
WARNING 04-01 10:26:04 serving_embedding.py:189] embedding_mode is False. Embedding API will not work.
INFO 04-01 10:26:04 launcher.py:19] Available routes are:
INFO 04-01 10:26:04 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 04-01 10:26:04 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 04-01 10:26:04 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 04-01 10:26:04 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 04-01 10:26:04 launcher.py:27] Route: /health, Methods: GET
INFO 04-01 10:26:04 launcher.py:27] Route: /tokenize, Methods: POST
INFO 04-01 10:26:04 launcher.py:27] Route: /detokenize, Methods: POST
INFO 04-01 10:26:04 launcher.py:27] Route: /v1/models, Methods: GET
INFO 04-01 10:26:04 launcher.py:27] Route: /version, Methods: GET
INFO 04-01 10:26:04 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 04-01 10:26:04 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 04-01 10:26:04 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO: Started server process [4902]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
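Once Uvicorn reports it is listening, a quick probe of the /health route listed above (localhost and port 8000 per the launch flags) can confirm the engine is up:
# Prints the HTTP status code; 200 means the engine is ready.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health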
Test result:
INFO 04-01 10:27:17 engine.py:288] Added request cmpl-f8133b90b62a474ea8b62e657c6b0790-0.
INFO 04-01 10:27:20 metrics.py:351] Avg prompt throughput: 0.6 tokens/s, Avg generation throughput: 1.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 04-01 10:27:25 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 04-01 10:27:30 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 04-01 10:27:35 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 04-01 10:27:40 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 04-01 10:27:45 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
I have locked the kernel to 6.5.0-generic and redeployed the environment by following the steps outlined in the case https://www.intel.cn/content/www/cn/zh/customer-spotlight/cases/keep-the-cost-below-rmb-6w-to-run-deepseek.html. I achieved a speed of 15 t/s with the image intelanalytics/ipex-llm-serving-xpu:2.2.0-b9. That still falls short of the 30 t/s mentioned in #12190; perhaps the difference is due to the model being used. Nevertheless, this performance is acceptable.
Hi, we can achieve around 28.5 tokens per second for throughput with the image intelanalytics/ipex-llm-serving-xpu:2.2.0-b16.
Can you show us the test case you use for the throughput test, so that we can reproduce the issue?
Besides, the first request will be slow due to compilation; subsequent requests will be faster.
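For reference, a minimal single-request throughput check against the server configured above could look like the following; the prompt, max_tokens, and timing approach are illustrative assumptions, not the exact test case requested:
#!/bin/bash
# Warm-up request: the first request triggers compilation and is slow.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "test_model", "prompt": "Hello", "max_tokens": 8}' > /dev/null
# Timed request: tokens/s = completion_tokens / elapsed seconds.
start=$(date +%s.%N)
tokens=$(curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "test_model", "prompt": "Write a short story about a robot.", "max_tokens": 512}' \
  | python3 -c 'import sys, json; print(json.load(sys.stdin)["usage"]["completion_tokens"])')
end=$(date +%s.%N)
elapsed=$(awk "BEGIN{printf \"%.1f\", $end-$start}")
echo "generated $tokens tokens in $elapsed s"
Dividing completion_tokens by the elapsed time gives roughly the single-stream generation rate that the metrics log reports as "Avg generation throughput".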

Have you installed the out-of-tree driver?
Can you show the result of modinfo i915 | grep filename?
The result of modinfo i915 | grep filename is:
filename: /lib/modules/6.5.0-35-generic/updates/dkms/i915.ko
I tried the image intelanalytics/ipex-llm-serving-xpu:2.2.0-b16, but it reported an error. The error message is:
(WrapperWithLoadBit pid=5688) -----> current rank: 1, world size: 2, byte_count: 30720000
(WrapperWithLoadBit pid=5686) INFO 04-02 21:31:00 __init__.py:180] Automatically detected platform xpu.
ERROR 04-02 21:32:52 worker_base.py:469] Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.
ERROR 04-02 21:32:52 worker_base.py:469] Traceback (most recent call last):
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker_base.py", line 461, in execute_method
ERROR 04-02 21:32:52 worker_base.py:469] return executor(*args, **kwargs)
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-02 21:32:52 worker_base.py:469] return func(*args, **kwargs)
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_worker.py", line 106, in determine_num_available_blocks
ERROR 04-02 21:32:52 worker_base.py:469] self.model_runner.profile_run()
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-02 21:32:52 worker_base.py:469] return func(*args, **kwargs)
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_model_runner.py", line 856, in profile_run
ERROR 04-02 21:32:52 worker_base.py:469] self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-02 21:32:52 worker_base.py:469] return func(*args, **kwargs)
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_model_runner.py", line 966, in execute_model
ERROR 04-02 21:32:52 worker_base.py:469] hidden_or_intermediate_states = model_executable(
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-02 21:32:52 worker_base.py:469] return self._call_impl(*args, **kwargs)
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-02 21:32:52 worker_base.py:469] return forward_call(*args, **kwargs)
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 477, in forward
ERROR 04-02 21:32:52 worker_base.py:469] hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/compilation/decorators.py", line 168, in __call__
ERROR 04-02 21:32:52 worker_base.py:469] return self.forward(*args, **kwargs)
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 340, in forward
ERROR 04-02 21:32:52 worker_base.py:469] hidden_states, residual = layer(
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-02 21:32:52 worker_base.py:469] return self._call_impl(*args, **kwargs)
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-02 21:32:52 worker_base.py:469] return forward_call(*args, **kwargs)
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 247, in forward
ERROR 04-02 21:32:52 worker_base.py:469] hidden_states = self.self_attn(
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-02 21:32:52 worker_base.py:469] return self._call_impl(*args, **kwargs)
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-02 21:32:52 worker_base.py:469] return forward_call(*args, **kwargs)
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 176, in forward
ERROR 04-02 21:32:52 worker_base.py:469] attn_output = self.attn(q,
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-02 21:32:52 worker_base.py:469] return self._call_impl(*args, **kwargs)
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-02 21:32:52 worker_base.py:469] return forward_call(*args, **kwargs)
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/attention/layer.py", line 134, in forward
ERROR 04-02 21:32:52 worker_base.py:469] return self.impl.forward(query,
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/attention/backends/ipex_attn.py", line 449, in forward
ERROR 04-02 21:32:52 worker_base.py:469] sub_out = xe_addons.sdp_causal(
ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-02 21:32:52 worker_base.py:469] RuntimeError: The program was built for 1 devices
ERROR 04-02 21:32:52 worker_base.py:469] Build program log for 'Intel(R) Arc(TM) A770 Graphics':
ERROR 04-02 21:32:52 worker_base.py:469]
2025-04-02 21:32:52,142 - ERROR - The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A770 Graphics':
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 234, in run_mp_engine
engine = IPEXLLMMQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 221, in from_engine_args
return super().from_engine_args(engine_args, usage_context, ipc_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
return cls(ipc_path=ipc_path,
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 276, in __init__
self._initialize_kv_caches()
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 416, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
num_blocks = self._run_workers("determine_num_available_blocks", )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/ray_gpu_executor.py", line 516, in _run_workers
self.driver_worker.execute_method(method, *driver_args,
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker_base.py", line 470, in execute_method
raise e
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker_base.py", line 461, in execute_method
return executor(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_worker.py", line 106, in determine_num_available_blocks
self.model_runner.profile_run()
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_model_runner.py", line 856, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_model_runner.py", line 966, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 477, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/compilation/decorators.py", line 168, in __call__
return self.forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 340, in forward
hidden_states, residual = layer(
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 247, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 176, in forward
attn_output = self.attn(q,
^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/attention/layer.py", line 134, in forward
return self.impl.forward(query,
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/attention/backends/ipex_attn.py", line 449, in forward
sub_out = xe_addons.sdp_causal(
^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A770 Graphics':
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 242, in run_mp_engine
raise e # noqa
^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 234, in run_mp_engine
engine = IPEXLLMMQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 221, in from_engine_args
return super().from_engine_args(engine_args, usage_context, ipc_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
return cls(ipc_path=ipc_path,
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 276, in __init__
self._initialize_kv_caches()
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 416, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
num_blocks = self._run_workers("determine_num_available_blocks", )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/ray_gpu_executor.py", line 516, in _run_workers
self.driver_worker.execute_method(method, *driver_args,
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker_base.py", line 470, in execute_method
raise e
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker_base.py", line 461, in execute_method
return executor(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_worker.py", line 106, in determine_num_available_blocks
self.model_runner.profile_run()
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_model_runner.py", line 856, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_model_runner.py", line 966, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 477, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/compilation/decorators.py", line 168, in __call__
return self.forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 340, in forward
hidden_states, residual = layer(
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 247, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 176, in forward
attn_output = self.attn(q,
^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/attention/layer.py", line 134, in forward
return self.impl.forward(query,
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/attention/backends/ipex_attn.py", line 449, in forward
sub_out = xe_addons.sdp_causal(
^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A770 Graphics':
2025-04-02 21:32:52,166 ERROR worker.py:420 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WrapperWithLoadBit.execute_method() (pid=5688, ip=192.168.3.21, actor_id=068fbb554597228a33c4ea4b01000000, repr=<ipex_llm.vllm.xpu.ipex_llm_wrapper.WrapperWithLoadBit object at 0x71050befac10>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker_base.py", line 470, in execute_method
raise e
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker_base.py", line 461, in execute_method
return executor(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_worker.py", line 106, in determine_num_available_blocks
self.model_runner.profile_run()
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_model_runner.py", line 856, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_model_runner.py", line 966, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 477, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/compilation/decorators.py", line 168, in __call__
return self.forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 340, in forward
hidden_states, residual = layer(
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 247, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 176, in forward
attn_output = self.attn(q,
^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/attention/layer.py", line 134, in forward
return self.impl.forward(query,
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/attention/backends/ipex_attn.py", line 449, in forward
sub_out = xe_addons.sdp_causal(
^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A770 Graphics':
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] Traceback (most recent call last):
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker_base.py", line 461, in execute_method
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] return executor(*args, **kwargs)
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] return func(*args, **kwargs)
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_worker.py", line 106, in determine_num_available_blocks
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] self.model_runner.profile_run()
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] return func(*args, **kwargs)
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_model_runner.py", line 856, in profile_run
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] self.execute_model(model_input, kv_caches, intermediate_tensors)
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] return func(*args, **kwargs)
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/worker/xpu_model_runner.py", line 966, in execute_model
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] hidden_or_intermediate_states = model_executable(
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] return self._call_impl(*args, **kwargs)
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] return forward_call(*args, **kwargs)
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 477, in forward
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] hidden_states = self.model(input_ids, positions, kv_caches,
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/compilation/decorators.py", line 168, in __call__
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] return self.forward(*args, **kwargs)
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 340, in forward
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] hidden_states, residual = layer(
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] return self._call_impl(*args, **kwargs)
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] return forward_call(*args, **kwargs)
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 247, in forward
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] hidden_states = self.self_attn(
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] return self._call_impl(*args, **kwargs)
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] return forward_call(*args, **kwargs)
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/qwen2.py", line 176, in forward
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] attn_output = self.attn(q,
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] return self._call_impl(*args, **kwargs)
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] return forward_call(*args, **kwargs)
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/attention/layer.py", line 134, in forward
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] return self.impl.forward(query,
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] File "/usr/local/lib/python3.11/dist-packages/vllm/attention/backends/ipex_attn.py", line 449, in forward
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] sub_out = xe_addons.sdp_causal(
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] ^^^^^^^^^^^^^^^^^^^^^
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] RuntimeError: The program was built for 1 devices
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469] Build program log for 'Intel(R) Arc(TM) A770 Graphics':
(WrapperWithLoadBit pid=5688) ERROR 04-02 21:32:52 worker_base.py:469]
2025-04-02 21:34:12,588 - ERROR - Task exception was never retrieved
future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/zmq/_future.py", line 372, in poll
raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 906, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.11/dist-packages/uvloop/__init__.py", line 105, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.11/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 865, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 123, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 221, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
Could you please point me to the documentation for this version of the image?
You can find the documentation for the image here: https://github.com/intel/ipex-llm/tree/main/docker/llm/serving/xpu/docker
This problem is weird; it seems to be related to a compilation problem. I cannot reproduce the issue on our devices.
I am beginning to think that this may be related to the hardware configuration.
Have you installed the out-of-tree driver?
Can you show the result of
modinfo i915 | grep filename?
This is the correct driver version, so the driver is not the issue.
You can find the documentation for the image here: https://github.com/intel/ipex-llm/tree/main/docker/llm/serving/xpu/docker
This problem is weird; it seems to be related to a compilation problem. I cannot reproduce the issue on our devices.
I am beginning to think that this may be related to the hardware configuration.
I built the Docker image following the linked documentation, and after starting the service an error was reported. The error content is:
Starting service with model: /models/deepseek-r1__1_5b
Served model name: deepseek-r1:1.5b
Tensor parallel size: 1
INFO 04-10 19:01:12 __init__.py:180] Automatically detected platform xpu.
WARNING 04-10 19:01:12 api_server.py:893] Warning: Please use `ipex_llm.vllm.xpu.entrypoints.openai.api_server` instead of `vllm.entrypoints.openai.api_server` to start the API server
INFO 04-10 19:01:12 api_server.py:837] vLLM API server version 0.6.6+ipexllm
INFO 04-10 19:01:12 api_server.py:838] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/models/deepseek-r1__1_5b', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend='ray', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=8, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=3000, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='xpu', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['deepseek-r1:1.5b'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=True, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, low_bit_model_path=None, low_bit_save_path=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, load_in_low_bit='fp8')
INFO 04-10 19:01:12 api_server.py:197] Started engine process with PID 279
WARNING 04-10 19:01:12 config.py:2289] Casting torch.bfloat16 to torch.float16.
INFO 04-10 19:01:16 __init__.py:180] Automatically detected platform xpu.
INFO 04-10 19:01:16 config.py:521] This model supports multiple tasks: {'reward', 'embed', 'classify', 'score', 'generate'}. Defaulting to 'generate'.
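Two details in the logged args above are worth flagging before the crash below: this particular launch points at model='/models/deepseek-r1__1_5b' with tensor_parallel_size=1, so it exercises a single card rather than the two-A770 qwq-32b-awq setup this issue is about. For the dual-GPU case the relevant flags would look roughly like this sketch (paths, served name, and the low-bit value are placeholders taken from this log, not a verified configuration):

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --model /models/qwq-32b-awq \
  --served-model-name qwq-32b-awq \
  --device xpu \
  --tensor-parallel-size 2 \
  --load-in-low-bit fp8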
2025-04-10 19:01:16,838 - ERROR - cannot import name 'intel' from 'triton._C.libtriton' (/usr/local/lib/python3.11/dist-packages/triton/_C/libtriton.so)
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 234, in run_mp_engine
engine = IPEXLLMMQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 220, in from_engine_args
_ipex_llm_convert(load_in_low_bit)
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/model_convert.py", line 71, in _ipex_llm_convert
from vllm.v1.worker.gpu_model_runner import GPUModelRunner
File "/usr/local/lib/python3.11/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 20, in <module>
from vllm.v1.attention.backends.flash_attn import (FlashAttentionBackend,
File "/usr/local/lib/python3.11/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 7, in <module>
import triton
File "/usr/local/lib/python3.11/dist-packages/triton/__init__.py", line 8, in <module>
from .runtime import (
File "/usr/local/lib/python3.11/dist-packages/triton/runtime/__init__.py", line 1, in <module>
from .autotuner import (Autotuner, Config, Heuristics, autotune, heuristics)
File "/usr/local/lib/python3.11/dist-packages/triton/runtime/autotuner.py", line 9, in <module>
from .jit import KernelInterface
File "/usr/local/lib/python3.11/dist-packages/triton/runtime/jit.py", line 12, in <module>
from ..runtime.driver import driver
File "/usr/local/lib/python3.11/dist-packages/triton/runtime/driver.py", line 1, in <module>
from ..backends import backends
File "/usr/local/lib/python3.11/dist-packages/triton/backends/__init__.py", line 50, in <module>
backends = _discover_backends()
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/triton/backends/__init__.py", line 43, in _discover_backends
compiler = _load_module(name, os.path.join(root, name, 'compiler.py'))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/triton/backends/__init__.py", line 12, in _load_module
spec.loader.exec_module(module)
File "/usr/local/lib/python3.11/dist-packages/triton/backends/intel/compiler.py", line 2, in <module>
from triton._C.libtriton import ir, passes, llvm, intel
ImportError: cannot import name 'intel' from 'triton._C.libtriton' (/usr/local/lib/python3.11/dist-packages/triton/_C/libtriton.so)
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 242, in run_mp_engine
raise e # noqa
^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 234, in run_mp_engine
engine = IPEXLLMMQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 220, in from_engine_args
_ipex_llm_convert(load_in_low_bit)
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/model_convert.py", line 71, in _ipex_llm_convert
from vllm.v1.worker.gpu_model_runner import GPUModelRunner
File "/usr/local/lib/python3.11/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 20, in <module>
from vllm.v1.attention.backends.flash_attn import (FlashAttentionBackend,
File "/usr/local/lib/python3.11/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 7, in <module>
import triton
File "/usr/local/lib/python3.11/dist-packages/triton/__init__.py", line 8, in <module>
from .runtime import (
File "/usr/local/lib/python3.11/dist-packages/triton/runtime/__init__.py", line 1, in <module>
from .autotuner import (Autotuner, Config, Heuristics, autotune, heuristics)
File "/usr/local/lib/python3.11/dist-packages/triton/runtime/autotuner.py", line 9, in <module>
from .jit import KernelInterface
File "/usr/local/lib/python3.11/dist-packages/triton/runtime/jit.py", line 12, in <module>
from ..runtime.driver import driver
File "/usr/local/lib/python3.11/dist-packages/triton/runtime/driver.py", line 1, in <module>
from ..backends import backends
File "/usr/local/lib/python3.11/dist-packages/triton/backends/__init__.py", line 50, in <module>
backends = _discover_backends()
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/triton/backends/__init__.py", line 43, in _discover_backends
compiler = _load_module(name, os.path.join(root, name, 'compiler.py'))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/triton/backends/__init__.py", line 12, in _load_module
spec.loader.exec_module(module)
File "/usr/local/lib/python3.11/dist-packages/triton/backends/intel/compiler.py", line 2, in <module>
from triton._C.libtriton import ir, passes, llvm, intel
ImportError: cannot import name 'intel' from 'triton._C.libtriton' (/usr/local/lib/python3.11/dist-packages/triton/_C/libtriton.so)
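The ImportError itself points at a triton wheel built without the Intel backend: triton/backends/intel/compiler.py is present on disk, but the compiled libtriton.so it imports from exposes no 'intel' symbol, so backend discovery dies the moment vLLM's v1 flash-attn path does "import triton". A minimal diagnostic/repair sketch, assuming a generic triton wheel is shadowing the XPU build (the pytorch-triton-xpu package is the XPU triton build used by torch xpu wheels; the exact version must be matched to the installed torch, so verify before running):

# show which triton wheels are installed, without importing triton
# (importing it crashes, as the traceback above shows)
pip show triton pytorch-triton-xpu

# assumed fix: drop the generic wheel and install the XPU build
pip uninstall -y triton
pip install pytorch-triton-xpu   # pin to the version matching torch 2.6.0+xpu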
The command sycl-ls returns:
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) A770 Graphics 12.55.8 [1.6.32224.500000]
[level_zero:gpu][level_zero:1] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) A770 Graphics 12.55.8 [1.6.32224.500000]
[opencl:cpu][opencl:0] Intel(R) OpenCL, AMD Ryzen 7 5700X3D 8-Core Processor OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [24.52.32224.5]
[opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [24.52.32224.5]
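For reference, sycl-ls here enumerates both A770s on Level-Zero plus CPU and GPU OpenCL entries. If anything ever picks up the OpenCL devices by mistake, the runtime can be pinned to the two Level-Zero GPUs with the standard SYCL device-selector variable — a sketch using the indices from the listing above:

export ONEAPI_DEVICE_SELECTOR=level_zero:0,1
sycl-ls   # should now report only the two A770 Level-Zero devices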
Output of env_check.sh:
-----------------------------------------------------------------
PYTHON_VERSION=3.11.12
-----------------------------------------------------------------
transformers=4.51.1
-----------------------------------------------------------------
torch=2.6.0+xpu
-----------------------------------------------------------------
ipex-llm Version: 2.3.0b20250407
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 7 5700X3D 8-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 2
Frequency boost: disabled
CPU max MHz: 5254.6870
CPU min MHz: 2200.0000
-----------------------------------------------------------------
Total CPU Memory: 62.6881 GB
-----------------------------------------------------------------
Operating System:
Ubuntu 22.04.5 LTS
-----------------------------------------------------------------
Linux jzhang-ubuntu 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
env_check.sh: line 148: xpu-smi: command not found
-----------------------------------------------------------------
env_check.sh: line 154: clinfo: command not found
-----------------------------------------------------------------
Driver related package version:
ii intel-level-zero-gpu 1.6.32224.5 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii level-zero-dev 1.16.15-881~22.04 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is not installed. Please install xpu-smi according to README.md
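One note on the "IPEX is not installed" line above: with torch 2.6.0+xpu, XPU support ships inside PyTorch itself, so a missing standalone IPEX is expected here rather than an error. A quick sanity check that both cards are visible to this torch build (a sketch, assuming the stock torch.xpu API in 2.6):

python -c "import torch; print(torch.xpu.is_available(), torch.xpu.device_count())"
# expected output: True 2 when both A770s are visible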
Is there something wrong with my CPU?