intel / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, DeepSeek, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, DeepSpeed, Axolotl, etc.

Repository from GitHub: https://github.com/intel/ipex-llm

ipex-llm vLLM does not support glm-edge-4b-chat model

junruizh2021 opened this issue

Describe the bug
Using the intelanalytics/ipex-llm-serving-xpu:2.2.0-b15 Docker container, I started vLLM inference following the README in python/llm/example/GPU/vLLM-Serving and encountered the following error:

```
****************************Usage Error************************
Currently, ipex-vllm does not support linear layers with skip_bias_add argument
2025-03-24 21:45:11,347 - ERROR -
```

How to reproduce
Steps to reproduce the error:

  1. vllm docker image: intelanalytics/ipex-llm-serving-xpu:2.2.0-b15
  2. huggingface model: THUDM/glm-edge-4b-chat
  3. Start the vLLM service as described at https://github.com/intel/ipex-llm/tree/main/python/llm/example/GPU/vLLM-Serving#service (see the reconstructed launch command below)
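
For reference, the launch parameters can be read back from the `args: Namespace(...)` line in the log below. The following is a minimal sketch reconstructed from those values; treat the exact flags as an assumption, since the actual `/usr/bin/vllm-server.sh` shipped in the image may differ:

```shell
# Sketch of the serving command implied by the logged arguments;
# values mirror the Namespace dump further down in this report.
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --model /home/vllm/AI-models/LLM/hf-model/glm-edge-4b-chat \
  --served-model-name glm-edge-4b-chat \
  --port 8000 \
  --device xpu \
  --dtype float16 \
  --load-in-low-bit fp16 \
  --enforce-eager \
  --trust-remote-code \
  --gpu-memory-utilization 0.75 \
  --max-model-len 4096 \
  --max-num-batched-tokens 10240 \
  --max-num-seqs 12 \
  --tensor-parallel-size 1
```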

Screenshots

```shell
root@vllm:/home/vllm# bash /usr/bin/vllm-server.sh

:: WARNING: setvars.sh has already been run. Skipping re-execution.
To force a re-execution of setvars.sh, use the '--force' option.
Using '--force' can result in excessive use of your environment variables.

usage: source setvars.sh [--force] [--config=file] [--help] [...]
--force Force setvars.sh to re-run, doing so may overload environment.
--config=file Customize env vars using a setvars.sh configuration file.
--help Display this help message and exit.
... Additional args are passed to individual env/vars.sh scripts
and should follow this script's arguments.

Some POSIX shells do not accept command-line options. In that case, you can pass
command-line options via the SETVARS_ARGS environment variable. For example:

$ SETVARS_ARGS="--config=config.txt" ; export SETVARS_ARGS
$ . path/to/setvars.sh

The SETVARS_ARGS environment variable is cleared on exiting setvars.sh.

The oneAPI toolkits no longer support 32-bit libraries, starting with the 2025.0 toolkit release. See the oneAPI release notes for more details.

INFO 03-24 21:44:58 __init__.py:180] Automatically detected platform xpu.
WARNING 03-24 21:44:59 api_server.py:893] Warning: Please use ipex_llm.vllm.xpu.entrypoints.openai.api_server instead of vllm.entrypoints.openai.api_server to start the API server
INFO 03-24 21:44:59 api_server.py:837] vLLM API server version 0.6.6+ipexllm
INFO 03-24 21:44:59 api_server.py:838] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/home/vllm/AI-models/LLM/hf-model/glm-edge-4b-chat', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=8, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.75, num_gpu_blocks_override=None, max_num_batched_tokens=10240, max_num_seqs=12, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='xpu', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['glm-edge-4b-chat'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=True, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, low_bit_model_path=None, low_bit_save_path=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, load_in_low_bit='fp16')
WARNING 03-24 21:44:59 utils.py:1920] Found ulimit of 32768 and failed to automatically increase with error current limit exceeds maximum limit. This can cause fd limit errors like OSError: [Errno 24] Too many open files. Consider increasing with ulimit -n
INFO 03-24 21:44:59 api_server.py:197] Started engine process with PID 531
WARNING 03-24 21:44:59 config.py:2289] Casting torch.bfloat16 to torch.float16.
INFO 03-24 21:45:03 __init__.py:180] Automatically detected platform xpu.
WARNING 03-24 21:45:04 config.py:2289] Casting torch.bfloat16 to torch.float16.
INFO 03-24 21:45:04 config.py:521] This model supports multiple tasks: {'generate', 'reward', 'embed', 'score', 'classify'}. Defaulting to 'generate'.
INFO 03-24 21:45:09 config.py:521] This model supports multiple tasks: {'reward', 'embed', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
INFO 03-24 21:45:09 llm_engine.py:234] Initializing an LLM engine (v0.6.6+ipexllm) with config: model='/home/vllm/AI-models/LLM/hf-model/glm-edge-4b-chat', speculative_config=None, tokenizer='/home/vllm/AI-models/LLM/hf-model/glm-edge-4b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=xpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=glm-edge-4b-chat, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
INFO 03-24 21:45:09 xpu.py:27] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 03-24 21:45:09 selector.py:155] Using IPEX attention backend.
WARNING 03-24 21:45:09 _ipex_ops.py:12] Import error msg: No module named 'intel_extension_for_pytorch'
INFO 03-24 21:45:09 importing.py:14] Triton not installed or not compatible; certain GPU-related functions will not be available.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.09it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.21it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.19it/s]

2025-03-24 21:45:11,070 - INFO - Converting the current model to fp16 format......
2025-03-24 21:45:11,071 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2025-03-24 21:45:11,347 - ERROR -

****Usage Error
Currently, ipex-vllm does not support linear layers with skip_bias_add argument
2025-03-24 21:45:11,347 - ERROR -

***Call Stack
2025-03-24 21:45:11,491 - ERROR - Currently, ipex-vllm does not support linear layers with skip_bias_add argument
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 234, in run_mp_engine
engine = IPEXLLMMQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 221, in from_engine_args
return super().from_engine_args(engine_args, usage_context, ipc_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
return cls(ipc_path=ipc_path,
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 71, in init
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 273, in init
self.model_executor = executor_class(vllm_config=vllm_config, )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/executor_base.py", line 36, in init
self._init_executor()
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/xpu_executor.py", line 22, in _init_executor
GPUExecutor._init_executor(self)
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/gpu_executor.py", line 35, in _init_executor
self.driver_worker.load_model()
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 155, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/model_convert.py", line 122, in _ipex_llm_load_model
optimize_model(self.model,
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/optimize.py", line 254, in optimize_model
model = ggml_convert_low_bit(model,
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/convert.py", line 1123, in ggml_convert_low_bit
model, has_been_replaced = _replace_with_low_bit_linear(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/convert.py", line 732, in _replace_with_low_bit_linear
_, _flag = _replace_with_low_bit_linear(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/convert.py", line 732, in _replace_with_low_bit_linear
_, _flag = _replace_with_low_bit_linear(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/convert.py", line 732, in _replace_with_low_bit_linear
_, _flag = _replace_with_low_bit_linear(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Previous line repeated 1 more time]
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/convert.py", line 486, in _replace_with_low_bit_linear
is_linear, linear_args = is_linear_module(module)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/convert.py", line 195, in is_linear_module
invalidInputError(module.skip_bias_add is not True, "Currently, ipex-vllm does not"
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/utils/common/log4Error.py", line 32, in invalidInputError
raise RuntimeError(errMsg)
RuntimeError: Currently, ipex-vllm does not support linear layers with skip_bias_add argument
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 242, in run_mp_engine
raise e # noqa
^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 234, in run_mp_engine
engine = IPEXLLMMQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 221, in from_engine_args
return super().from_engine_args(engine_args, usage_context, ipc_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
return cls(ipc_path=ipc_path,
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 71, in init
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 273, in init
self.model_executor = executor_class(vllm_config=vllm_config, )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/executor_base.py", line 36, in init
self._init_executor()
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/xpu_executor.py", line 22, in _init_executor
GPUExecutor._init_executor(self)
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/gpu_executor.py", line 35, in _init_executor
self.driver_worker.load_model()
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 155, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/model_convert.py", line 122, in _ipex_llm_load_model
optimize_model(self.model,
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/optimize.py", line 254, in optimize_model
model = ggml_convert_low_bit(model,
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/convert.py", line 1123, in ggml_convert_low_bit
model, has_been_replaced = _replace_with_low_bit_linear(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/convert.py", line 732, in _replace_with_low_bit_linear
_, _flag = _replace_with_low_bit_linear(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/convert.py", line 732, in _replace_with_low_bit_linear
_, _flag = _replace_with_low_bit_linear(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/convert.py", line 732, in _replace_with_low_bit_linear
_, _flag = _replace_with_low_bit_linear(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Previous line repeated 1 more time]
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/convert.py", line 486, in _replace_with_low_bit_linear
is_linear, linear_args = is_linear_module(module)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/convert.py", line 195, in is_linear_module
invalidInputError(module.skip_bias_add is not True, "Currently, ipex-vllm does not"
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/utils/common/log4Error.py", line 32, in invalidInputError
raise RuntimeError(errMsg)
RuntimeError: Currently, ipex-vllm does not support linear layers with skip_bias_add argument
^CTraceback (most recent call last):
File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.11/dist-packages/uvloop/init.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 865, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.11/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 123, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.11/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 216, in build_async_engine_client_from_engine_args
await mq_engine_client.setup()
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/client.py", line 262, in setup
response = await self._wait_for_server_rpc(socket)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/client.py", line 369, in _wait_for_server_rpc
return await self._send_get_data_rpc_request(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/client.py", line 297, in _send_get_data_rpc_request
if await socket.poll(timeout=VLLM_RPC_TIMEOUT) == 0:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 906, in
uvloop.run(run_server(args))
File "/usr/local/lib/python3.11/dist-packages/uvloop/init.py", line 105, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/asyncio/runners.py", line 123, in run
raise KeyboardInterrupt()
KeyboardInterrupt
^C
```

Environment information
docker image: intelanalytics/ipex-llm-serving-xpu:2.2.0-b15
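
As a rough sketch of the environment setup (the mount path, container name, and resource limits below are placeholders, not values taken from the original report), the container is typically started along these lines:

```shell
# Hypothetical container launch; adjust paths and limits to your host.
docker run -itd \
  --net=host \
  --device=/dev/dri \
  --shm-size=16g \
  -v /path/to/AI-models:/home/vllm/AI-models \
  --name=vllm \
  intelanalytics/ipex-llm-serving-xpu:2.2.0-b15
```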

Additional context

Hi, I will investigate this issue. Once we have found the root cause, we will get back to you.

Hi, this should have been fixed by #13007.
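
Once a build containing that fix is available, a quick smoke test is to query the OpenAI-compatible endpoint. This sketch assumes the server is listening on the default port 8000 and serving the model under the name glm-edge-4b-chat, as in the log above:

```shell
# Hypothetical verification request against the running server.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "glm-edge-4b-chat",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 64
      }'
```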