intel / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, DeepSeek, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, DeepSpeed, Axolotl, etc.

Repository from GitHub: https://github.com/intel/ipex-llm

vllm on tensor parallel - RuntimeError: oneCCL: ze_fd_manager.cpp:144 init_device_fds: EXCEPTION: opendir failed: could not open device directory

flekol opened this issue · comments

Describe the bug
While using vLLM in Docker with tensor parallelism, I get:
RuntimeError: oneCCL: ze_fd_manager.cpp:144 init_device_fds: EXCEPTION: opendir failed: could not open device directory

Single-card serving works.

llama.cpp also works with multiple GPUs.

I remember trying this on an Intel system with 2 GPUs some months ago, and it worked.
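Since the exception comes from oneCCL failing to opendir() the GPU device directory, a quick way to see what a process inside the container can (or cannot) see is a small check along these lines. The /dev/dri path and the error wording mirror the oneCCL log; the helper itself is only an illustration, not oneCCL's actual code:

```python
import os

def check_device_dir(path="/dev/dri"):
    """Roughly mimic oneCCL's init_device_fds: enumerate the DRM device nodes.

    Raises RuntimeError with a message shaped like the oneCCL one when the
    directory cannot be opened (missing, or hidden by container isolation).
    """
    try:
        return sorted(os.listdir(path))
    except OSError as exc:
        raise RuntimeError(
            f"opendir failed: could not open device directory: {path}"
        ) from exc
```

Run inside the container, this should list entries like card0 and renderD128; if it raises, oneCCL will fail the same way.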

How to reproduce
Steps to reproduce the error:

  1. Use multiple GPUs
  2. Try to run the server

Environment information
I'm building my own Docker image based on this:

FROM intelanalytics/ipex-llm-serving-xpu:latest
WORKDIR /temp

SHELL ["/bin/bash", "-c"] 
RUN apt update && apt install -y libpng16-16
RUN wget http://mirrors.kernel.org/ubuntu/pool/main/libj/libjpeg-turbo/libjpeg-turbo8_2.1.2-0ubuntu1_amd64.deb  
RUN apt install -y ./libjpeg-turbo8_2.1.2-0ubuntu1_amd64.deb

WORKDIR /llm
RUN . /opt/intel/1ccl-wks/setvars.sh
ENTRYPOINT python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name ${served_model_name} \
  --quantization $quantization \
  --model $model \
  --port $port \
  --trust-remote-code \
  --block-size 8 \
  --gpu-memory-utilization ${gpu_memory_utilization} \
  --device xpu \
  --dtype $dtype \
  --enforce-eager \
  --load-in-low-bit ${load_in_low_bit} \
  --max-model-len ${max_model_len} \
  --max-num-batched-tokens ${max_num_batched_tokens} \
  --max-num-seqs ${max_num_seqs} \
  --tensor-parallel-size ${tensor_parallel_size} \
  --disable-async-output-proc \
  --distributed-executor-backend ray
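One thing worth double-checking in the Dockerfile above: a `RUN . /opt/intel/1ccl-wks/setvars.sh` line only affects that single build-time shell, so none of the variables it sets survive into the running container. A sketch of sourcing it at startup instead (same path as above, flags abbreviated; assuming the base image does not already do this in its own entrypoint):

```dockerfile
# Shell-form ENTRYPOINT (runs under the bash SHELL set earlier): source
# setvars.sh in the same shell that launches the server, so its environment
# is actually present at runtime.
ENTRYPOINT . /opt/intel/1ccl-wks/setvars.sh && \
    python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
    --served-model-name ${served_model_name} \
    --tensor-parallel-size ${tensor_parallel_size} \
    --distributed-executor-backend ray
```

The remaining flags are elided here; they stay exactly as in the full ENTRYPOINT above.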

The docker-compose file looks like below:

services:
  vllm-ipex:
    image: intelanalytics/ipex-llm-serving-xpu:latest
    container_name: vllm-ipex
    build:
      dockerfile: ./dockerfile/dockerfile
    volumes:
      - "/models/huggingface:/root/.cache/huggingface"
      - "/models/vllm:/llm/models"
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    # restart: unless-stopped
    devices:
      - /dev/dri:/dev/dri
    tty: true
    ports:
    - 8000:8000
    shm_size: "64g"
    environment:
      - model=Qwen/Qwen2.5-32B-Instruct-AWQ
      - served_model_name=Qwen2.5-32B-Instruct-AWQ
      - quantization=awq
      - TZ=Europe/Berlin
      - SYCL_CACHE_PERSISTENT=1
      - CCL_WORKER_COUNT=2
      - FI_PROVIDER=shm
      - CCL_ATL_TRANSPORT=ofi
      - CCL_ZE_IPC_EXCHANGE=sockets
      - CCL_ATL_SHM=1
      - USE_XETLA=OFF
      - SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
      - TORCH_LLM_ALLREDUCE=0
      - CCL_SAME_STREAM=1
      - CCL_BLOCKING_WAIT=0
      - port=8000
      - gpu_memory_utilization=0.95
      - dtype=float16  
      - load_in_low_bit=asym_int4
      - max_model_len=2048
      - max_num_batched_tokens=4000
      - max_num_seqs=256
      - tensor_parallel_size=2
      - pipeline_parallel_size=1
      - VLLM_LOGGING_LEVEL=DEBUG
      - VLLM_TRACE_FUNCTION=1

System: Ubuntu 24.04
CPU: EPYC 7282
MB: Supermicro h12ssl
RAM: 256GB
GPUs: 4x Arc A770 LE

Logs

dmesg.txt
log.txt

Hi, can you try adding privileged: true to the docker compose file and see if this error persists?
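For anyone else hitting this, the suggested change is one extra key on the service in the compose file above. Note that privileged mode gives the container broad access to the host, so treat it as a diagnostic step rather than a hardened fix:

```yaml
services:
  vllm-ipex:
    # Grants the container full access to host devices, including the
    # /dev/dri directory that oneCCL's ze_fd_manager enumerates.
    privileged: true
```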

Thanks a lot.

It works! Now I feel stupid that I didn't come up with this on my own :)