vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Home Page: https://docs.vllm.ai

[Usage]: distributed inference with kuberay

hetian127 opened this issue · comments

Your current environment

KubeRay, vLLM 0.4.0
2 L40 GPU servers, each with 8 L40 GPUs and 2 x 200G ConnectX-6 (CX6) InfiniBand cards

How would you like to use vllm

I plan to use KubeRay to run multi-node distributed inference on the vLLM framework. In my current environment the GPU server nodes are interconnected with an InfiniBand network. How can I get RDMA working between the nodes?
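For context, this is the kind of NCCL configuration I understand is normally used to push the tensor-parallel traffic onto InfiniBand. It is only a sketch; the HCA and interface names are placeholders for whatever the host actually reports:

# Sketch: NCCL settings commonly used to force/verify InfiniBand usage.
# Device/interface names are placeholders; check "ibstat" / "ip link" on the host.
export NCCL_IB_DISABLE=0          # allow the IB transport (0 is already the default)
export NCCL_IB_HCA=mlx5_0,mlx5_1  # restrict NCCL to these HCAs
export NCCL_SOCKET_IFNAME=eth0    # interface for NCCL bootstrap / out-of-band traffic
export NCCL_DEBUG=INFO            # print which transport NCCL actually selects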

What type of distributed inference do you plan to do? Is it model parallel or data parallel?

I just want to run online API serving with an LLM such as Qwen1.5-110B-Chat.
My main steps are as follows:
1. I built a Docker image that includes the OFED driver; "ibstat" inside it can show my 200G InfiniBand card.
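As a rough sanity check, these are the sorts of commands I run inside the container to confirm the fabric is visible (device names depend on the host):

ibstat                # port state should be "Active" and the rate should match the 200G link
ibv_devinfo           # verbs-level view of the HCAs
ls /dev/infiniband    # the RDMA device files have to be visible inside the container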

2. I created a YAML manifest like the following:

rayClusterConfig:
  rayVersion: '2.9.0' # should match the Ray version in the image of the containers
  ######################headGroupSpecs#################################
  # Ray head pod template.
  headGroupSpec:
    # The rayStartParams are used to configure the ray start command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of rayStartParams in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in rayStartParams.
    rayStartParams:
      dashboard-host: '0.0.0.0'
    # pod template
    template:
      spec:
        containers:
          - name: ray-head
            image: repo:5000/harbor/rayvllm:v3
            resources:
              limits:
                nvidia.com/gpu: 8
                cpu: 8
                memory: 64Gi
              requests:
                nvidia.com/gpu: 8
                cpu: 8
                memory: 64Gi
            volumeMounts:
              - name: share
                mountPath: "/share"
              - name: shm
                mountPath: "/dev/shm"
            ports:
              - containerPort: 6379
                name: gcs-server
              - containerPort: 8265 # Ray dashboard
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
            env:
              - name: USE_RDMA
                value: "true"
        volumes:
          - name: share
            hostPath:
              path: "/share"
              type: Directory
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "64Gi"
  workerGroupSpecs:
    # the pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 5
      # logical group name, for this called small-group, also can be functional
      groupName: small-group
      # The rayStartParams are used to configure the ray start command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of rayStartParams in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in rayStartParams.
      rayStartParams: {}
      # pod template
      template:
        spec:
          containers:
            - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
              image: repo:5000/harbor/rayvllm:v3
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              resources:
                limits:
                  nvidia.com/gpu: 8
                  cpu: "8"
                  memory: "64Gi"
                requests:
                  nvidia.com/gpu: 8
                  cpu: "8"
                  memory: "64Gi"
              volumeMounts:
                - name: share
                  mountPath: "/share"
                - name: shm
                  mountPath: "/dev/shm"
              env:
                - name: USE_RDMA
                  value: "true"
          volumes:
            - name: share
              hostPath:
                path: "/share"
                type: Directory
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "64Gi"
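After applying the manifest I check that both pods come up and that the HCA is still visible from inside them. This is only a sketch: the file and pod names are placeholders, and depending on the cluster setup the pods may additionally need hostNetwork/privileged mode or an RDMA device plugin before the HCA shows up at all.

# Sketch: deploy and verify; the file name and <head-pod> are placeholders.
kubectl apply -f rayservice-vllm.yaml
kubectl get pods -o wide                          # head and worker pods should be Running
kubectl exec -it <head-pod> -- ibstat             # IB ports should still be Active inside the pod
kubectl exec -it <head-pod> -- ls /dev/infiniband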

3. I created a head node and a worker node with KubeRay using the image I built, then ran the following command on the head node:

python -m vllm.entrypoints.openai.api_server \
  --model /path/Qwen1.5-110B-Chat \
  --tensor-parallel-size 16 \
  --host 0.0.0.0 \
  --trust-remote-code \
  --port 8000 \
  --worker-use-ray
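To see which transport NCCL actually picks, I understand you can launch with NCCL debug logging enabled, roughly as in the sketch below. The exact log wording may differ between NCCL versions, and with --worker-use-ray the variables usually have to be set in each pod's env section (as in the YAML above) so the Ray workers on both nodes inherit them:

# Sketch: enable NCCL logging to see whether the IB transport or plain sockets are used.
NCCL_DEBUG=INFO NCCL_IB_HCA=mlx5 \
python -m vllm.entrypoints.openai.api_server \
  --model /path/Qwen1.5-110B-Chat \
  --tensor-parallel-size 16 \
  --worker-use-ray
# In the startup logs, lines mentioning "NET/IB" indicate InfiniBand is in use,
# while "NET/Socket" indicates a fallback to plain TCP.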

4. I ran the serving benchmark script like this:

python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /path/Qwen1.5-110B-Chat \
  --dataset benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json \
  --request-rate 5 \
  --num-prompts 100 \
  --host xxxx \
  --port 8000 \
  --trust-remote-code

I watched the Ray cluster's dashboard and found that the read/write throughput reaches up to 1.2 GB/s, but the traffic does not go over the InfiniBand network.
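A crude way to check whether any traffic goes over the IB ports at all is to watch the port counters on the host while the benchmark runs. This is a sketch: mlx5_0 and port 1 are placeholders, and as far as I know these counters are reported in 4-byte units.

# Sketch: watch the IB traffic counters on the host during the benchmark run.
watch -n 1 'cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data'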

To sum up, I want to use multiple nodes to run distributed inference for large models, expose an OpenAI API server, and use the high-speed InfiniBand network for communication between the nodes.

commented

I have similar use cases. I tested it on a DGX cluster and deliberately spread the falcon180b model across multiple nodes (the read/write per node was about 2-3 GB/s).
I didn't set USE_RDMA though.