lightonai / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Home Page: https://docs.vllm.ai

vLLM

This repo is a fork of the vLLM repo.

Usage

Pull the latest image from ECR:

bash docker/pull.sh vllm:latest
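
For reference, a pull script like this typically just logs in to ECR and pulls the image; a rough sketch of the equivalent commands, with <account_id> and <region> as placeholders for the actual registry (not taken from this repo):

aws ecr get-login-password --region <region> | \
    docker login --username AWS --password-stdin <account_id>.dkr.ecr.<region>.amazonaws.com
docker pull <account_id>.dkr.ecr.<region>.amazonaws.com/vllm:latest
docker tag <account_id>.dkr.ecr.<region>.amazonaws.com/vllm:latest vllm:latest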

Run the container (serving Llama 3 8B Instruct in this case):

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm \
    --model meta-llama/Meta-Llama-3-8B-Instruct
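
Once the server is up, it exposes an OpenAI-compatible API on port 8000. A quick way to check that the model is being served:

curl http://localhost:8000/v1/models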

Development

Setup dev mode

Clone the repo, then start the base Docker container from the repo root:

docker run --gpus all -it --rm --ipc=host \
	-v $(pwd):/workspace/vllm \
	-v ~/.cache/huggingface:/root/.cache/huggingface \
	-p 8000:8000 \
	nvcr.io/nvidia/pytorch:23.10-py3

Once the container is running, install vLLM in editable (dev) mode along with the dev requirements:

cd vllm
export VLLM_INSTALL_PUNICA_KERNELS=1  # build the Punica kernels (used for multi-LoRA serving)
export MAX_JOBS=8                     # cap parallel compilation jobs to keep memory usage in check
pip install -e .
pip install -r requirements-dev.txt
pip install boto3
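
A quick sanity check that the editable install worked:

python -c "import vllm; print(vllm.__version__)"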

The build will take a while. Once it's done, open another terminal on the host and run:

docker commit <container_id> vllm_dev

This creates a new image, vllm_dev, with the vLLM code already installed, so you won't need to reinstall the dev dependencies each time you start a new container.
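
To find the <container_id>, list the running containers from the host, e.g.:

docker ps --filter ancestor=nvcr.io/nvidia/pytorch:23.10-py3 --format '{{.ID}}'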

From now on, you can exit the initial container and use this command to start a dev container:

docker run --gpus all -it --rm --ipc=host \
	-v $(pwd):/workspace/vllm \
	-v ~/.cache/huggingface:/root/.cache/huggingface \
	-p 8000:8000 \
	vllm_dev

Launch the server

Enter the vllm_dev container and run:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct
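
The server implements the OpenAI chat completions endpoint, so you can query it with curl, for example:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'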

Format the code

Enter the vllm_dev container and run:

bash format.sh

Build the image

Once your changes are ready, you can build the production image. On the host, run:

bash docker/build.sh

And deploy it to ECR:

bash docker/deploy.sh <version>
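
The deploy script isn't shown here, but a script like this usually amounts to tagging the image with the version and pushing it to the registry; a hedged sketch (assuming you're already logged in to ECR, as in the pull step, with <account_id> and <region> again as placeholders):

docker tag vllm:latest <account_id>.dkr.ecr.<region>.amazonaws.com/vllm:<version>
docker push <account_id>.dkr.ecr.<region>.amazonaws.com/vllm:<version>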

Upgrade version

You can upgrade vLLM by rebasing this fork onto the official repo:

git clone https://github.com/lightonai/vllm
git remote add official https://github.com/vllm-project/vllm
git fetch official
git rebase <commit_sha> # Rebase onto a specific commit of the official repo (e.g. the commit SHA of the latest stable release)
git rebase --continue # After resolving conflicts (if any), continue the rebase
git push origin main --force
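
To find the commit SHA of a stable release, resolve its tag from the official remote (v0.4.2 below is just an illustrative tag, not a pinned version):

git fetch official --tags
git rev-list -n 1 v0.4.2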

Deployment

To deploy a model on SageMaker, follow this README.

About

License: Apache License 2.0

