Minimal vLLM demo using `app.py`.
The script loads a Hugging Face text-generation model via vLLM and prints a short completion for a given prompt.
vLLM is an open-source, high-performance inference engine for large language models. It achieves high throughput and low latency with techniques such as PagedAttention (efficient KV-cache management), continuous batching, tensor parallelism, and optimized CUDA kernels. vLLM supports many Hugging Face models, offers a simple Python API and an OpenAI-compatible HTTP server, and is optimized for GPU execution.
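For orientation, the core of such a script needs only a few lines. Below is a minimal sketch using the vLLM Python API; the actual `app.py` may differ in details, and the model ID is just an example:

```python
# Minimal vLLM generation sketch; requires Linux with an NVIDIA GPU (see below).
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/Phi-3-mini-4k-instruct")  # downloads weights on first run
params = SamplingParams(max_tokens=50, temperature=0.8)
outputs = llm.generate(["Hello world"], params)      # generate for a batch of one prompt
print("Completion:", outputs[0].outputs[0].text)
```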
vLLM does not natively support Windows. Running `app.py` on Windows typically fails with:

```
ModuleNotFoundError: No module named 'vllm._C'
```

That module is a compiled CUDA extension that vLLM ships only for Linux-based platforms. To use this repo with vLLM, run it in a Linux environment (WSL2 or Docker) with an NVIDIA GPU and driver.
You have three practical paths:
- Run on Linux via WSL2 (recommended on Windows)
- In an elevated PowerShell:
```powershell
wsl --install -d Ubuntu
```
- Reboot if prompted, then open the Ubuntu app.
- On the Windows side, install the latest NVIDIA GPU driver that supports WSL.
- In Ubuntu (WSL2):
- Ensure the GPU is visible:
```bash
nvidia-smi
```
(should show your GPU; if not, update the Windows driver. A Python-level check is also sketched after this list.)
- Install Python tools:
```bash
sudo apt update
sudo apt install -y python3-venv python3-pip
```
- Clone or open this repo inside your Ubuntu home folder (recommended, not the Windows path), then:
```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install vllm transformers
python app.py
```
- Docker with NVIDIA GPU
- Install Docker Desktop for Windows and enable the WSL 2 backend.
- Ensure the latest NVIDIA driver is installed on Windows. Enable GPU support in Docker Desktop.
- Verify GPU access:
```powershell
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
```
- Run the vLLM OpenAI-compatible server (quick start):
```powershell
.\scripts\run-server.ps1 -Model microsoft/Phi-3-mini-4k-instruct -Port 8000 -CacheVolume model-cache
```
- Or build and run this repo's `app.py` inside a GPU-enabled container:
```powershell
.\scripts\run-app.ps1 -Image demo-vllm -CacheVolume model-cache
```
- Transformers CPU fallback (works on native Windows, slower)
- If you just want to test the flow without a GPU, you can modify `app.py` to fall back to Hugging Face Transformers when vLLM isn't available. Example pattern:
```python
try:
    from vllm import LLM, SamplingParams
    USE_VLLM = True
except (ImportError, RuntimeError):
    # vLLM is unavailable (e.g. on native Windows); fall back to Transformers.
    USE_VLLM = False

def demo(model_name: str, prompt: str = "Hello world", max_tokens: int = 50):
    if USE_VLLM:
        llm = LLM(model=model_name)
        outputs = llm.generate([prompt], SamplingParams(max_tokens=max_tokens))
        print("Completion:", outputs[0].outputs[0].text)
    else:
        from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        pipe = pipeline("text-generation", model=model, tokenizer=tok)
        out = pipe(prompt, max_new_tokens=max_tokens, do_sample=True)[0]["generated_text"]
        print("Completion:", out)
```
- For CPU, use a small model, e.g., `TinyLlama/TinyLlama-1.1B-Chat-v1.0`. Larger models will be very slow or may OOM on CPU.
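To smoke-test the fallback path, call `demo` with a CPU-friendly model. This assumes the snippet above is saved in `app.py`; the prompt is arbitrary:

```python
if __name__ == "__main__":
    # TinyLlama is small enough to produce a completion on CPU in reasonable time.
    demo("TinyLlama/TinyLlama-1.1B-Chat-v1.0", prompt="Tell me a joke.", max_tokens=50)
```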
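For either GPU path (WSL2 or Docker), you can also sanity-check CUDA visibility from Python. PyTorch is pulled in as a vLLM dependency, so this makes a reasonable smoke test:

```python
import torch

# True means the NVIDIA driver and CUDA runtime are visible to this environment.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```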
Two helper scripts are provided in scripts/ to simplify GPU-enabled runs with Docker:
- `scripts/run-server.ps1`: starts the vLLM OpenAI-compatible HTTP server in a container.
- Parameters:
- `-Model` (string, default `"microsoft/Phi-3-mini-4k-instruct"`): HF model ID.
- `-Port` (int, default `8000`): host port to map to container port 8000.
- `-CacheVolume` (string, default `"model-cache"`): named Docker volume for the HF cache at `/root/.cache/huggingface` inside the container.
- Example:
```powershell
.\scripts\run-server.ps1 -Model microsoft/Phi-3-mini-4k-instruct -Port 8000 -CacheVolume model-cache
```
- After it's up, point any OpenAI-compatible client at `http://localhost:8000` (a Python client sketch follows this list).
- `scripts/run-app.ps1`: builds the Docker image from the included `Dockerfile` and runs `app.py` inside a GPU-enabled container.
- Parameters:
- `-Image` (string, default `"demo-vllm"`): name to tag the built image.
- `-CacheVolume` (string, default `"model-cache"`): named Docker volume for the HF cache.
- Example:
```powershell
.\scripts\run-app.ps1 -Image demo-vllm -CacheVolume model-cache
```
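Once the server is up, any OpenAI-compatible client can talk to it. Here is a minimal sketch using the official `openai` Python package; the key is a placeholder, since vLLM's server exposes the OpenAI API under `/v1` and does not check the key unless configured to:

```python
from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="microsoft/Phi-3-mini-4k-instruct",  # must match the model the server loaded
    messages=[{"role": "user", "content": "Hello world"}],
    max_tokens=50,
)
print(resp.choices[0].message.content)
```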
Notes:
- If PowerShell blocks scripts, allow local scripts once:
```powershell
Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned
```
- The named volume persists downloaded model weights between runs.
- Ensure Docker Desktop is in Linux containers mode and GPU support is enabled.
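To confirm the cache volume is doing its job, you can list what the Hugging Face cache holds from inside the container. A sketch using `huggingface_hub`, which is installed as a dependency of Transformers/vLLM:

```python
from huggingface_hub import scan_cache_dir

# Summarize cached model repos; these persist across runs via the named volume.
info = scan_cache_dir()
print(f"{info.size_on_disk / 1e9:.2f} GB across {len(info.repos)} cached repos")
for repo in info.repos:
    print(f"  {repo.repo_id}: {repo.size_on_disk / 1e9:.2f} GB")
```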
Example small-ish instruct models:
- `microsoft/Phi-3-mini-4k-instruct` (GPU recommended)
- `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (works on CPU for testing)
Update the call at the bottom of `app.py` to try different models.
- `ModuleNotFoundError: No module named 'vllm._C'` on Windows: use WSL2/Docker or the Transformers CPU fallback as outlined above.
- First run may download model weights from Hugging Face; ensure internet access and sufficient disk space.
- GPU out of memory: pick a smaller model or reduce `max_tokens` (a sketch of vLLM's memory knobs follows this list).
- Docker: ensure Linux containers and GPU support are enabled in Docker Desktop.
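For the out-of-memory case, vLLM also exposes constructor knobs that shrink its GPU footprint. A sketch with illustrative values:

```python
from vllm import LLM, SamplingParams

# Trade some throughput for VRAM headroom: cap the KV-cache budget and context length.
llm = LLM(
    model="microsoft/Phi-3-mini-4k-instruct",
    gpu_memory_utilization=0.80,  # fraction of VRAM vLLM may claim (default 0.90)
    max_model_len=2048,           # shorter max context means a smaller KV cache
)
out = llm.generate(["Hello world"], SamplingParams(max_tokens=25))
print(out[0].outputs[0].text)
```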
This repo is provided as a simple demo. Follow the licenses of vLLM, Transformers, and whichever model you choose.