LiteObject / demo-vllm

Minimal text-generation demo using vLLM (GPU) with a Transformers (CPU) fallback. Includes Docker/WSL setup, PowerShell run scripts, and chat-template prompting.

Repository: https://github.com/LiteObject/demo-vllm

vLLM Demo

Minimal vLLM demo using app.py.

The script loads a Hugging Face text generation model via vLLM and prints a short completion for a given prompt.

vLLM is an open-source, high-performance inference engine for large language models. It achieves high throughput and low latency with techniques like PagedAttention (efficient KV-cache management), continuous batching, tensor parallelism, and IO-aware scheduling. vLLM supports many Hugging Face models, offers a simple Python API and an OpenAI-compatible HTTP server, and is optimized for GPU execution.
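
For orientation, here is a minimal sketch of vLLM's Python API (the model, prompt, and sampling values are illustrative; the first run downloads weights from Hugging Face):

    from vllm import LLM, SamplingParams

    # Load a Hugging Face model; weights are downloaded on first use.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    params = SamplingParams(temperature=0.7, max_tokens=50)

    # generate() takes a list of prompts and returns one RequestOutput per prompt.
    outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
    print(outputs[0].outputs[0].text)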

Windows limitation (important)

vLLM does not natively support Windows. Running app.py on Windows typically fails with:

ModuleNotFoundError: No module named 'vllm._C'

That module is a compiled CUDA extension that vLLM ships only for Linux platforms. To use this repo with vLLM, run it in a Linux environment (WSL2 or Docker) with an NVIDIA GPU and driver.

Solutions

You have three practical paths:

  1. Run on Linux via WSL2 (recommended on Windows)
  • In an elevated PowerShell:
    • wsl --install -d Ubuntu
    • Reboot if prompted, then open the Ubuntu app.
  • On the Windows side, install the latest NVIDIA GPU driver that supports WSL.
  • In Ubuntu (WSL2):
    • Ensure the GPU is visible: nvidia-smi (should show your GPU). If not, update drivers.
    • Install Python tools:
      sudo apt update
      sudo apt install -y python3-venv python3-pip
    • Clone or open this repo inside your Ubuntu home folder (recommended; avoid a mounted Windows path like /mnt/c, which is slow), then:
      python3 -m venv .venv
      source .venv/bin/activate
      python -m pip install --upgrade pip
      pip install vllm transformers
      python app.py
  2. Docker with NVIDIA GPU
  • Install Docker Desktop for Windows and enable the WSL 2 backend.
  • Ensure the latest NVIDIA driver is installed on Windows. Enable GPU support in Docker Desktop.
  • Verify GPU access:
    docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
  • Run the vLLM OpenAI-compatible server (quick start):
    .\scripts\run-server.ps1 -Model microsoft/Phi-3-mini-4k-instruct -Port 8000 -CacheVolume model-cache
  • Or build and run this repo’s app.py inside a GPU-enabled container:
    .\scripts\run-app.ps1 -Image demo-vllm -CacheVolume model-cache
  3. Transformers CPU fallback (works on native Windows, slower)
  • If you just want to test the flow without a GPU, you can modify app.py to fall back to Hugging Face Transformers when vLLM isn’t available. Example pattern:
    try:
        # vLLM imports only on supported (Linux + NVIDIA GPU) platforms.
        from vllm import LLM, SamplingParams
        USE_VLLM = True
    except (ImportError, RuntimeError):
        USE_VLLM = False

    def demo(model_name: str, prompt: str = "Hello world", max_tokens: int = 50):
        if USE_VLLM:
            # GPU path: vLLM handles batching and KV-cache management.
            llm = LLM(model=model_name)
            outputs = llm.generate([prompt], SamplingParams(max_tokens=max_tokens))
            print("Completion:", outputs[0].outputs[0].text)
        else:
            # CPU fallback: plain Transformers text-generation pipeline.
            from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
            tok = AutoTokenizer.from_pretrained(model_name)
            model = AutoModelForCausalLM.from_pretrained(model_name)
            pipe = pipeline("text-generation", model=model, tokenizer=tok)
            out = pipe(prompt, max_new_tokens=max_tokens, do_sample=True)[0]["generated_text"]
            print("Completion:", out)
  • For CPU, use a small model, e.g., TinyLlama/TinyLlama-1.1B-Chat-v1.0. Larger models will be very slow or may OOM on CPU.
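
A note on prompting: instruct/chat models such as Phi-3 and TinyLlama-Chat expect their chat template rather than a raw string. A minimal sketch using the Transformers tokenizer (the rendered prompt can then be passed to either vLLM or the pipeline above):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    messages = [{"role": "user", "content": "Explain vLLM in one sentence."}]

    # Render the model's chat template to a plain string, appending the
    # assistant turn marker so the model starts answering.
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Pass `prompt` to llm.generate([prompt], ...) or to the Transformers pipeline.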

Docker scripts (Windows PowerShell)

Two helper scripts are provided in scripts/ to simplify GPU-enabled runs with Docker:

  • scripts/run-server.ps1

    • Starts the vLLM OpenAI-compatible HTTP server in a container.
    • Parameters:
      • -Model (string, default "microsoft/Phi-3-mini-4k-instruct"): HF model ID.
      • -Port (int, default 8000): Host port to map to container port 8000.
      • -CacheVolume (string, default "model-cache"): Named Docker volume for HF cache at /root/.cache/huggingface inside the container.
    • Example:
      .\scripts\run-server.ps1 -Model microsoft/Phi-3-mini-4k-instruct -Port 8000 -CacheVolume model-cache
    • After it’s up, call http://localhost:8000 with an OpenAI-compatible client (see the Python example after this list).
  • scripts/run-app.ps1

    • Builds the Docker image from the included Dockerfile and runs app.py inside a GPU-enabled container.
    • Parameters:
      • -Image (string, default "demo-vllm"): Name to tag the built image.
      • -CacheVolume (string, default "model-cache"): Named Docker volume for HF cache.
    • Example:
      .\scripts\run-app.ps1 -Image demo-vllm -CacheVolume model-cache
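
Once the server is up, any OpenAI-compatible client can call it. A minimal sketch with the openai Python package (v1+); the API key is a placeholder, since the server does not require one by default:

    from openai import OpenAI

    # vLLM exposes the OpenAI API under /v1; the key can be any string.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="microsoft/Phi-3-mini-4k-instruct",  # must match the served model
        messages=[{"role": "user", "content": "Say hello in five words."}],
        max_tokens=50,
    )
    print(resp.choices[0].message.content)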

Notes:

  • If PowerShell blocks scripts, allow local scripts once:
    Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned
  • The named volume persists downloaded model weights between runs.
  • Ensure Docker Desktop is in Linux containers mode and GPU support is enabled.

Quick start (Linux/WSL2)

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install vllm transformers
python app.py

Choosing a model

  • Example small-ish instruct models:
    • microsoft/Phi-3-mini-4k-instruct (GPU recommended)
    • TinyLlama/TinyLlama-1.1B-Chat-v1.0 (works on CPU for testing)

Update the call at the bottom of app.py to try different models.
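
For example, if app.py ends with a call like the demo() pattern shown earlier, swapping models is a one-line change (hypothetical sketch; adjust to the actual entry point in app.py):

    # Hypothetical final lines of app.py; swap the model ID to experiment.
    if __name__ == "__main__":
        demo("TinyLlama/TinyLlama-1.1B-Chat-v1.0", prompt="Explain vLLM in one sentence.")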

Troubleshooting

  • ModuleNotFoundError: No module named 'vllm._C' on Windows: use WSL2/Docker or the Transformers CPU fallback as outlined above.
  • First run may download model weights from Hugging Face; ensure internet access and sufficient disk space.
  • GPU out of memory: pick a smaller model, reduce max_tokens, or lower vLLM's memory budget (see the sketch below).
  • Docker: ensure Linux containers and GPU support are enabled in Docker Desktop.
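
For the out-of-memory case, vLLM's LLM constructor also exposes memory knobs; a sketch with illustrative values:

    from vllm import LLM

    llm = LLM(
        model="microsoft/Phi-3-mini-4k-instruct",
        gpu_memory_utilization=0.85,  # fraction of GPU memory vLLM may claim (default ~0.9)
        max_model_len=2048,           # shorter max context -> smaller KV cache
    )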

License

This repo is provided as a simple demo. Follow the licenses of vLLM, Transformers, and whichever model you choose.
