You can click on **Serverless vLLM** at the top of the **Explore Templates** page on RunPod. Be sure to specify your model name and maximum model length (refer to the docs).
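As an illustration, the template's environment settings might look like the following. The variable names `MODEL_NAME` and `MAX_MODEL_LEN` and the example length of 8192 are assumptions here; confirm the exact names and a sensible value for your model in the RunPod docs:

```
MODEL_NAME=cognitivecomputations/dolphin-2.9.1-llama-3-8b
MAX_MODEL_LEN=8192
```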
```bash
git clone https://github.com/ashleykleynhans/runpod-vllm-scripts.git
cd runpod-vllm-scripts
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
```
**Note:** The `MODEL` value is the path to the model on Hugging Face Hub, not the full URL. For example, `https://huggingface.co/cognitivecomputations/dolphin-2.9.1-llama-3-8b` becomes `cognitivecomputations/dolphin-2.9.1-llama-3-8b`.
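If you only have the Hub URL to hand, the conversion is just dropping the domain. A minimal sketch (the helper name `hf_model_path` is mine, not part of the repo):

```python
from urllib.parse import urlparse

def hf_model_path(url: str) -> str:
    """Strip the Hugging Face Hub domain from a model URL, leaving
    the org/model path that the MODEL setting expects."""
    return urlparse(url).path.strip("/")

print(hf_model_path("https://huggingface.co/cognitivecomputations/dolphin-2.9.1-llama-3-8b"))
# cognitivecomputations/dolphin-2.9.1-llama-3-8b
```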
- Copy the file `.env.example` to a new file called `.env`.
- Edit the `.env` file and set your RunPod API key, RunPod Endpoint ID, and the path to the model on Hugging Face Hub.
- Save the file.
| Script Name | Description |
|---|---|
| `chat_completions.py` | Chat completions without streaming |
| `chat_completions_streaming.py` | Chat completions with streaming |
| `completions.py` | Completions without streaming |
| `completions_streaming.py` | Completions with streaming |
| `list_models.py` | List available models |
Once you have installed the scripts and activated the virtual environment, you can run any of the above scripts, for example:

```bash
python3 chat_completions_streaming.py
```
You will need to edit the scripts to set your own prompt, messages, model, and so on.
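For orientation, here is a rough sketch of what a non-streaming chat-completion call against a RunPod serverless vLLM endpoint can look like. It assumes the endpoint exposes RunPod's OpenAI-compatible API under `/openai/v1` and that the `openai` package is installed; the helper names are mine, not the repo's, so treat this as an illustration rather than the scripts' actual code:

```python
import os

def openai_base_url(endpoint_id: str) -> str:
    # RunPod serverless endpoints expose an OpenAI-compatible API under
    # this path (assumption -- confirm against the current RunPod docs).
    return f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1"

def chat_once(prompt: str) -> str:
    """Send one chat-completion request. Requires RUNPOD_API_KEY and
    RUNPOD_ENDPOINT_ID in the environment, plus `pip install openai`."""
    from openai import OpenAI  # imported lazily so the helper above works without it
    client = OpenAI(
        api_key=os.environ["RUNPOD_API_KEY"],
        base_url=openai_base_url(os.environ["RUNPOD_ENDPOINT_ID"]),
    )
    response = client.chat.completions.create(
        model="cognitivecomputations/dolphin-2.9.1-llama-3-8b",  # your MODEL value
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

The streaming variants differ mainly in passing `stream=True` and iterating over the response chunks instead of reading one message.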