These one click templates allow you to quickly boot up an API for a given language model.
- Make sure to read through the README file on the templates!
- Runpod is recommended (better user interface) if using larger GPUs like A6000, A100 or H100.
- Vast.AI is recommended for lowest cost per hour with smaller GPUs like A4000 and A2000. However, the user experience is significantly worse with Vast.AI than runpod.
Advanced inferencing scripts (incl. for function calling) are available for purchase here.
Note: vLLM runs into issues sometimes if the pod template does not have the correct CUDA drivers. Unfortunately there is no way to know when picking a GPU. An issue has been raised here. As an alternative, you can run TGI (and even query in openai style, guide here). TGI is faster than vLLM and recommended in general.
To support the Trelis Research YouTube channel, you can sign up for an account with this link. Trelis is also supported by a 1-2% commission by your use of one-click templates.
- CUDA 12.1 one-click template here
- OpenChat 3.5 7B AWQ API - RECOMMENDED, OpenChat 3.5 7B bf16 - TGI API - lowest perplexity
- Mixtral Instruct API 4bit AWQ - RECOMMENDED, Mixtral Instruct API 8bit eetq, pod needs to be restarted multiple times to download all weights. Requires an A6000 or A100 or H100.
- Smaug 34B Chat (a Yi fine-tune) - fits in bf16 on an A100. BEWARE that guardrails are weaker on this model than Yi. As such, it may be best suited for structured generation
- Yi 34B Chat - fits in 16-bit on an A100
- Gemma Chat 9B.
- Notux 8x7B AWQ. Requires an A6000 or A100 or H100.
- CodeLlama 70B Instruct - 4bit AWQ, CodeLlama 70B Instruct - 4bit bitsandbytes. Requires an A6000 or A100 or H100.
- Mamba Instruct OpenHermes
- Llama 70B API by TrelisResearch.
- Deepseek Coder 33B Template.
- Medusa Vicuna (high speed speculative decoding - mostly a glamour template because OpenChat with AWQ is better quality and faster)
Note: The vLLM image has compatibility issues with certain CUDA drivers, leading to issues on certain pods. A6000 Ada is typically an option that works.
- Mistral Instruct 7B AWQ
- Mixtral Instruct 8x7B AWQ
- Qwen1.5 Chat 72B AWQ. Needs to be run on an A100 or H100. The 48 GB of VRAM on an A6000 is insufficient.
- CodeLlama 70B Instruct - 4bit AWQ. Requires an A6000 or A100 or H100.
Post a new issue if you would like other templates
To support the Trelis Research YouTube channel, you can sign up for an account with this affiliate link. Trelis is also supported by a 1-2% commission by your use of one-click templates.
- CUDA 12.1 one-click template here.
- Mistral 7B v0.2 AWQ
- Post a new issue if you would like other templates
One-click templates for function-calling are located on the HuggingFace model cards. Check out the collection here.
## Changelog
Feb 16 2023:
- Added a Mamba one click template.
Jan 21 2023:
- Swapped Runpod to before Vast.AI as user experience is much better with Runpod.
Jan 9 2023:
- Added Mixtral Instruct AWQ TGI
Dec 30 2023:
- Support gated models by adding HUGGING_FACE_HUB_TOKEN env variable.
- Speed up downloading using HuggingFace API.
Dec 29 2023:
- Add in one-click llama.cpp server template.