Use llama-swap inside a container with vLLM, llama.cpp, and exllamav2+tabbyAPI.
This repo is my working config; it's mainly used on a 16-core Ryzen machine with 64 GiB RAM and a single RTX 3090. For customization, you might want to grep for a few keywords:
$ git grep -E '\b868[6-8]\b' # port numbers
$ git grep sk-empty # API token/key
$ git grep -iE '(logging|loglevel|verbos)' # logging/verbosity settings
$ git grep -E "\b(8\.6|86)\b" # CUDA compute arch, 8.6 == ampere (RTX 3090)
$ head ./bin/host-llm-multi-backend-container.sh
$ ./bin/host-llm-multi-backend-container.sh --build --force-recreate
See what model/backend combinations are available:
$ curl -s -X GET -H "Authorization: Bearer sk-empty" http://localhost:8686/v1/models | jq -r '.data[].id' | grep -i 'qwen2.5-coder-7b'
vllm-Qwen2.5-Coder-7B
llamacpp-Qwen2.5-Coder-7B
exllamav2-Qwen2.5-Coder-7B
$ bash -x scripts/test-chat-completions.sh
+ curl -s -X POST http://localhost:8688/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: Bearer sk-empty' -d '{"model": "llamacpp-glm-4.5-air", "messages": [{"role": "user", "content": "Answer only with the missing word: The capital of Sweden is"}]}'
+ jq '.choices[0].message.content'
"\n<think>We are to answer with only the missing word. The question is: \"The capital of Sweden is\"\n The capital of Sweden is Stockholm. Therefore, the missing word is \"Stockholm\".</think>Stockholm"
+ retcode=0
+ '[' 0 -ne 0 ']'
+ return 0
+ exit
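The same proxy works with any OpenAI-compatible client library. A minimal sketch using the openai Python package (an assumption that it is installed; the port, dummy key, and model name are the ones used above):
```
# Minimal sketch (assumes the `openai` Python package is installed):
# talk to llama-swap through its OpenAI-compatible API on port 8686,
# using the dummy sk-empty key and a model id from /v1/models above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8686/v1", api_key="sk-empty")
resp = client.chat.completions.create(
    model="llamacpp-Qwen2.5-Coder-7B",
    messages=[{"role": "user", "content": "Answer only with the missing word: The capital of Sweden is"}],
)
print(resp.choices[0].message.content)
```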
$ ./scripts/enter-container-llama-swap.sh watch "ps aux | grep -E '(vllm|llama-|tabbyAPI)' | grep -v emacs | grep -v 'grep -E'"
$ while true; do clear; date; echo -n "currently loaded model: "; curl -s localhost:8686/running | jq -r '.running[0].model'; echo '...sleeping for 60 seconds'; sleep 60; done
$ curl -s localhost:8686/logs/stream/upstream
$ curl -s localhost:8686/logs/stream/proxy
$ ./scripts/enter-container-llama-swap.sh tail -F /tmp/llama-server-stdout-stderr.log
$ curl -H "Authorization: Bearer sk-empty" http://localhost:8686/upstream/llamacpp-Qwen3-30B-A3B/health
$ curl -H "Authorization: Bearer sk-empty" http://localhost:8686/upstream/llamacpp-Qwen3-30B-A3B/slots | jq
Downloading Unsloth's Maverick:
$ HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Llama-4-Maverick-17B-128E-Instruct-GGUF --exclude "*.gguf"
$ HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Llama-4-Maverick-17B-128E-Instruct-GGUF --include "UD-Q2_K_XL/*.gguf"
Downloading Unsloth's Q2_K_XL quants (248 GB) of DeepSeek V3 0324:
$ HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/DeepSeek-V3-0324-GGUF --exclude "*.gguf" \
&& HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/DeepSeek-V3-0324-GGUF --include "UD-Q2_K_XL/*.gguf"
deepseek-v3 (I only have 64 GiB of RAM, which is not enough)
# notes:
# 1. maybe use:
# - https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF
# - https://github.com/ikawrakow/ik_llama.cpp/discussions/258
llamacpp-deepseek-v3-0324:
cmd: |
/opt/llama.cpp/build/bin/llama-server
--port ${PORT}
--ctx-size 16384
--seed "-1"
--prio 2
--temp 0.3
--min-p 0.01
--model /root/.cache/huggingface/hub/models--unsloth--DeepSeek-V3-0324-GGUF/snapshots/b3e19c41e42074be413d73f1d0e1b7f2be9e60c3/UD-IQ2_XXS/DeepSeek-V3-0324-UD-IQ2_XXS-00001-of-00005.gguf # ~219GB for 1..5
--n-gpu-layers 1
--ubatch-size 1
--jinja
#--model /root/.cache/huggingface/hub/models--unsloth--DeepSeek-V3-0324-GGUF/snapshots/b3e19c41e42074be413d73f1d0e1b7f2be9e60c3/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf # zombie process after reading 231G (of 248G)
proxy: http://127.0.0.1:${PORT}
ttl: 3600
24 GB of VRAM doesn't seem to be enough for Qwen2.5-VL-32B
llamacpp-Qwen2.5-VL-32B:
cmd: |
/opt/llama.cpp/build/bin/llama-server
--port ${PORT}
--ctx-size 4096
--cache-type-k q8_0
--cache-type-v q4_0
--flash-attn
--n-gpu-layers 64
--hf-repo mradermacher/Qwen2.5-VL-32B-Instruct-i1-GGUF:i1-IQ3_S
--temp 0.15
proxy: http://127.0.0.1:${PORT}
ttl: 3600
Instead of using vLLM, we could probably use phildougherty's Python app (see submodule); it's not working yet, though...
phildougherty-Qwen2.5-VL-7B:
cmd: |
python3 /phildougherty-qwen-vl-api/app.py
--model Qwen2.5-VL-7B-Instruct
--port ${PORT}
--quant int8
# --quant int4
proxy: http://127.0.0.1:${PORT}
ttl: 3600
Draft model for QwQ-32B (I'd need an additional GPU for it to make sense):
```
#--hf-repo-draft mradermacher/Qwen2.5-Coder-0.5B-QwQ-draft-i1-GGUF:Q4_K_M # <-- token 151665 content differs - target '', draft ''
--hf-repo-draft bartowski/InfiniAILab_QwQ-0.5B-GGUF:Q8_0
--n-gpu-layers-draft 99
--override-kv tokenizer.ggml.bos_token_id=int:151643
# --draft-max 16
# --draft-min 5
# --draft-p-min 0.5
```
Testing qwen2.5-coder-7b on port 11902:
$ ./scripts/host-qwen2.5-coder-7b_localhost_port11902.sh
$ env OPENAI_API_BASE=localhost:11902/v1 OPENAI_API_KEY=sk-empty \
./scripts/test-chat-completions.sh modelnameplaceholder "In python, how do I defer deletion of a specific path to end of program?" \
| jq -r | batcat -pp -l md