Efficient platform for inference and serving of local LLMs, including an OpenAI compatible API server.
- OpenAI compatible API server provided for serving LLMs.
- Highly extensible trait-based system to allow rapid implementation of new module pipelines (see the sketch after this list).
- Streaming support in generation.
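To make the trait-based design concrete, here is a minimal sketch of what such a pipeline abstraction could look like. The trait and type names below are hypothetical illustrations under assumed semantics, not candle-vllm's actual API.

```rust
// Hypothetical sketch only: these names are illustrative, not from candle-vllm.

/// Sampling parameters extracted from an incoming request.
pub struct SamplingParams {
    /// Number of choices to generate (`n` in the OpenAI API).
    pub n: usize,
    pub temperature: f32,
}

/// A prompt already rendered from the request's chat messages.
pub struct Prompt(pub String);

/// Anything that can turn a prompt plus sampling parameters into
/// generated text can be plugged in as a pipeline.
pub trait ModulePipeline {
    /// Human-readable name, e.g. the model family.
    fn name(&self) -> &str;

    /// Generate `params.n` completions for the given prompt.
    fn forward(
        &mut self,
        prompt: &Prompt,
        params: &SamplingParams,
    ) -> Result<Vec<String>, Box<dyn std::error::Error>>;
}
```

In such a design, a new model backend would only need to implement this trait to be exposed through the same API server.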
One of the goals of `candle-vllm` is to interface locally served LLMs using an OpenAI compatible API server.
- During initial setup: the model, tokenizer and other parameters are loaded.
- When a request is received:
  - Sampling parameters are extracted, including `n` - the number of choices to generate.
  - The request is converted to a prompt, which is sent to the model pipeline.
  - If a streaming request is received, token-by-token streaming using SSEs is established (`n` choices of 1 token).
  - Otherwise, all `n` choices are generated and returned.
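As a rough illustration of the request flow above, the following sketch sends a chat completion request to a locally running server. The address, port, route (`/v1/chat/completions`), and model name are assumptions rather than values fixed by candle-vllm, and the example assumes the `reqwest` (with `blocking` and `json` features) and `serde_json` crates.

```rust
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Request body following the OpenAI chat completions schema.
    let body = json!({
        "model": "llama-7b",   // assumed model identifier
        "messages": [
            { "role": "user", "content": "Explain KV caching in one sentence." }
        ],
        "n": 2,                // number of choices to generate
        "stream": false        // true would switch to token-by-token SSE streaming
    });

    // Assumed local address/port; adjust to wherever the server listens.
    let response: Value = reqwest::blocking::Client::new()
        .post("http://localhost:2000/v1/chat/completions")
        .json(&body)
        .send()?
        .json()?;

    // Each entry in `choices` is one of the `n` generated completions.
    if let Some(choices) = response["choices"].as_array() {
        for choice in choices {
            println!("{}", choice["message"]["content"]);
        }
    }
    Ok(())
}
```

With `"stream": true`, the server instead keeps the connection open and pushes one SSE event per generated token, as described above.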
The following features are planned to be implemented, but contributions are especially welcome:
- Sampling methods:
  - Beam search (huggingface/candle#1319)
- `presence_penalty` and `frequency_penalty` (see the sketch after this list)
- Pipeline batching (#3)
- KV cache (#3)
- PagedAttention (#3)
- More pipelines (from `candle-transformers`)
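For contributors interested in the penalty item above, this is a sketch of how OpenAI-style `presence_penalty` and `frequency_penalty` are conventionally applied to logits before sampling; it is illustrative only and not code from candle-vllm.

```rust
use std::collections::HashMap;

/// Penalize tokens that have already been generated, following the
/// conventional OpenAI-style formulation: a flat presence term plus a
/// frequency term proportional to the number of occurrences.
fn apply_penalties(
    logits: &mut [f32],
    generated_tokens: &[u32],
    presence_penalty: f32,
    frequency_penalty: f32,
) {
    // Count how often each token id has already been generated.
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &tok in generated_tokens {
        *counts.entry(tok).or_insert(0) += 1;
    }
    for (&tok, &count) in &counts {
        if let Some(logit) = logits.get_mut(tok as usize) {
            *logit -= presence_penalty + frequency_penalty * count as f32;
        }
    }
}
```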
Resources:
- Python implementation: [`vllm-project/vllm`](https://github.com/vllm-project/vllm)
- The [`vllm` paper](https://arxiv.org/abs/2309.06180)