Efficient platform for inference and serving of local LLMs, including an OpenAI compatible API server.
- OpenAI compatible API server provided for serving LLMs.
- Highly extensible trait-based system to allow rapid implementation of new module pipelines (see the sketch after this list).
- Streaming support in generation.
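To make the trait-based design concrete, here is a minimal sketch of what such a pipeline abstraction could look like. The trait and type names below are hypothetical illustrations under assumed semantics, not candle-vllm's actual API.

```rust
// Hypothetical sketch only: these names are illustrative, not from candle-vllm.

/// Sampling parameters extracted from an incoming request.
pub struct SamplingParams {
    /// Number of choices to generate (`n` in the OpenAI API).
    pub n: usize,
    pub temperature: f32,
}

/// A prompt already rendered from the request's chat messages.
pub struct Prompt(pub String);

/// Anything that can turn a prompt plus sampling parameters into
/// generated text can be plugged in as a pipeline.
pub trait ModulePipeline {
    /// Human-readable name, e.g. the model family.
    fn name(&self) -> &str;

    /// Generate `params.n` completions for the given prompt.
    fn forward(
        &mut self,
        prompt: &Prompt,
        params: &SamplingParams,
    ) -> Result<Vec<String>, Box<dyn std::error::Error>>;
}
```

In such a design, a new model backend would only need to implement this trait to be exposed through the same API server.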
One of the goals of `candle-vllm` is to interface locally served LLMs using an OpenAI compatible API server.
- During initial setup: the model, tokenizer and other parameters are loaded.
- When a request is received:
  - Sampling parameters are extracted, including `n` - the number of choices to generate.
  - The request is converted to a prompt, which is sent to the model pipeline.
  - If a streaming request is received, token-by-token streaming using SSEs is established (`n` choices of 1 token).
  - Otherwise, all `n` choices are generated and returned.
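As a rough illustration of the request flow above, the following sketch sends a chat completion request to a locally running server. The address, port, route (`/v1/chat/completions`), and model name are assumptions rather than values fixed by candle-vllm, and the example assumes the `reqwest` (with `blocking` and `json` features) and `serde_json` crates.

```rust
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Request body following the OpenAI chat completions schema.
    let body = json!({
        "model": "llama-7b",   // assumed model identifier
        "messages": [
            { "role": "user", "content": "Explain KV caching in one sentence." }
        ],
        "n": 2,                // number of choices to generate
        "stream": false        // true would switch to token-by-token SSE streaming
    });

    // Assumed local address/port; adjust to wherever the server listens.
    let response: Value = reqwest::blocking::Client::new()
        .post("http://localhost:2000/v1/chat/completions")
        .json(&body)
        .send()?
        .json()?;

    // Each entry in `choices` is one of the `n` generated completions.
    if let Some(choices) = response["choices"].as_array() {
        for choice in choices {
            println!("{}", choice["message"]["content"]);
        }
    }
    Ok(())
}
```

With `"stream": true`, the server instead keeps the connection open and pushes one SSE event per generated token, as described above.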
The following features are planned to be implemented, but contributions are especially welcome:
- Sampling methods:
  - Beam search (huggingface/candle#1319)
- `presence_penalty` and `frequency_penalty` (see the sketch after this list)
- Pipeline batching (#3)
- KV cache (#3)
- PagedAttention (#3)
- More pipelines (from `candle-transformers`)
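For contributors interested in the penalty item above, this is a sketch of how OpenAI-style `presence_penalty` and `frequency_penalty` are conventionally applied to logits before sampling; it is illustrative only and not code from candle-vllm.

```rust
use std::collections::HashMap;

/// Penalize tokens that have already been generated, following the
/// conventional OpenAI-style formulation: a flat presence term plus a
/// frequency term proportional to the number of occurrences.
fn apply_penalties(
    logits: &mut [f32],
    generated_tokens: &[u32],
    presence_penalty: f32,
    frequency_penalty: f32,
) {
    // Count how often each token id has already been generated.
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &tok in generated_tokens {
        *counts.entry(tok).or_insert(0) += 1;
    }
    for (&tok, &count) in &counts {
        if let Some(logit) = logits.get_mut(tok as usize) {
            *logit -= presence_penalty + frequency_penalty * count as f32;
        }
    }
}
```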
Resources:
- Python implementation: [`vllm-project/vllm`](https://github.com/vllm-project/vllm)
- The [`vllm` paper](https://arxiv.org/abs/2309.06180)