EricLBuehler / candle-vllm

Efficient platform for inference and serving of local LLMs, including an OpenAI-compatible API server.

Support using arbitrary derivative models

ivanbaldo opened this issue

Currently the models need to be specified as, for example, llama7b, but what if one wants to use codellama/CodeLlama-7b-hf or meta-llama/Llama-2-7b-hf (the non-chat version), etc.?
A more flexible method should be implemented in the future.
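
For reference, a minimal sketch of what "specify by model ID" could look like using the hf-hub crate. This is an illustration only, not candle-vllm's actual loader; the `fetch_model_files` helper and the file names it requests are assumptions.

```rust
use hf_hub::api::sync::Api;
use std::path::PathBuf;

/// Hypothetical helper: resolve an arbitrary Hugging Face model ID and
/// download its config and tokenizer, instead of relying on a hard-coded
/// `llama7b` variant.
fn fetch_model_files(model_id: &str) -> anyhow::Result<(PathBuf, PathBuf)> {
    let api = Api::new()?;
    let repo = api.model(model_id.to_string());
    // Any derivative repo that ships these files resolves the same way.
    let config = repo.get("config.json")?;
    let tokenizer = repo.get("tokenizer.json")?;
    Ok((config, tokenizer))
}

fn main() -> anyhow::Result<()> {
    // e.g. "codellama/CodeLlama-7b-hf" or "meta-llama/Llama-2-7b-hf"
    let (config, tokenizer) = fetch_model_files("codellama/CodeLlama-7b-hf")?;
    println!("config: {config:?}\ntokenizer: {tokenizer:?}");
    Ok(())
}
```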

@ivanbaldo, thank you for this idea. Perhaps specifying models via a model ID could be implemented.

This might be easier than the idea I had.
I was trying to port support for quantized GGUF models from this candle example, but I am a bit lost bringing it in:
https://github.com/huggingface/candle/blob/main/candle-examples/examples/quantized/main.rs
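
For anyone following along, the loading path in the linked example looks roughly like the sketch below. The APIs may have moved since, so treat the `load_quantized` helper as illustrative and check the example itself for the current signatures.

```rust
use candle_core::quantized::gguf_file;
use candle_core::Device;
use candle_transformers::models::quantized_llama::ModelWeights;

/// Illustrative helper mirroring the linked example: read a GGUF file and
/// build quantized Llama weights from it.
fn load_quantized(path: &str, device: &Device) -> anyhow::Result<ModelWeights> {
    let mut file = std::fs::File::open(path)?;
    // The GGUF header holds the tensor layout plus a key/value metadata table.
    let content = gguf_file::Content::read(&mut file)?;
    // Build the quantized Llama model directly from the GGUF tensors.
    let model = ModelWeights::from_gguf(content, &mut file, device)?;
    Ok(model)
}
```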

It might also be an issue to know the base Llama model there so the parameters can be set correctly - I don't know whether GGUF has all the info you need in its model metadata.
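
GGUF files do carry a key/value metadata table with the usual Llama hyperparameters (layer count, head count, RoPE settings, and so on), although the exact keys depend on the exporter. A quick way to check what a given file provides is to dump that table with candle's gguf_file module; the `dump_metadata` helper here is illustrative.

```rust
use candle_core::quantized::gguf_file;

/// Illustrative helper: print every metadata key/value pair in a GGUF file.
fn dump_metadata(path: &str) -> anyhow::Result<()> {
    let mut file = std::fs::File::open(path)?;
    let content = gguf_file::Content::read(&mut file)?;
    // Keys such as "llama.block_count" or "llama.attention.head_count"
    // usually carry the hyperparameters a runner needs to configure itself.
    for (key, value) in &content.metadata {
        println!("{key}: {value:?}");
    }
    Ok(())
}
```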

GGUF would be a great addition! However, I am now working on mistral.rs, the successor to this project: https://github.com/EricLBuehler/mistral.rs

Mistral.rs currently supports quantized and unquantized Mistral models and may be used with arbitrary derivative models. It provides an OpenAI-compatible server, and there is a simple chat example.
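
As a rough illustration of what an OpenAI-compatible server allows, a client call might look like the following. The port, route, and model name here are assumptions, so check the mistral.rs README for the real invocation.

```rust
use serde_json::json;

fn main() -> anyhow::Result<()> {
    // Placeholder port and model name; adjust to however the server was launched.
    let body = json!({
        "model": "mistral",
        "messages": [{ "role": "user", "content": "Hello!" }]
    });
    let resp: serde_json::Value = reqwest::blocking::Client::new()
        .post("http://localhost:8080/v1/chat/completions")
        .json(&body)
        .send()?
        .json()?;
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}
```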

Currently the models need to be specified as, for example, llama7b, but what if one wants to use codellama/CodeLlama-7b-hf or meta-llama/Llama-2-7b-hf (the non-chat version), etc.? A more flexible method should be implemented in the future.

Please also refer to PR #46; it can load arbitrary models under a given model architecture.

@ivanbaldo, closing this as we now support loading weights of arbitrary derivative models. Please feel free to reopen!