microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime

Support Llama / Hugging Face's Universal Format (GGUF)

BrainSlugs83 opened this issue · comments

Are there any plans to support the Hugging Face / Llama.cpp universal format (GGUF)?

This format is very popular: it describes a whole model in a single file (even for mixture-of-experts models), is optimized for fast loading during inference (on CPU, GPU, or elsewhere), and supports quantization. There is also built-in tooling on Hugging Face to automatically convert other repositories to this format.

The format is designed to be unambiguous by containing all the information needed to load a model. It is also designed to be extensible, so that new information can be added to models without breaking compatibility.
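To illustrate the "self-describing" design, here is a minimal sketch of reading a GGUF file's fixed header. The field layout (4-byte magic, uint32 version, then uint64 tensor and metadata counts) is assumed from the published GGUF v2+ spec in the ggml repository; treat it as an approximation, not a reference parser.

```python
import struct
import io

GGUF_MAGIC = b"GGUF"  # bytes at the start of every GGUF file

def read_gguf_header(stream):
    """Parse the fixed GGUF header: magic, version, tensor count,
    and metadata key/value count (layout assumed from GGUF v2+)."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    # version: little-endian uint32; the two counts: uint64 each
    version, = struct.unpack("<I", stream.read(4))
    tensor_count, metadata_kv_count = struct.unpack("<QQ", stream.read(16))
    return {"version": version,
            "tensor_count": tensor_count,
            "metadata_kv_count": metadata_kv_count}

# Demo with an in-memory header (version 3, 2 tensors, 5 metadata entries)
fake = GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5)
print(read_gguf_header(io.BytesIO(fake)))
```

The metadata key/value section that follows the header is what makes the format extensible: new keys can be added without breaking existing loaders, which simply skip keys they don't recognize.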

And there is a huge repo of models based on Llama, Mistral, etc. that are already in this format; including fine tunes of Microsoft Phi.

It would be hugely convenient for developers if the DirectML model loader could just load these directly...

[Side question: what is the currently supported format? I can't find any repos on Hugging Face similar enough to the supported Phi-3 repo to "just work"; they always complain about missing JSON files, etc. I'm not even fully sure what the current format is, let alone how to convert an existing model to it.]

commented

Bump.

@BrainSlugs83 I saw this post and was inspired to come up with a solution, at least in the interim, since I was hoping for a single-file format as well. I'm working on a project you may find useful; I'll just let the video roll. If it's something you're interested in and you'd like to test it, let me know.

vfolder_phi3-onnx.mp4

Hi @BrainSlugs83, this API uses the ONNX (Open Neural Network Exchange) model format. Moving this issue into a Discussion as a feature request.
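On the side question above: onnxruntime-genai models ship as a folder rather than a single file. The sketch below checks a model directory for the files the published Phi-3 ONNX repos contain; the exact file names (`genai_config.json`, tokenizer files) are taken from those repos and may vary by model, so treat the list as an assumption.

```python
import os
import tempfile

# Files typically present in an onnxruntime-genai model folder, as seen in
# the official Phi-3 ONNX repos (an assumption -- the set may vary by model)
EXPECTED_FILES = ["genai_config.json", "tokenizer.json", "tokenizer_config.json"]

def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected files absent from model_dir, and flag the
    case where no .onnx graph is present at all."""
    present = set(os.listdir(model_dir))
    missing = [f for f in expected if f not in present]
    if not any(name.endswith(".onnx") for name in present):
        missing.append("<model>.onnx")  # placeholder: any ONNX graph file
    return missing

# Demo: an empty folder is missing everything
with tempfile.TemporaryDirectory() as d:
    print(missing_model_files(d))
```

The "missing JSON files" errors mentioned above are consistent with this layout: a GGUF repo provides one binary file, while this API expects the config and tokenizer JSON files alongside the `.onnx` graph.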