rustformers / llm

[Unmaintained, see README] An ecosystem of Rust libraries for working with large language models

Home Page: https://docs.rs/llm/latest/llm/

Supporting Llama-2 70B param

AmineDiro opened this issue · comments

Hello,
First of all, I want to thank you for this amazing crate! It is truly a joy to work with LLMs in Rust 😄.

I recently wrote an API that serves Llama-2 models using this crate.

I have an issue serving the Llama2-70B GGML model. Unlike the smaller variants (and the original 65B LLaMA), the 70B Llama-2 model uses grouped query attention. This is handled in llama.cpp by specifying the n_gqa param in the model hyperparameters, which feels a little bit hacky 🤔

I would love to work on adding support for n_gqa in this crate. I think it can be added to the Llama model:

/// LLaMA [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning))
#[derive(Debug, Default, PartialEq, Eq, Clone, Copy)]
pub struct Hyperparameters {
    // ... existing fields ...
    pub n_head_kv: usize,
}

Where n_head_kv = n_head / n_gqa (requiring that n_head % n_gqa == 0), and pass the n_gqa parameter in ModelParameters as an Option 🤔
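
For illustration, here is a minimal sketch of how an optional n_gqa override could be folded into n_head_kv at load time. apply_gqa is a hypothetical helper, not part of the crate:

// Hypothetical helper: derive n_head_kv from an optional n_gqa override.
fn apply_gqa(n_head: usize, n_gqa: Option<usize>) -> Result<usize, String> {
    match n_gqa {
        // No override: plain multi-head attention, one KV head per query head.
        None => Ok(n_head),
        Some(gqa) => {
            // Grouped query attention requires the head count to divide evenly.
            if gqa == 0 || n_head % gqa != 0 {
                return Err(format!(
                    "n_head ({n_head}) must be divisible by n_gqa ({gqa})"
                ));
            }
            Ok(n_head / gqa)
        }
    }
}

fn main() {
    // Llama-2 70B: 64 query heads grouped 8-to-1 onto 8 KV heads.
    assert_eq!(apply_gqa(64, Some(8)), Ok(8));
    // Llama-2 7B/13B: no grouping, so n_head_kv == n_head.
    assert_eq!(apply_gqa(32, None), Ok(32));
}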

Thank you for your help!

If you want, you could create a PR containing these changes. The only thing I don't like about it is that ModelParameters will then contain the n_gqa parameter, which is only used by Llama models and no other architectures, but I guess until GGUF is finalized this is the best way to implement it 🤔

Yes, I've been considering implementing this but have held off because of the need to parameterise the model just for something that should hopefully be removed in a few weeks.

We actually had a mechanism for this exact purpose, but I removed it after the model format was updated with a breaking change that removed the specific parameter.

I don't think I want to reintroduce it - especially as we should never need to do this again - but I would like to support 70B at some point. I'm not sure. Maybe a bodge function on the Llama model that lets you set n_head_kv after the fact?

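A minimal sketch of what that bodge setter could look like; the struct shape and names here are illustrative, not the crate's actual layout:

// Illustrative struct; the real Hyperparameters has more fields.
pub struct Hyperparameters {
    pub n_head: usize,
    pub n_head_kv: usize,
    // ... other fields elided
}

impl Hyperparameters {
    /// Override the number of KV heads for grouped query attention.
    /// This must run before the inference session is created, because
    /// the KV cache tensor shapes depend on n_head_kv.
    pub fn set_n_head_kv(&mut self, n_head_kv: usize) {
        assert!(
            n_head_kv > 0 && self.n_head % n_head_kv == 0,
            "n_head must be a multiple of n_head_kv"
        );
        self.n_head_kv = n_head_kv;
    }
}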

Thanks for the reply! I 100% agree that adding it to ModelParameters feels hacky.

I think that setting it after the fact might be an issue, because you need n_head_kv at load time. Is there a heuristic we could use at load time for Llama GGML models to determine that we need GQA (for example, the size of the file)? Then we could set it to a fixed value (n_gqa = 8 for Llama-2 models) and keep it only in Llama's Hyperparameters struct 🤔
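
One alternative to guessing from the file size: since GQA shrinks the K projection relative to the Q projection, the grouping factor can be read off the tensor shapes already stored in the GGML file. A sketch, assuming the loader exposes those dimensions (n_gqa_from_shapes is a hypothetical helper):

// Hypothetical: recover the grouping factor from the attention projection
// shapes recorded in the model file.
fn n_gqa_from_shapes(wq_rows: usize, wk_rows: usize) -> usize {
    assert!(wq_rows % wk_rows == 0, "unexpected attention projection shapes");
    wq_rows / wk_rows
}

fn main() {
    // Llama-2 70B: wq is 8192 wide but wk is only 1024 => n_gqa = 8.
    assert_eq!(n_gqa_from_shapes(8192, 1024), 8);
    // MHA models (e.g. 7B): the shapes match and the factor is 1.
    assert_eq!(n_gqa_from_shapes(4096, 4096), 1);
}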

In my opinion we should just hack it in for now; GGUF seems to be nearly ready, meaning we can clean up the implementation when we implement it. The most important thing is to get grouped query attention working in the inference code.
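
To show why the inference code has to know about this: the KV cache is allocated per KV head, so grouping shrinks it by the n_gqa factor. A rough sketch with the published Llama-2 70B shapes (kv_cache_elements is illustrative, not the crate's allocator):

// Illustrative KV cache sizing: K and V for every layer and position,
// with each row only n_embd / n_gqa wide under grouped query attention.
fn kv_cache_elements(n_layer: usize, n_ctx: usize, n_embd: usize, n_gqa: usize) -> usize {
    let n_embd_gqa = n_embd / n_gqa; // width of one K (or V) row per token
    2 * n_layer * n_ctx * n_embd_gqa
}

fn main() {
    // 70B with GQA: 80 layers, 4096 context, 8192 embedding, grouping factor 8.
    let gqa = kv_cache_elements(80, 4096, 8192, 8);
    // The same shapes without grouping would need 8x the cache.
    let mha = kv_cache_elements(80, 4096, 8192, 1);
    assert_eq!(mha, gqa * 8);
    println!("KV cache elements with GQA: {gqa}, without: {mha}");
}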

Can this be closed since #409 was merged? Is full support for Llama 2 available now?

It should work, but it's not ideal, especially as GGUF support hasn't landed yet. It should be OK if you find a pre-GGUF 70B model and pass in the right number of GQA heads.
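
A hedged usage sketch of loading a pre-GGUF 70B file with the GQA head count passed in explicitly. The n_gqa field name is an assumption based on this thread, so check the ModelParameters docs for the spelling that actually merged:

use std::path::Path;

fn main() {
    // Assumed field: the thread discusses adding n_gqa to ModelParameters.
    let params = llm::ModelParameters {
        n_gqa: Some(8), // Llama-2 70B: 64 query heads grouped onto 8 KV heads
        ..Default::default()
    };
    let model = llm::load::<llm::models::Llama>(
        Path::new("llama-2-70b.ggmlv3.q4_0.bin"),
        llm::TokenizerSource::Embedded,
        params,
        llm::load_progress_callback_stdout,
    )
    .unwrap_or_else(|err| panic!("Failed to load model: {err}"));
    // From here, sessions and inference work the same as for smaller models.
    let _session = model.start_session(Default::default());
}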

I'm leaving this open in the meantime, as I think we should only really close it when using 70B requires no additional work over the smaller models.