rustformers / llm

[Unmaintained, see README] An ecosystem of Rust libraries for working with large language models

Home Page: https://docs.rs/llm/latest/llm/

Supporting Llama-2 70B param

AmineDiro opened this issue · comments

Hello,
First of all, I want to thank you for this amazing crate! It is truly a joy to work with LLMs in Rust 😄.

I recently wrote an API that serves Llama-2 models using this crate.

I have an issue serving the Llama2-70B GGML model. Unlike the smaller variants (and the original 65B LLaMA), the 70B Llama-2 model uses grouped query attention. This is handled in llama.cpp by specifying the n_gqa param in the model hyperparameters, which feels a little bit hacky 🤔

I would love to work on adding support for n_gqa in this crate. I think it can be added to the Llama model:

/// LLaMA [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning))
#[derive(Debug, Default, PartialEq, Eq, Clone, Copy)]
pub struct Hyperparameters {
    // ... existing fields ...
    pub n_head_kv: usize,
}

Where n_head_kv = n_head / n_gqa (requiring that n_head % n_gqa == 0), and pass the n_gqa parameter in ModelParameters as an Option 🤔
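
For illustration, here is a minimal sketch of how an optional n_gqa override could be folded into n_head_kv at load time. apply_gqa is a hypothetical helper, not part of the crate:

// Hypothetical helper: derive n_head_kv from an optional n_gqa override.
fn apply_gqa(n_head: usize, n_gqa: Option<usize>) -> Result<usize, String> {
    match n_gqa {
        // No override: plain multi-head attention, one KV head per query head.
        None => Ok(n_head),
        Some(gqa) => {
            // Grouped query attention requires the head count to divide evenly.
            if gqa == 0 || n_head % gqa != 0 {
                return Err(format!(
                    "n_head ({n_head}) must be divisible by n_gqa ({gqa})"
                ));
            }
            Ok(n_head / gqa)
        }
    }
}

fn main() {
    // Llama-2 70B: 64 query heads grouped 8-to-1 onto 8 KV heads.
    assert_eq!(apply_gqa(64, Some(8)), Ok(8));
    // Llama-2 7B/13B: no grouping, so n_head_kv == n_head.
    assert_eq!(apply_gqa(32, None), Ok(32));
}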

Thank you for your help!

If you want, you could create a PR containing these changes. The only thing I don't like about it is that ModelParameters will then contain the n_gqa parameter, which is only used by Llama models and no other architectures, but I guess until GGUF is finalized this is the best way to implement it 🤔

Yes, I've been considering implementing this but have held off because of the need to parameterise the model just for something that should hopefully be removed in a few weeks.

We actually had a mechanism for this exact purpose, but I removed it after the model format was updated with a breaking change that removed the specific parameter.

I don't think I want to reintroduce it - especially as we should never need to do this again - but I would like to support 70B at some point. I'm not sure. Maybe a bodge function on the Llama model that lets you set n_head_kv after the fact?

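A minimal sketch of what that bodge setter could look like; the struct shape and names here are illustrative, not the crate's actual layout:

// Illustrative struct; the real Hyperparameters has more fields.
pub struct Hyperparameters {
    pub n_head: usize,
    pub n_head_kv: usize,
    // ... other fields elided
}

impl Hyperparameters {
    /// Override the number of KV heads for grouped query attention.
    /// This must run before the inference session is created, because
    /// the KV cache tensor shapes depend on n_head_kv.
    pub fn set_n_head_kv(&mut self, n_head_kv: usize) {
        assert!(
            n_head_kv > 0 && self.n_head % n_head_kv == 0,
            "n_head must be a multiple of n_head_kv"
        );
        self.n_head_kv = n_head_kv;
    }
}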

Thanks for the reply! I 100% agree that adding it to ModelParameters feels hacky.

I think that setting it after the fact might be an issue, because you need n_head_kv at load time. Is there a heuristic we could use at load time for Llama GGML models to determine that we need GQA (for example, the size of the file)? Then we could set it to a fixed value (n_gqa = 8 for Llama-2 models) and keep it only in Llama's Hyperparameters struct 🤔
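
One alternative to guessing from the file size: since GQA shrinks the K projection relative to the Q projection, the grouping factor can be read off the tensor shapes already stored in the GGML file. A sketch, assuming the loader exposes those dimensions (n_gqa_from_shapes is a hypothetical helper):

// Hypothetical: recover the grouping factor from the attention projection
// shapes recorded in the model file.
fn n_gqa_from_shapes(wq_rows: usize, wk_rows: usize) -> usize {
    assert!(wq_rows % wk_rows == 0, "unexpected attention projection shapes");
    wq_rows / wk_rows
}

fn main() {
    // Llama-2 70B: wq is 8192 wide but wk is only 1024 => n_gqa = 8.
    assert_eq!(n_gqa_from_shapes(8192, 1024), 8);
    // MHA models (e.g. 7B): the shapes match and the factor is 1.
    assert_eq!(n_gqa_from_shapes(4096, 4096), 1);
}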

In my opinion we should just hack it in for now; GGUF seems to be nearly ready, meaning we can clean up the implementation when we implement it. The most important thing is to get grouped query attention working in the inference code.
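
To show why the inference code has to know about this: the KV cache is allocated per KV head, so grouping shrinks it by the n_gqa factor. A rough sketch with the published Llama-2 70B shapes (kv_cache_elements is illustrative, not the crate's allocator):

// Illustrative KV cache sizing: K and V for every layer and position,
// with each row only n_embd / n_gqa wide under grouped query attention.
fn kv_cache_elements(n_layer: usize, n_ctx: usize, n_embd: usize, n_gqa: usize) -> usize {
    let n_embd_gqa = n_embd / n_gqa; // width of one K (or V) row per token
    2 * n_layer * n_ctx * n_embd_gqa
}

fn main() {
    // 70B with GQA: 80 layers, 4096 context, 8192 embedding, grouping factor 8.
    let gqa = kv_cache_elements(80, 4096, 8192, 8);
    // The same shapes without grouping would need 8x the cache.
    let mha = kv_cache_elements(80, 4096, 8192, 1);
    assert_eq!(mha, gqa * 8);
    println!("KV cache elements with GQA: {gqa}, without: {mha}");
}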

Can this be closed since #409 was merged? Is full support for Llama 2 available now?

It should work, but it's not ideal, especially as GGUF support hasn't landed yet. It should be OK if you find a pre-GGUF 70B model and pass in the right number of GQA heads.
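
A hedged usage sketch of loading a pre-GGUF 70B file with the GQA head count passed in explicitly. The n_gqa field name is an assumption based on this thread, so check the ModelParameters docs for the spelling that actually merged:

use std::path::Path;

fn main() {
    // Assumed field: the thread discusses adding n_gqa to ModelParameters.
    let params = llm::ModelParameters {
        n_gqa: Some(8), // Llama-2 70B: 64 query heads grouped onto 8 KV heads
        ..Default::default()
    };
    let model = llm::load::<llm::models::Llama>(
        Path::new("llama-2-70b.ggmlv3.q4_0.bin"),
        llm::TokenizerSource::Embedded,
        params,
        llm::load_progress_callback_stdout,
    )
    .unwrap_or_else(|err| panic!("Failed to load model: {err}"));
    // From here, sessions and inference work the same as for smaller models.
    let _session = model.start_session(Default::default());
}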

I'm leaving this open in the meantime, as I think we should only really close it when using 70B requires no additional work over the smaller models.