turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

command-r plus config

bdambrosio opened this issue · comments

I see you have max_position_embeddings as 8192, but model_max_length at 131072.
To fit in my current VRAM I had to set model_max_length to 32768.
Should I leave max_position_embeddings at 8192?

thanks for all your work, AND for the command-r quant!

You'd typically specify the max length in whatever UI/API you're using anyway. But if you want to change the config.json file, ExLlama will ignore max_position_embeddings when model_max_length is present.

The value only specifies what the default context length is, and it can be used by backends to determine when they should use alpha scaling to accommodate more context than the model can natively handle. To account for the variety of models out there, each with its own unique take on what the standard is, ExLlama uses the following config keys, in order of priority (see the sketch after the list):

  • max_sequence_length
  • model_max_length
  • max_position_embeddings
  • max_seq_len

If this isn't stated in the documentation, it should be.

The thing is that this changes with every new architecture. model_max_length was only just introduced by cmdr+, and the reason it takes priority over max_position_embeddings is that Cohere first released the model with the latter key and a value of 8192 (for whatever reason; it's a 128k model), then added model_max_length in a PR afterwards.

max_seq_len was added to the list when DBRX came out. DBRX also does a number of other things differently like moving the rope_theta key into the attn_config section. Basically every new architecture that comes out is implemented in custom code that's free to define the config.json format however it wants, and all a framework like ExLlama can do is try to copy whatever weird and wonderful decisions they make in that regard.
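
For example, where most models keep rope_theta at the top level of config.json, DBRX tucks it inside attn_config (abridged; values illustrative only):

```json
{
  "max_seq_len": 32768,
  "attn_config": {
    "rope_theta": 500000
  }
}
```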

So there is no standard at all really, and there's no documentation I could write for how to hack the config.json that wouldn't have to be constantly revised. ExLlama instead parses the file into a standard class, ExLlamaV2Config, where the max sequence length is always defined by ExLlamaV2Config.max_seq_len.
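
In practice that means you can ignore the raw keys and just override the parsed value, which also covers the original question about fitting a smaller context into VRAM. A rough sketch, assuming the usual exllamav2 Python loading pattern (the model path is a placeholder):

```python
from exllamav2 import ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/command-r-plus-exl2"  # placeholder path
config.prepare()  # parses config.json into the standard fields

# Whichever key config.json used, the parsed value ends up here and can be
# overridden before loading the model, e.g. lowered to fit in VRAM:
config.max_seq_len = 32768
```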

I totally get the challenge of keeping track of what is what and maintaining a semi-coherent order of priority.
Now imagine the confusion for us poor schmucks playing with LLMs on and off throughout the week.

Being explicit about these things, either in documentation or in console output, could go a long way toward making the fog of terminology overload slightly more penetrable :-)

If I haven't thanked you for Exllama and ExUI already: Thank You!