jllllll / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

Llama2 70B: can't use more than 2048 tokens context

langchain4j opened this issue

Hi @jllllll, thanks a lot for your work!

I am using exllama (0.12.0) via oobabooga's openai api (main branch) and it seems that max_seq_len is ignored.
If I set it to 4096 and send a request exceeding 2048 tokens, ooba will fail:

 - - [23/Aug/2023 12:46:53] "POST /v1/v1/chat/completions HTTP/1.1" 200 -
Exception: When loading instruction-templates/None.yaml: FileNotFoundError(2, 'No such file or directory')
Warning: Loaded default instruction-following template for model.
 - - [23/Aug/2023 12:46:53] "POST /v1/v1/chat/completions HTTP/1.1" 400 -

I've debugged the code in https://github.com/oobabooga/text-generation-webui/blob/main/modules/exllama.py and see that config.max_seq_len is set correctly.

Could you please help me with this?
Thanks a lot in advance!

The code for using max_seq_len properly was added to ooba in this commit: oobabooga/text-generation-webui@ef17da7
Is your installation up to date with that commit?

The error that you showed is complaining about not finding an instruction template. I'll check the openai extension code to see if that is being set properly.

At the bottom of the Models section of the openai extension's README, it mentions what to do if you see that error:
https://github.com/oobabooga/text-generation-webui/blob/main/extensions/openai/README.md#models

For the proper instruction format to be detected you need to have a matching model entry in your models/config.yaml file. Be sure to keep this file up to date. A matching instruction template file in the characters/instruction-following/ folder will be loaded and applied to format messages correctly for the model - this is critical for good results.

For example, the Wizard-Vicuna family of models are trained with the Vicuna 1.1 format. In the models/config.yaml file there is this matching entry:

.*wizard.*vicuna:
  mode: 'instruct'
  instruction_template: 'Vicuna-v1.1'

This refers to characters/instruction-following/Vicuna-v1.1.yaml, which looks like this:

user: "USER:"
bot: "ASSISTANT:"
turn_template: "<|user|> <|user-message|>\n<|bot|> <|bot-message|></s>\n"
context: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\n"

For most common models this is already set up, but if you are using a new or uncommon model you may need to add a matching entry to models/config.yaml and possibly create your own instruction-following template for best results.

If you see this in your logs, it probably means that the correct format could not be loaded:

Warning: Loaded default instruction-following template for model.

Yes, I am running the latest version. I can see max_seq_len=self.model.config.max_seq_len in exllama.py

The error that you showed is complaining about not finding an instruction template. I'll check the openai extension code to see if that is being set properly.

Please ignore the exception regarding templates; it is irrelevant. I was referring to the HTTP status 400.

This is how I run ooba:
python server.py --listen --listen-port=8149 --chat --multi-user --extensions openai --model-dir /local0/xxx/ooba_models --model TheBloke_Llama-2-70B-chat-GPTQ --loader exllama --max_seq_len 4096

This is what I get in the logs if I send a request with 1886 input tokens and max_tokens set to 500:
Warning: $This model maximum context length is 2048 tokens. However, your messages resulted in over 1886 tokens and max_tokens is 500.

On line 121 in exllama.py the value of max_new_tokens is 162 (i.e. 2048 - 1886, so the 2048 limit is still being applied).

I tried to manually set it to max_new_tokens=3000-ids.shape[-1], but then I get HTTP 400 if I send an input prompt longer than 2048 tokens...

BTW I am getting 400 even if I don't override max_new_tokens and send an input prompt longer than 2048 tokens.

Ok, it seems like an issue on ooba's side. Sorry for the trouble.

I think I know what is wrong.

POST /v1/v1/chat/completions shows that there are 2 instances of /v1 in the endpoint. There should be only one.

The exact endpoint that you should be using is /v1/chat/completions
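
For example, something like this should hit the right path (just a sketch; the host and port are placeholders for wherever the openai extension reports it is listening):

import requests

# Note the single /v1 prefix in the path.
resp = requests.post(
    "http://localhost:5001/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 500,
    },
)
print(resp.status_code, resp.json())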

@jllllll oops. But it behaves the same. I mean, even with two "v1" it works, but only with an input prompt of less than 2k tokens. The problem is that when I send more than 2k tokens, it fails.

I'm honestly not sure where the issue is.
If it is a problem with the openai extension, then you should open an issue in the text-gen repo.
If it is with exllama, then you should open an issue in the main exllama repo.

This repo is purely for maintaining an installable package for exllama.
Very little of the exllama code is different from what is in the main repo, so any problems with that code should be handled by them.

It is indeed an issue in ooba. I had to make these changes as a quick fix:
In https://github.com/oobabooga/text-generation-webui/blob/main/extensions/openai/completions.py, comment out lines 239-241:

if token_count >= req_params['truncation_length']:
    err_msg = f"This model maximum context length is {req_params['truncation_length']} tokens. However, your messages resulted in over {token_count} tokens."
    raise InvalidRequestError(message=err_msg, param='messages')

In https://github.com/oobabooga/text-generation-webui/blob/main/modules/exllama.py, change line 120 to:
max_new_tokens=4096-ids.shape[-1]
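
A less hard-coded variant of the same idea would be to derive the cap from the configured context length instead of the literal 4096 (just a sketch, assuming self.model.config.max_seq_len is accessible at that point, as mentioned above):

# Sketch only, not a tested patch: cap new tokens by the configured context window.
max_new_tokens = max(0, self.model.config.max_seq_len - ids.shape[-1])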