marella / ctransformers

Python bindings for the Transformer models implemented in C/C++ using GGML library.

Something wrong with a generator

yukiarimo opened this issue

My first approach

from ctransformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    config["server"]["models_dir"] + config["server"]["default_model_file"],
    model_type='llama2',
    max_new_tokens=config["ai"]["max_new_tokens"],
    context_length=config["ai"]["context_length"],
    temperature=config["ai"]["temperature"],
    repetition_penalty=config["ai"]["repetition_penalty"],
    last_n_tokens=config["ai"]["last_n_tokens"],
    seed=config["ai"]["seed"],
    top_k=config["ai"]["top_k"],
    top_p=config["ai"]["top_p"],
    stop=config["ai"]["stop"],
    batch_size=config["ai"]["batch_size"],
    gpu_layers=config["ai"]["gpu_layers"]
)
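
For reference, the config entries used in these snippets are assumed to look roughly like this (illustrative values only, not my real config):

# Illustrative only -- the actual values live in my config file
config = {
    "server": {
        "models_dir": "models/",
        "default_model_file": "llama-2-7b-chat.ggmlv3.q4_0.bin",
    },
    "ai": {
        "max_new_tokens": 256,
        "context_length": 512,
        "temperature": 0.7,
        "repetition_penalty": 1.1,
        "last_n_tokens": 64,
        "seed": -1,
        "top_k": 40,
        "top_p": 0.95,
        "stop": ["###"],
        "batch_size": 8,
        "gpu_layers": 0,
        "threads": 4,
    },
}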


print("TOKENS: ", len(model.tokenize(new_history)))
# A lot

new_history_crop = model.tokenize(new_history)
# Keep only the last (context_length - 3) tokens
new_history_crop = new_history_crop[-(config["ai"]["context_length"] - 3):]
print("CONTEXT LENGTH: ", -(config["ai"]["context_length"] - 3))

# This will be 509 (allowed 512)

print(len(new_history_crop))
response = model(model.detokenize(new_history_crop), stream=False)

But the generator produces a stream of errors:

Number of tokens (513) exceeded maximum context length (512).
Number of tokens (514) exceeded maximum context length (512).
Number of tokens (515) exceeded maximum context length (512).
Number of tokens (516) exceeded maximum context length (512).
Number of tokens (517) exceeded maximum context length (512).
Number of tokens (518) exceeded maximum context length (512).
Number of tokens (519) exceeded maximum context length (512).
Number of tokens (520) exceeded maximum context length (512).

...and so on.

Question: Why?
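
My guess: the count in the warnings goes up by one for each newly generated token, so the output seems to be counted against the 512-token window as well. A rough check (assuming the detokenize/tokenize round trip gives back the same 509 tokens):

prompt_tokens = 509                        # len(new_history_crop)
context_length = 512
headroom = context_length - prompt_tokens  # only 3 slots left for output
# After 3 generated tokens the window is full; the 4th takes the total to
# 513, which is exactly the first warning above.
print(prompt_tokens + headroom + 1)        # 513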

My second approach

# new_history_crop is a list of 509 tokens

response = model.generate(
    tokens=new_history_crop,
    top_k=config["ai"]["top_k"],
    top_p=config["ai"]["top_p"],
    temperature=config["ai"]["temperature"],
    repetition_penalty=config["ai"]["repetition_penalty"],
    last_n_tokens=config["ai"]["last_n_tokens"],
    batch_size=config["ai"]["batch_size"],
    threads=config["ai"]["threads"],
)

response = model.detokenize(list(response))

And this works! But there are two problems:

1. It's slower.
2. It doesn't support all the parameters the first approach does (see the sketch below).

Please help me fix this and/or explain why this happens.
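
For problem 2, one workaround I can think of (just a sketch, not tested): as far as I can tell, generate() yields tokens lazily, so the number of new tokens can be capped manually instead of passing max_new_tokens:

from itertools import islice

# Sketch: cap generation at max_new_tokens by slicing the token generator,
# since generate() doesn't seem to accept max_new_tokens / stop the way
# calling the model directly does.
token_stream = model.generate(
    tokens=new_history_crop,
    top_k=config["ai"]["top_k"],
    top_p=config["ai"]["top_p"],
    temperature=config["ai"]["temperature"],
    repetition_penalty=config["ai"]["repetition_penalty"],
    last_n_tokens=config["ai"]["last_n_tokens"],
    batch_size=config["ai"]["batch_size"],
    threads=config["ai"]["threads"],
)
capped = islice(token_stream, config["ai"]["max_new_tokens"])
response = model.detokenize(list(capped))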

I found a solution myself:

  1. The second approach isn't worth using.
  2. Context length is used for both the model's input and output: the prompt and the generated tokens share the same window.
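
So one way to apply that to the first approach is to leave room for the output when cropping the prompt, something like this (a sketch, using the same config keys as above and assuming max_new_tokens < context_length):

# Leave room for the generated tokens, not just 3 spare slots
budget = config["ai"]["context_length"] - config["ai"]["max_new_tokens"]
new_history_crop = model.tokenize(new_history)[-budget:]
response = model(model.detokenize(new_history_crop), stream=False)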