Something wrong with a generator
yukiarimo opened this issue
Yuki Arimo commented
My first approach
```python
from ctransformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    config["server"]["models_dir"] + config["server"]["default_model_file"],
    model_type='llama2',
    max_new_tokens=config["ai"]["max_new_tokens"],
    context_length=config["ai"]["context_length"],
    temperature=config["ai"]["temperature"],
    repetition_penalty=config["ai"]["repetition_penalty"],
    last_n_tokens=config["ai"]["last_n_tokens"],
    seed=config["ai"]["seed"],
    top_k=config["ai"]["top_k"],
    top_p=config["ai"]["top_p"],
    stop=config["ai"]["stop"],
    batch_size=config["ai"]["batch_size"],
    gpu_layers=config["ai"]["gpu_layers"]
)

print("TOKENS:", len(model.tokenize(new_history)))  # a lot

# Keep only the last context_length - 3 tokens of the history
new_history_crop = model.tokenize(new_history)
new_history_crop = new_history_crop[-(config["ai"]["context_length"] - 3):]

print("CROPPED LENGTH:", len(new_history_crop))  # 509 (allowed: 512)

response = model(model.detokenize(new_history_crop), stream=False)
```
But the generator produces this error:
```
Number of tokens (513) exceeded maximum context length (512).
Number of tokens (514) exceeded maximum context length (512).
Number of tokens (515) exceeded maximum context length (512).
Number of tokens (516) exceeded maximum context length (512).
Number of tokens (517) exceeded maximum context length (512).
Number of tokens (518) exceeded maximum context length (512).
Number of tokens (519) exceeded maximum context length (512).
Number of tokens (520) exceeded maximum context length (512).
```
...and so on.
Question: Why?
My second approach
```python
# new_history_crop is a list of 509 tokens
response = model.generate(
    tokens=new_history_crop,
    top_k=config["ai"]["top_k"],
    top_p=config["ai"]["top_p"],
    temperature=config["ai"]["temperature"],
    repetition_penalty=config["ai"]["repetition_penalty"],
    last_n_tokens=config["ai"]["last_n_tokens"],
    batch_size=config["ai"]["batch_size"],
    threads=config["ai"]["threads"],
)
response = model.detokenize(list(response))
```
And this works! But there are two problems:
1. It's slower (streaming at least hides the wait; see the sketch after this list).
2. It doesn't support all the parameters the first approach takes (for example, `max_new_tokens` and `stop`).
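A minimal streaming sketch, assuming `generate` yields token ids one at a time (same `model` and `new_history_crop` as above), which starts printing output immediately instead of blocking on `list(response)`:

```python
# Detokenize and print each token as soon as it is generated
for token in model.generate(tokens=new_history_crop):
    print(model.detokenize([token]), end="", flush=True)
```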
Please help me fix this and/or explain why it behaves this way.
Yuki Arimo commented
I found a solution myself:
- The second approach is not worth using.
- The context length is shared between the model's input and output: the prompt alone was 509 tokens, so after four generated tokens the total hit 513 and every further token kept exceeding the 512 limit. The prompt has to be cropped to `context_length - max_new_tokens`, not `context_length - 3`.
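With that in mind, a minimal fix for the first approach (a sketch, reusing the same `config` keys and `model` as above): reserve room for the output by cropping the prompt accordingly:

```python
# The context window must hold the prompt AND everything the model generates,
# so leave max_new_tokens slots free for the output
budget = config["ai"]["context_length"] - config["ai"]["max_new_tokens"]

new_history_crop = model.tokenize(new_history)[-budget:]

# prompt tokens + max_new_tokens now fit inside context_length
response = model(model.detokenize(new_history_crop), stream=False)
```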