marella / ctransformers

Python bindings for the Transformer models implemented in C/C++ using GGML library.

Something wrong with a generator

yukiarimo opened this issue

My first approach

from ctransformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    config["server"]["models_dir"] + config["server"]["default_model_file"],
    model_type='llama2',
    max_new_tokens=config["ai"]["max_new_tokens"],
    context_length=config["ai"]["context_length"],
    temperature=config["ai"]["temperature"],
    repetition_penalty=config["ai"]["repetition_penalty"],
    last_n_tokens=config["ai"]["last_n_tokens"],
    seed=config["ai"]["seed"],
    top_k=config["ai"]["top_k"],
    top_p=config["ai"]["top_p"],
    stop=config["ai"]["stop"],
    batch_size=config["ai"]["batch_size"],
    gpu_layers=config["ai"]["gpu_layers"]
)
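
For reference, the config entries used in these snippets are assumed to look roughly like this (illustrative values only, not my real config):

# Illustrative only -- the actual values live in my config file
config = {
    "server": {
        "models_dir": "models/",
        "default_model_file": "llama-2-7b-chat.ggmlv3.q4_0.bin",
    },
    "ai": {
        "max_new_tokens": 256,
        "context_length": 512,
        "temperature": 0.7,
        "repetition_penalty": 1.1,
        "last_n_tokens": 64,
        "seed": -1,
        "top_k": 40,
        "top_p": 0.95,
        "stop": ["###"],
        "batch_size": 8,
        "gpu_layers": 0,
        "threads": 4,
    },
}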


print("TOKENS: ", len(model.tokenize(new_history)))
# A lot

new_history_crop = model.tokenize(new_history)
# Keep only the last (context_length - 3) tokens
new_history_crop = new_history_crop[-(config["ai"]["context_length"] - 3):]
print("CONTEXT LENGTH: ", -(config["ai"]["context_length"] - 3))

# This will be 509 (allowed 512)

print(len(new_history_crop))
response = model(model.detokenize(new_history_crop), stream=False)

But the generator produces a stream of errors:

Number of tokens (513) exceeded maximum context length (512).
Number of tokens (514) exceeded maximum context length (512).
Number of tokens (515) exceeded maximum context length (512).
Number of tokens (516) exceeded maximum context length (512).
Number of tokens (517) exceeded maximum context length (512).
Number of tokens (518) exceeded maximum context length (512).
Number of tokens (519) exceeded maximum context length (512).
Number of tokens (520) exceeded maximum context length (512).

...and so on.

Question: Why?
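
My guess: the count in the warnings goes up by one for each newly generated token, so the output seems to be counted against the 512-token window as well. A rough check (assuming the detokenize/tokenize round trip gives back the same 509 tokens):

prompt_tokens = 509                        # len(new_history_crop)
context_length = 512
headroom = context_length - prompt_tokens  # only 3 slots left for output
# After 3 generated tokens the window is full; the 4th takes the total to
# 513, which is exactly the first warning above.
print(prompt_tokens + headroom + 1)        # 513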

My second approach

# new_history_crop is a list of 509 tokens

response = model.generate(
    tokens=new_history_crop,
    top_k=config["ai"]["top_k"],
    top_p=config["ai"]["top_p"],
    temperature=config["ai"]["temperature"],
    repetition_penalty=config["ai"]["repetition_penalty"],
    last_n_tokens=config["ai"]["last_n_tokens"],
    batch_size=config["ai"]["batch_size"],
    threads=config["ai"]["threads"],
)

response = model.detokenize(list(response))

And this works! But there are two problems:

1. It's slower.
2. It doesn't support all the parameters the first approach does (see the sketch below).

Please help me fix this and/or explain why this happens.
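
For problem 2, one workaround I can think of (just a sketch, not tested): as far as I can tell, generate() yields tokens lazily, so the number of new tokens can be capped manually instead of passing max_new_tokens:

from itertools import islice

# Sketch: cap generation at max_new_tokens by slicing the token generator,
# since generate() doesn't seem to accept max_new_tokens / stop the way
# calling the model directly does.
token_stream = model.generate(
    tokens=new_history_crop,
    top_k=config["ai"]["top_k"],
    top_p=config["ai"]["top_p"],
    temperature=config["ai"]["temperature"],
    repetition_penalty=config["ai"]["repetition_penalty"],
    last_n_tokens=config["ai"]["last_n_tokens"],
    batch_size=config["ai"]["batch_size"],
    threads=config["ai"]["threads"],
)
capped = islice(token_stream, config["ai"]["max_new_tokens"])
response = model.detokenize(list(capped))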

I found a solution myself:

  1. The second approach isn't worth using.
  2. Context length is used for both the model's input and output: the prompt and the generated tokens share the same window.
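
So one way to apply that to the first approach is to leave room for the output when cropping the prompt, something like this (a sketch, using the same config keys as above and assuming max_new_tokens < context_length):

# Leave room for the generated tokens, not just 3 spare slots
budget = config["ai"]["context_length"] - config["ai"]["max_new_tokens"]
new_history_crop = model.tokenize(new_history)[-budget:]
response = model(model.detokenize(new_history_crop), stream=False)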