rustformers / llm

[Unmaintained, see README] An ecosystem of Rust libraries for working with large language models

Home Page: https://docs.rs/llm/latest/llm/


EOS is not read from gguf format

Alisa-lisa opened this issue

I have discovered that running the same model with the same parameters from llm (gguf branch) and from llama.cpp results in different behavior. llm does not seem to read the EOS token, so the model keeps generating output until the max-token limit is reached.
Here is llama.cpp:

[screenshot: llama.cpp output]

And the same model from llm:

[screenshot: llm output]
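For context, the expected behavior is that generation stops as soon as the sampled token equals the model's EOS id, with the max-token limit acting only as a backstop. A minimal, self-contained sketch of that stopping condition (`sample_next` and the `EOS_ID` value are hypothetical stand-ins, not llm's API):

```rust
// Minimal sketch of the expected stopping condition. `sample_next` is a
// hypothetical stand-in for the real sampler; EOS_ID uses LLaMA's
// historical id 2 purely for illustration.
const EOS_ID: u32 = 2;
const MAX_TOKENS: usize = 512;

fn sample_next(step: usize) -> u32 {
    // Pretend the model emits EOS on the sixth step.
    if step == 5 { EOS_ID } else { 42 }
}

fn main() {
    let mut output: Vec<u32> = Vec::new();
    for step in 0..MAX_TOKENS {
        let token = sample_next(step);
        // Without recognising EOS, this loop runs until MAX_TOKENS,
        // which is the behavior reported above.
        if token == EOS_ID {
            break;
        }
        output.push(token);
    }
    println!("generated {} tokens", output.len());
}
```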

According to a discussion on Discord, it might indeed be a bug.

Thanks for reporting this! For my own reference: the issue is that llm doesn't get the EOS token from the tokenizer - instead, it assumes it is the hardcoded token `</s>`. That made sense in the early days of LLaMA, but is no longer true:

```rust
// llm assumes the EOS token is always `</s>`, falling back to id 2:
self.tokenizer().id("</s>".as_bytes()).unwrap_or(2)
```
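GGUF files record the correct id in their metadata under the key `tokenizer.ggml.eos_token_id` (per the GGUF spec), so the fix is to read it from there rather than hardcoding `</s>`. A minimal sketch of that lookup, using a plain `HashMap` as a hypothetical stand-in for llm's metadata table:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the metadata table that llm's GGUF loader
// builds while parsing a file.
type Metadata = HashMap<String, u32>;

// Prefer the EOS id recorded in GGUF metadata; fall back to LLaMA's
// historical hardcoded id 2 only if the key is missing.
fn eos_token_id(metadata: &Metadata) -> u32 {
    metadata
        .get("tokenizer.ggml.eos_token_id")
        .copied()
        .unwrap_or(2)
}

fn main() {
    let mut metadata = Metadata::new();
    metadata.insert("tokenizer.ggml.eos_token_id".to_string(), 32_000);
    assert_eq!(eos_token_id(&metadata), 32_000);
}
```

Keeping the `unwrap_or(2)` fallback preserves the old behavior for files that omit the key, while models whose EOS differs from `</s>` stop correctly.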