Auto-Converted Fast Tokenizer Producing Incorrect Results
young-geng opened this issue · comments
System Info
- transformers version: 4.30.1
- Platform: Linux-5.15.107+-x86_64-with-glibc2.31
- Python version: 3.10.12
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.1
- PyTorch version (GPU?): 2.0.1+cu118 (False)
- Tensorflow version (GPU?): 2.12.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.6.9 (cpu)
- Jax version: 0.4.10
- JaxLib version: 0.4.10
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
The auto-converted fast tokenizer for the LLaMA model sometimes does not produce the same tokenization results as the original SentencePiece tokenizer. This is affecting the OpenLLaMA models. Here's the code to reproduce it:
from transformers import AutoTokenizer

# Slow (original SentencePiece) tokenizer
tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_7b', use_fast=False)
# Auto-converted fast tokenizer
fast_tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_7b')

text = 'thermal'
print(tokenizer.encode(text))
print(fast_tokenizer.encode(text))
The code produces the following output:
[1, 14412]
[1, 31822, 496, 12719]
Expected behavior
The auto-converted fast tokenizer should produce exactly the same tokens as the original SentencePiece tokenizer.
Hey! Thanks for reporting. I am investigating this!
Hi, I have a fix. It also makes the conversion process a lot faster (it is super slow on my machine right now for some reason). Is it ok if I make a PR?
@young-geng do you have other examples of words that go wrong? I think I've fixed it, but more evidence would also be nice 😸
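One way to hunt for more divergent words is to brute-force compare the two tokenizers over a list of candidate strings. Here is a small, hypothetical helper (not part of transformers) that collects the first few mismatches; the commented usage shows how it could be wired up to the tokenizers from the reproduction above.

```python
def find_mismatches(words, encode_a, encode_b, limit=10):
    """Return up to `limit` strings that the two encoders tokenize differently."""
    out = []
    for w in words:
        if encode_a(w) != encode_b(w):
            out.append(w)
            if len(out) >= limit:
                break
    return out

# Hypothetical usage with the tokenizers from the reproduction snippet:
# mismatches = find_mismatches(tokenizer.get_vocab(), tokenizer.encode,
#                              fast_tokenizer.encode)
```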
@stephantul I can dig into it more to find some more examples. Could you tell me why this happens?
I'm still a bit confused as to the exact cause of the issue. I think it has to do with the way the merges are ordered. I'm now running the slow conversion process, which takes a long time, but the new fast conversion process at least fixes the "thermal" example you had above.
After that, I can compare and give you a proper analysis, should be done later today.
The issue was that your tokenizer has a merge with a score of 0, namely _t. This merge wasn't properly recorded, because the conversion code checked for falsiness of the merge score rather than whether it existed: it checked if vocab_score:, but it should have been checking if vocab_score is None:. Because of this, the conversion dropped _t as a possible merge, which affects _thermal and other words starting with a lowercase t.
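A toy illustration of the bug (not the actual transformers conversion code): because 0.0 is falsy in Python, a truthiness check treats a zero-score merge as if it were missing, while an explicit None check keeps it.

```python
# Hypothetical merge scores; "_t" has the problematic score of 0.0.
vocab_scores = {"_t": 0.0, "th": -1.5, "er": -2.0}

# Buggy filter, mirroring the old `if vocab_score:` check:
# 0.0 is falsy, so "_t" is silently dropped.
kept_buggy = [tok for tok, score in vocab_scores.items() if score]

# Fixed filter, mirroring the corrected `if vocab_score is None:` check:
# only a genuinely missing score disqualifies the merge.
kept_fixed = [tok for tok, score in vocab_scores.items() if score is not None]

print(kept_buggy)  # "_t" is missing
print(kept_fixed)  # all three merges survive
```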
Great work @stephantul ! Will review your PR to merge it asap!
I have encountered the same inconsistency. For various reasons it is difficult for us to use the latest version. Could you please let me know in which version of transformers this issue was fixed?
Awesome 🚀
Hey, I think the bug might be back.
I've just updated to the most recent versions of transformers and tokenizers, and my slow/fast equivalence test started failing for dinhanhx/llama-tokenizer-hf and mistralai/Mistral-7B-v0.3.
Hey! Can you either share a small reproducer or share the tests you are running?
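A minimal equivalence check along those lines could look like the sketch below. The assertion helper is generic; the commented part shows how it might be pointed at the two models named in the report (hedged, since running it requires downloading the tokenizers).

```python
def assert_equivalent(encode_slow, encode_fast, texts):
    """Fail with a readable message on the first text whose encodings differ."""
    for t in texts:
        slow, fast = encode_slow(t), encode_fast(t)
        assert slow == fast, f"{t!r}: slow={slow} fast={fast}"

# Hedged sketch of the actual reproducer:
# from transformers import AutoTokenizer
# for name in ("dinhanhx/llama-tokenizer-hf", "mistralai/Mistral-7B-v0.3"):
#     slow = AutoTokenizer.from_pretrained(name, use_fast=False)
#     fast = AutoTokenizer.from_pretrained(name)
#     assert_equivalent(slow.encode, fast.encode, ["thermal", "Hello world"])
```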