huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.

Home Page: https://huggingface.co/transformers

Auto-Converted Fast Tokenizer Producing Incorrect Results

young-geng opened this issue

System Info

  • transformers version: 4.30.1
  • Platform: Linux-5.15.107+-x86_64-with-glibc2.31
  • Python version: 3.10.12
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu118 (False)
  • Tensorflow version (GPU?): 2.12.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.6.9 (cpu)
  • Jax version: 0.4.10
  • JaxLib version: 0.4.10
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The auto-converted fast tokenizer for the LLaMA model sometimes does not produce the same tokenization results as the original SentencePiece tokenizer. This affects the OpenLLaMA models. Here's the code to reproduce it:

from transformers import AutoTokenizer

# Original slow (SentencePiece-based) tokenizer
tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_7b', use_fast=False)
# Auto-converted fast tokenizer
fast_tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_7b')

text = 'thermal'
print(tokenizer.encode(text))
print(fast_tokenizer.encode(text))

The code produces the following output:

[1, 14412]
[1, 31822, 496, 12719]

Expected behavior

The auto-converted fast tokenizer should produce exactly the same tokens as the original SentencePiece tokenizer.


Hey! Thanks for reporting. I am investigating this !

Hi, I have a fix. It also makes the conversion process a lot faster (it is super slow on my machine right now for some reason). Is it ok if I make a PR?

@young-geng do you have other examples of words that go wrong? I think I've fixed it, but more evidence would also be nice 😸

@stephantul I can dig into it more to find some more examples. Could you tell me why this happens?

I'm still a bit confused as to the exact cause of the issue. I think it has to do with the way the merges are ordered. I'm now running the slow conversion process, which takes a long time, but the new fast conversion process at least fixes the "thermal" example you had above.

After that, I can compare and give you a proper analysis, should be done later today.

The issue was that your tokenizer has a merge with a score of 0, namely _t. This merge wasn't properly recorded, because the conversion code checked the merge score for falsiness rather than checking whether it existed.

That is, it checked if vocab_score:, when it should have checked if vocab_score is None:. Because of this, it dropped _t as a possible merge, which affects _thermal and other words starting with a lowercase t.
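For illustration, here is a minimal sketch (not the actual conversion code; the merge pairs and scores below are made up) of how a truthiness check silently drops a merge whose score is 0, while an explicit None check keeps it:

# Made-up merges for illustration only; ('_', 't') stands in for the real _t merge.
merges = {('_', 't'): 0.0, ('th', 'er'): -1.5}

# Truthiness check: a score of 0.0 is falsy, so the ('_', 't') merge is dropped.
kept_by_truthiness = {pair: score for pair, score in merges.items() if score}

# Explicit None check: only genuinely missing scores are dropped, so 0.0 survives.
kept_by_none_check = {pair: score for pair, score in merges.items() if score is not None}

print(kept_by_truthiness)   # {('th', 'er'): -1.5}
print(kept_by_none_check)   # {('_', 't'): 0.0, ('th', 'er'): -1.5}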


Great work @stephantul! Will review your PR and merge it ASAP!

@ArthurZucker

I have encountered the same inconsistency. For various reasons, it is difficult for us to always use the latest version. Could you please let me know in which version of transformers this issue was fixed?

Hey! The fix is included in the following releases: v4.35.2, v4.35.1, v4.35.0, v4.34.1, v4.34.0, v4.33.3, v4.33.2, v4.33.1, v4.33.0, v4.32.1, v4.32.0, v4.31.0

@ArthurZucker

Thank you for your response.

In the case of the Llama 2 tokenizer, I have confirmed that all 8.56 billion tokens in the datasets of well-known LLMs are tokenized identically by the fast and slow tokenizers, even with transformers version 4.31.0.

[screenshot attached]
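For anyone who wants to run a similar slow/fast comparison, a minimal sketch could look like the following (the checkpoint name is taken from the original report and, like the list of texts, is only a placeholder for whatever model and dataset you are checking; this is not the exact setup behind the screenshot above):

from transformers import AutoTokenizer

# Placeholder checkpoint and texts; swap in your own model and dataset.
checkpoint = 'openlm-research/open_llama_7b'
slow = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
fast = AutoTokenizer.from_pretrained(checkpoint)

texts = ['thermal', 'the quick brown fox']
for text in texts:
    assert slow.encode(text) == fast.encode(text), f'Mismatch on {text!r}'
print('All texts tokenized identically by the slow and fast tokenizers.')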

Awesome 🚀

Hey, I think the bug might be back.

I've just updated to the most recent versions of transformers and tokenizers, and my slow/fast equivalence test started failing for dinhanhx/llama-tokenizer-hf and mistralai/Mistral-7B-v0.3.

Hey! Can you either share a small reproducer or share the tests you are running?