huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.

Home Page: https://huggingface.co/transformers

Auto-Converted Fast Tokenizer Producing Incorrect Results

young-geng opened this issue

System Info

  • transformers version: 4.30.1
  • Platform: Linux-5.15.107+-x86_64-with-glibc2.31
  • Python version: 3.10.12
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu118 (False)
  • Tensorflow version (GPU?): 2.12.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.6.9 (cpu)
  • Jax version: 0.4.10
  • JaxLib version: 0.4.10
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The auto-converted fast tokenizer for the LLaMA model sometimes does not produce the same tokenization results as the original SentencePiece tokenizer. This affects the OpenLLaMA models. Here's the code to reproduce it:

from transformers import AutoTokenizer

# Original slow (SentencePiece-based) tokenizer
tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_7b', use_fast=False)
# Auto-converted fast tokenizer
fast_tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_7b')

text = 'thermal'
print(tokenizer.encode(text))
print(fast_tokenizer.encode(text))

The code produces the following output:

[1, 14412]
[1, 31822, 496, 12719]

Expected behavior

The auto-converted fast tokenizer should produce exactly the same tokens as the original SentencePiece tokenizer.


Hey! Thanks for reporting. I am investigating this !

Hi, I have a fix. It also makes the conversion process a lot faster (it is super slow on my machine right now for some reason). Is it ok if I make a PR?

@young-geng do you have other examples of words that go wrong? I think I've fixed it, but more evidence would also be nice 😸

@stephantul I can dig into it more to find some more examples. Could you tell me why this happens?

I'm still a bit confused as to the exact cause of the issue. I think it has to do with the way the merges are ordered. I'm now running the slow conversion process, which takes a long time, but the new fast conversion process at least fixes the "thermal" example you had above.

After that, I can compare and give you a proper analysis, should be done later today.

The issue was that your tokenizer has a merge with a score of 0, namely _t. This merge wasn't properly recorded, because the conversion code checked the merge score for falsiness rather than checking whether it existed.

That is, it checked if vocab_score:, when it should have checked if vocab_score is None:. Because of this, it dropped _t as a possible merge, which affects _thermal and other words starting with a lowercase t.
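For illustration, here is a minimal sketch (not the actual conversion code; the merge pairs and scores below are made up) of how a truthiness check silently drops a merge whose score is 0, while an explicit None check keeps it:

# Made-up merges for illustration only; ('_', 't') stands in for the real _t merge.
merges = {('_', 't'): 0.0, ('th', 'er'): -1.5}

# Truthiness check: a score of 0.0 is falsy, so the ('_', 't') merge is dropped.
kept_by_truthiness = {pair: score for pair, score in merges.items() if score}

# Explicit None check: only genuinely missing scores are dropped, so 0.0 survives.
kept_by_none_check = {pair: score for pair, score in merges.items() if score is not None}

print(kept_by_truthiness)   # {('th', 'er'): -1.5}
print(kept_by_none_check)   # {('_', 't'): 0.0, ('th', 'er'): -1.5}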


Great work @stephantul! Will review your PR and merge it ASAP!

@ArthurZucker

I have encountered the same inconsistency. For various reasons, it is difficult for us to always use the latest version. Could you please let me know in which version of transformers this issue was fixed?

Hey! The fix is included in the following releases: v4.35.2, v4.35.1, v4.35.0, v4.34.1, v4.34.0, v4.33.3, v4.33.2, v4.33.1, v4.33.0, v4.32.1, v4.32.0, v4.31.0

@ArthurZucker

Thank you for your response.

In the case of the Llama 2 tokenizer, I have confirmed that all 8.56 billion tokens in the datasets of well-known LLMs are tokenized identically by the fast and slow tokenizers, even with transformers version 4.31.0.

[screenshot attached]
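For anyone who wants to run a similar slow/fast comparison, a minimal sketch could look like the following (the checkpoint name is taken from the original report and, like the list of texts, is only a placeholder for whatever model and dataset you are checking; this is not the exact setup behind the screenshot above):

from transformers import AutoTokenizer

# Placeholder checkpoint and texts; swap in your own model and dataset.
checkpoint = 'openlm-research/open_llama_7b'
slow = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
fast = AutoTokenizer.from_pretrained(checkpoint)

texts = ['thermal', 'the quick brown fox']
for text in texts:
    assert slow.encode(text) == fast.encode(text), f'Mismatch on {text!r}'
print('All texts tokenized identically by the slow and fast tokenizers.')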

Awesome 🚀

Hey, I think the bug might be back.

I've just updated to the most recent versions of transformers and tokenizers, and my slow/fast equivalence test started failing for dinhanhx/llama-tokenizer-hf and mistralai/Mistral-7B-v0.3.

Hey! Can you either share a small reproducer or share the tests you are running?