EleutherAI / pythia

The hub for EleutherAI's work on interpretability and learning dynamics


Weird inconsistency in Tokenizer vocabulary

javirandor opened this issue

Hello everyone!

I found a weird inconsistency in the tokenizer vocabulary. I wanted to ask why this could be happening.

I have loaded a tokenizer from HF:

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

If I run

tokenizer.encode("\u200b")

The output is [12882]. However, looking at the vocabulary used for training (here), I cannot find the token \u200b, and the token id corresponds to a different string:

"\u00e2\u0122\u012d": 12882,

This seems to generally happen with unicode characters.
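
For what it's worth, here is the small check I put together (just a sketch; the local path to the vocabulary file is an assumption, i.e. the linked JSON saved as 20B_tokenizer.json next to the script):

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

# Encode the zero-width space and inspect what the HF tokenizer reports for that id
ids = tokenizer.encode("\u200b")
print(ids)                                   # [12882]
print(tokenizer.convert_ids_to_tokens(ids))  # the token string as stored internally
print(tokenizer.decode(ids))                 # expected to give back "\u200b"

# Cross-check against the raw training vocabulary file
# (assumption: the linked vocab JSON was saved locally as 20B_tokenizer.json)
with open("20B_tokenizer.json") as f:
    vocab = json.load(f)["model"]["vocab"]
id_to_token = {i: t for t, i in vocab.items()}
print(id_to_token[12882])                    # the "\u00e2\u0122\u012d" entry from the JSON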

Why could this be happening? I just want to make sure that the tokenizer I use for training is equivalent to the HF tokenizer, since my training (as anticipated in your README) results in a weird tokenizer.

Thanks a lot :)

I don't know exactly what's going on here yet, but I can confirm that the file at utils/20B_tokenizer.json is precisely the one used as the vocab_file during Pythia training.

Also, the following snippet shows the result of loading the two tokenizers and encoding \u200b:

>>> import transformers
>>> tok1 = transformers.PreTrainedTokenizerFast(tokenizer_file="utils/20B_tokenizer.json")
>>> tok2 = transformers.AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

>>> tok1("\u200b")
{'input_ids': [12882], 'token_type_ids': [0], 'attention_mask': [1]}
>>> tok2("\u200b")
{'input_ids': [12882], 'attention_mask': [1]}
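
If useful, a broader sanity check along the same lines (just a sketch, assuming the two objects above are already loaded) would be to compare the full vocabularies and a few more encodings:

>>> tok1.get_vocab() == tok2.get_vocab()  # expected True if both wrap the same vocabulary
>>> all(tok1(s)["input_ids"] == tok2(s)["input_ids"] for s in ["\u200b", "hello world", "caf\u00e9"])  # expected True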