microsoft / BlingFire

A lightning fast Finite State machine and REgular expression manipulation library.


Roberta tokenizer - first word in sentence doesn't match huggingface tokenizer

tomateb opened this issue · comments

In the original RoBERTa tokenizer, words are treated differently if they appear at the beginning of a sentence, i.e. if they don't have a space before them.

For example, the following code:

import os

import blingfire
from transformers import RobertaTokenizer

tok_hugging_face = RobertaTokenizer.from_pretrained('roberta-base')
tok_blingfire = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__), "roberta.bin"))

sentence = "test"
print(f'Sentence - {sentence}')  
print(f'Hugging Face - {tok_hugging_face(sentence)["input_ids"]}')  
print(f'BlingFire - {blingfire.text_to_ids(tok_blingfire, sentence, 1, 100)}')  
print()
sentence = "something test"
print(f'Sentence - {sentence}')
print(f'Hugging Face - {tok_hugging_face(sentence)["input_ids"]}')
print(f'BlingFire - {blingfire.text_to_ids(tok_blingfire, sentence, 2, 100)}')

produces the following output:

Sentence - test
Hugging Face - [0, 21959, 2]
BlingFire - [1296]

Sentence - something test
Hugging Face - [0, 18891, 1296, 2]
BlingFire - [ 402 1296]

In Hugging Face, 0 and 2 are the start and end tokens, so they can be ignored. As you can see, the word "test" received the same ID (1296) in both cases in BlingFire, whereas in Hugging Face its ID depends on whether it is the first word of the sentence.
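The behaviour can be illustrated with a minimal sketch. GPT-2/RoBERTa-style BPE encodes a leading space into the token itself (conventionally marked with "Ġ"), so "test" and " test" are distinct vocabulary entries; whether the first word gets the space-marked variant depends on whether a dummy prefix space is added. The toy vocabulary below reuses the IDs from the outputs above but is otherwise hypothetical, and the word-level split stands in for real subword BPE:

```python
# Toy vocabulary: IDs taken from the issue's outputs, mapping assumed for
# illustration only. "Ġ" marks a token that absorbs the space before it.
TOY_VOCAB = {"test": 21959, "Ġtest": 1296, "something": 18891, "Ġsomething": 402}

def toy_tokenize(sentence, add_prefix_space=False):
    """Whitespace-split stand-in for BPE: every word preceded by a space
    maps to its "Ġ"-prefixed vocabulary entry."""
    if add_prefix_space:
        sentence = " " + sentence  # dummy prefix: first word also gets "Ġ"
    ids = []
    for i, word in enumerate(sentence.split(" ")):
        if not word:
            continue  # skip the empty piece produced by a leading space
        key = word if i == 0 else "Ġ" + word
        ids.append(TOY_VOCAB[key])
    return ids

print(toy_tokenize("test"))                             # Hugging Face default
print(toy_tokenize("test", add_prefix_space=True))      # BlingFire's behaviour
print(toy_tokenize("something test"))
print(toy_tokenize("something test", add_prefix_space=True))
```

Under this sketch, `add_prefix_space=False` reproduces the Hugging Face IDs above ([21959] and [18891, 1296]) and `add_prefix_space=True` reproduces the BlingFire IDs ([1296] and [402, 1296]), which is consistent with BlingFire always adding a dummy prefix space by default.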

I think it has to do with the add_prefix_space=True/False parameter that Hugging Face has; the default behaviour probably differs between the two libraries. Could you please try adding a blingfire.change_settings_dummy_prefix(h, False) call after you have loaded the model, as shown here:

#82 (comment)