microsoft / BlingFire

A lightning fast Finite State machine and REgular expression manipulation library.


Roberta tokenizer - first word in sentence doesn't match huggingface tokenizer

tomateb opened this issue · comments

In the original RoBERTa tokenizer, words are treated differently if they appear at the beginning of a sentence, i.e. if they don't have a space before them.

For example, the following code:

import os

import blingfire
from transformers import RobertaTokenizer

tok_hugging_face = RobertaTokenizer.from_pretrained('roberta-base')
tok_blingfire = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__), "roberta.bin"))

sentence = "test"
print(f'Sentence - {sentence}')  
print(f'Hugging Face - {tok_hugging_face(sentence)["input_ids"]}')  
print(f'BlingFire - {blingfire.text_to_ids(tok_blingfire, sentence, 1, 100)}')  
print()
sentence = "something test"
print(f'Sentence - {sentence}')
print(f'Hugging Face - {tok_hugging_face(sentence)["input_ids"]}')
print(f'BlingFire - {blingfire.text_to_ids(tok_blingfire, sentence, 2, 100)}')

produces the following output:

Sentence - test
Hugging Face - [0, 21959, 2]
BlingFire - [1296]

Sentence - something test
Hugging Face - [0, 18891, 1296, 2]
BlingFire - [ 402 1296]

In Hugging Face, 0 and 2 are the start and end tokens, so they can be ignored. As you can see, the word "test" received the same ID (1296) in both cases in BlingFire, whereas in Hugging Face its ID depends on whether it is the first word of the sentence.
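The behaviour can be illustrated with a minimal sketch. GPT-2/RoBERTa-style BPE encodes a leading space into the token itself (conventionally marked with "Ġ"), so "test" and " test" are distinct vocabulary entries; whether the first word gets the space-marked variant depends on whether a dummy prefix space is added. The toy vocabulary below reuses the IDs from the outputs above but is otherwise hypothetical, and the word-level split stands in for real subword BPE:

```python
# Toy vocabulary: IDs taken from the issue's outputs, mapping assumed for
# illustration only. "Ġ" marks a token that absorbs the space before it.
TOY_VOCAB = {"test": 21959, "Ġtest": 1296, "something": 18891, "Ġsomething": 402}

def toy_tokenize(sentence, add_prefix_space=False):
    """Whitespace-split stand-in for BPE: every word preceded by a space
    maps to its "Ġ"-prefixed vocabulary entry."""
    if add_prefix_space:
        sentence = " " + sentence  # dummy prefix: first word also gets "Ġ"
    ids = []
    for i, word in enumerate(sentence.split(" ")):
        if not word:
            continue  # skip the empty piece produced by a leading space
        key = word if i == 0 else "Ġ" + word
        ids.append(TOY_VOCAB[key])
    return ids

print(toy_tokenize("test"))                             # Hugging Face default
print(toy_tokenize("test", add_prefix_space=True))      # BlingFire's behaviour
print(toy_tokenize("something test"))
print(toy_tokenize("something test", add_prefix_space=True))
```

Under this sketch, `add_prefix_space=False` reproduces the Hugging Face IDs above ([21959] and [18891, 1296]) and `add_prefix_space=True` reproduces the BlingFire IDs ([1296] and [402, 1296]), which is consistent with BlingFire always adding a dummy prefix space by default.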

I think it has to do with the add_prefix_space=True/False parameter that Hugging Face has; the default behaviour probably differs between the two libraries. Could you please try adding a blingfire.change_settings_dummy_prefix(h, False) call after you have loaded the model, as shown here:

#82 (comment)