Tokenizer rare subwords

Question

garbanlp opened this issue a year ago

Hi, first of all, many thanks for this amazing work in Spanish 😍.

from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer

model_checkpoint = "PlanTL-GOB-ES/roberta-large-bne"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

rare_tokens_vocab = [word for word in tokenizer.vocab if 'Ġ' in word]
print(len(rare_tokens_vocab), len(tokenizer.vocab))
# 37695 50262

Almost 75% of the tokens (37695 out of 50262) contain the "Ġ" character. That is really strange! Is it due to dirty text in the corpus, or what is the reason for so many tokens with "Ġ"?

garbanlp · Answer 1 · Tue Jul 11 2023 21:44:12 GMT+0800 (China Standard Time)

I just discovered that "Ġ" stands for the leading space of a word:

print(tokenizer.vocab['Ġque'])
341
print(tokenizer.decode(341))
' que'
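
For anyone else wondering where the "Ġ" comes from: the sketch below assumes this checkpoint uses GPT-2/RoBERTa-style byte-level BPE, which the "Ġ" markers suggest. In that scheme every byte is mapped to a visible character before the vocabulary is built, and the space byte (32) ends up as "Ġ" (U+0120). The helper names `to_vocab_form` and `from_vocab_form` are made up for illustration and are not part of transformers; `bytes_to_unicode` is a standalone reimplementation of that table, so no model download is needed.

```python
def bytes_to_unicode():
    """Build the byte -> visible character table used by GPT-2-style byte-level BPE."""
    # Bytes that are already printable keep their own character.
    byte_values = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    codepoints = byte_values[:]
    offset = 0
    # Every remaining byte (control chars, space, ...) is shifted to 256 + offset.
    for b in range(256):
        if b not in byte_values:
            byte_values.append(b)
            codepoints.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, codepoints)}


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}


def to_vocab_form(text):
    """Hypothetical helper: plain text -> the form stored in the vocabulary."""
    return "".join(byte_encoder[b] for b in text.encode("utf-8"))


def from_vocab_form(token):
    """Hypothetical helper: vocabulary entry -> plain text."""
    return bytes(byte_decoder[ch] for ch in token).decode("utf-8")


print(to_vocab_form(" que"))           # Ġque
print(repr(from_vocab_form("Ġque")))   # ' que'
```

So a token starting with "Ġ" is simply a token that begins a new word after a space, and tokens without it are word continuations (or words at the very start of the text). Under that assumption, a high share of "Ġ" tokens is expected and does not indicate dirty text in the corpus.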