Tokenizer rare subwords
garbanlp opened this issue · comments
garbanlp commented
Hi, first of all, many thanks for this amazing work on Spanish 😍.
from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer
model_checkpoint = "PlanTL-GOB-ES/roberta-large-bne"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
space_prefixed_tokens = [word for word in tokenizer.vocab if 'Ġ' in word]
print(len(space_prefixed_tokens))
# 37695 (out of 50262 tokens in the vocabulary)
Almost 75% of the tokens contain the "Ġ" character. That seems really strange! Is it due to dirty text in the corpus, or what is the reason for so many tokens containing "Ġ"?
garbanlp commented
I just discovered that "Ġ" represents a leading space before the word:
print(tokenizer.vocab['Ġque'])
341
print(tokenizer.decode(341))
' que'
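That is expected for byte-level BPE: before learning merges, the tokenizer maps every raw byte to a printable character so the vocabulary file contains no whitespace or control characters, and the space byte (0x20) lands on "Ġ". Since most tokens begin a word (and therefore absorb the preceding space), most vocabulary entries carry it. Below is a minimal sketch of the standard GPT-2-style byte-to-unicode table; I'm assuming the BNE tokenizer uses this same scheme, as RoBERTa tokenizers generally do:

```python
# Sketch of the byte-to-unicode table used by GPT-2-style byte-level
# BPE tokenizers (which RoBERTa-family tokenizers inherit). This is
# a reconstruction of the standard algorithm, not the model's own code.

def bytes_to_unicode():
    """Map every byte value (0-255) to a single printable character."""
    # Printable byte ranges keep their own character...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # ...and every remaining byte (space, tabs, control bytes, ...)
    # is shifted up by 256 so it stays visible in the vocab file.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

mapping = bytes_to_unicode()
print(repr(mapping[ord(" ")]))  # the space byte maps to 'Ġ' (U+0120)
```

So a vocabulary entry like "Ġque" is simply the string " que" after this byte remapping, which is why decoding token 341 returns it with a leading space.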