PlanTL-GOB-ES / lm-spanish

Official source for Spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Tokenizer rare subwords

garbanlp opened this issue

Hi, first of all, many thanks for this amazing work in Spanish 😍.

from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer

model_checkpoint = "PlanTL-GOB-ES/roberta-large-bne"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

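# Collect every vocabulary entry that contains the "Ġ" character
rare_tokens_vocab = [word for word in tokenizer.vocab if 'Ġ' in word]
# Print the number of such tokens next to the total vocabulary size
print(len(rare_tokens_vocab), len(tokenizer.vocab))
# 37695 50262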

Almost 75% of the tokens (37695 of 50262) contain the "Ġ" character, which looks really strange. Is it due to dirty text in the corpus, or is there another reason for so many tokens with "Ġ"?

I just discovered that "Ġ" stands for the space that precedes a word:

print(tokenizer.vocab['Ġque'])
# 341
print(repr(tokenizer.decode(341)))
# ' que'
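
For anyone else who runs into this: the behaviour above is what you would expect if the tokenizer uses GPT-2 style byte-level BPE, as RoBERTa tokenizers generally do. That is an assumption about this checkpoint, not something I verified in its files. In that scheme every byte is shown as a printable character, and the space byte (0x20) ends up as "Ġ" (U+0120). Below is a minimal standalone sketch of that byte-to-character table. It does not call `transformers`, and the helper names `bytes_to_unicode`, `to_vocab_form` and `from_vocab_form` are only illustrative.

```python
def bytes_to_unicode():
    """Build a byte -> printable character table in the GPT-2 style."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Every other byte (control characters, space, ...) is shifted past U+0100.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_vocab_form(text):
    """Show how a piece of text would be spelled inside the vocabulary."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_vocab_form(token):
    """Turn a vocabulary spelling back into ordinary text."""
    return bytes(CHAR_TO_BYTE[c] for c in token).decode("utf-8")


print(to_vocab_form(" que"))          # Ġque
print(repr(from_vocab_form("Ġque")))  # ' que'
```

Under that assumption, a token containing "Ġ" is simply a token that starts a new word after a space. A large share of such tokens is then normal for a corpus of ordinary text rather than a sign of dirty data.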