facebookresearch / LASER

Despite the mandatory preprocessing step that lowercases everything, I found many tokens in the vocabulary that contain uppercase characters.

$ grep -c '[A-Z]' laser2.cvocab
2741

Does this mean that 2741 slots of the vocabulary (and probably the corresponding rows of the input embedding layer) are being wasted?

They are not wasted, because word case often affects the semantics.
Here are some illustrative examples collected by the community: https://www.quora.com/What-words-change-their-meaning-when-they-are-uppercase-or-lowercase

Yeah, of course casing affects the semantics. But the code explicitly states that lowercasing is needed for all the models, and it is applied before tokenizing with SentencePiece.

LASER/source/lib/text_processing.py

Line 140 in faf08e8

+ '|' + ROMAN_LC + 'none'

Therefore, neither SentencePiece nor the model ever sees any of those uppercase characters.
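To make the point concrete, here is a minimal Python sketch of the same check as the `grep` above, plus a test for which tokens can never be produced once the input has been lowercased. It assumes each line of the vocabulary file holds a token followed by a whitespace-separated count; that layout, the helper names, and the sample entries are all hypothetical and not taken from the LASER code.

```python
import re

# Same character class as the grep command: ASCII uppercase only.
ASCII_UPPER = re.compile(r"[A-Z]")


def uppercase_entries(lines):
    """Return the vocabulary lines that contain an ASCII uppercase letter."""
    return [line for line in lines if ASCII_UPPER.search(line)]


def unreachable_after_lowercasing(tokens):
    """Return tokens that lowercased input text can never produce.

    Note: str.lower() also covers non-ASCII uppercase letters, so this
    check is slightly broader than the [A-Z] grep.
    """
    return [tok for tok in tokens if tok != tok.lower()]


# Hypothetical sample in an assumed "token count" layout.
sample = ["▁the 1000", "▁The 12", "▁NASA 3", "ing 500"]

hits = uppercase_entries(sample)
tokens = [line.split()[0] for line in sample]
dead = unreachable_after_lowercasing(tokens)

print(len(hits), "entries contain uppercase characters")
print("unreachable after lowercasing:", dead)
```

To run it on the real file, the sample list could be replaced with the lines read from `laser2.cvocab`, provided the file really follows the assumed layout.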

Hi @ZJaume, thanks for pointing this out! Indeed, when I checked myself, I also saw uppercase characters in the trained vocabulary. It's possible that a small amount of uppercase data ended up in the sample used to train the SentencePiece model. However, I would recommend continuing to use lowercased inputs for inference, since the training data was not cased.

LASER2 vocab contains upper case characters