facebookresearch / LASER

Language-Agnostic SEntence Representations

Change default tokenizer for sentence embeddings

julianpollmann opened this issue

The default tokenizer for sentence embeddings is Moses. Specifying a token language that Moses does not support leads to a warning:

`WARNING: No known abbreviations for language 'pes_Arab', attempting fall-back to English version...`

I'm not sure what effect this has on the end result, but correct tokenization would be preferable.
The LASER2 and LASER3 models cover many languages, including some that may not be tokenized properly (like the example above).
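For illustration, the warning quoted above comes from the Moses Perl tokenizer scripts that LASER(1) wraps. A minimal sketch using the sacremoses Python port (an assumption for illustration, not the repo's own code path) shows the same fall-back behavior:

```python
# Minimal sketch, assuming the sacremoses port of the Moses tokenizer
# (pip install sacremoses); LASER(1) itself wraps the original Perl scripts,
# which print the warning quoted above.
from sacremoses import MosesTokenizer

# There are no Moses rules for 'pes_Arab', so English defaults are used
# (sacremoses is assumed to fall back silently rather than warn).
tokenizer = MosesTokenizer(lang="pes_Arab")
print(tokenizer.tokenize("این یک جمله آزمایشی است."))
```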

One solution could be a feature for configuring different tokenizers, as sketched below.
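A purely hypothetical sketch of what such a configuration option could look like; nothing below exists in LASER today, and every name is invented for illustration:

```python
# Hypothetical sketch only: LASER has no such registry; the names below
# are invented to illustrate the proposed feature.
from typing import Callable, Dict, List

def moses_tokenize(text: str) -> List[str]:
    return text.split()  # stand-in for word-level Moses rules

def spm_tokenize(text: str) -> List[str]:
    return text.split()  # stand-in for SentencePiece subwords

TOKENIZERS: Dict[str, Callable[[str], List[str]]] = {
    "moses": moses_tokenize,
    "spm": spm_tokenize,
}

def tokenize(text: str, tokenizer: str = "moses") -> List[str]:
    """Pick the tokenizer by name instead of hard-coding Moses."""
    return TOKENIZERS[tokenizer](text)

print(tokenize("یک جمله آزمایشی", tokenizer="spm"))
```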

Hi @julianpollmann! The Moses tokenizer was only used for the LASER(1) models; LASER2/3 migrated to SentencePiece. When embedding with LASER2/3, I would recommend following this script. Hope this helps!
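For reference, a minimal sketch of that SentencePiece-based path, assuming the laser_encoders package published from this repository (pip install laser_encoders); LaserEncoderPipeline and encode_sentences follow its documented API, and pes_Arab is covered by a LASER3 model:

```python
# Minimal sketch, assuming the laser_encoders package from this repository
# (pip install laser_encoders). Tokenization happens internally with
# SentencePiece, so no Moses fall-back warning is triggered.
from laser_encoders import LaserEncoderPipeline

# Downloads the LASER3 model and SentencePiece files for Persian on first use.
encoder = LaserEncoderPipeline(lang="pes_Arab")

embeddings = encoder.encode_sentences(["این یک جمله آزمایشی است."])
print(embeddings.shape)  # expected: (1, 1024)
```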