facebookresearch / LASER

Language-Agnostic SEntence Representations

Change default tokenizer for sentence embeddings

julianpollmann opened this issue

The default tokenizer for sentence embeddings is Moses. Specifying a token language that Moses does not support leads to a warning:

`WARNING: No known abbreviations for language 'pes_Arab', attempting fall-back to English version...`

I'm not sure what effect this has on the end result, but correct tokenization would be preferable.
The LASER2 and LASER3 models cover many languages, including some that may not be tokenized properly (like the example above).
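For illustration, the warning quoted above comes from the Moses Perl tokenizer scripts that LASER(1) wraps. A minimal sketch using the sacremoses Python port (an assumption for illustration, not the repo's own code path) shows the same fall-back behavior:

```python
# Minimal sketch, assuming the sacremoses port of the Moses tokenizer
# (pip install sacremoses); LASER(1) itself wraps the original Perl scripts,
# which print the warning quoted above.
from sacremoses import MosesTokenizer

# There are no Moses rules for 'pes_Arab', so English defaults are used
# (sacremoses is assumed to fall back silently rather than warn).
tokenizer = MosesTokenizer(lang="pes_Arab")
print(tokenizer.tokenize("این یک جمله آزمایشی است."))
```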

One solution could be a feature for configuring different tokenizers, as sketched below.
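A purely hypothetical sketch of what such a configuration option could look like; nothing below exists in LASER today, and every name is invented for illustration:

```python
# Hypothetical sketch only: LASER has no such registry; the names below
# are invented to illustrate the proposed feature.
from typing import Callable, Dict, List

def moses_tokenize(text: str) -> List[str]:
    return text.split()  # stand-in for word-level Moses rules

def spm_tokenize(text: str) -> List[str]:
    return text.split()  # stand-in for SentencePiece subwords

TOKENIZERS: Dict[str, Callable[[str], List[str]]] = {
    "moses": moses_tokenize,
    "spm": spm_tokenize,
}

def tokenize(text: str, tokenizer: str = "moses") -> List[str]:
    """Pick the tokenizer by name instead of hard-coding Moses."""
    return TOKENIZERS[tokenizer](text)

print(tokenize("یک جمله آزمایشی", tokenizer="spm"))
```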

Hi @julianpollmann! The Moses tokenizer was only used for the LASER(1) models; LASER2/3 migrated to SentencePiece. When embedding with LASER2/3, I would recommend following this script. Hope this helps!
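For reference, a minimal sketch of that SentencePiece-based path, assuming the laser_encoders package published from this repository (pip install laser_encoders); LaserEncoderPipeline and encode_sentences follow its documented API, and pes_Arab is covered by a LASER3 model:

```python
# Minimal sketch, assuming the laser_encoders package from this repository
# (pip install laser_encoders). Tokenization happens internally with
# SentencePiece, so no Moses fall-back warning is triggered.
from laser_encoders import LaserEncoderPipeline

# Downloads the LASER3 model and SentencePiece files for Persian on first use.
encoder = LaserEncoderPipeline(lang="pes_Arab")

embeddings = encoder.encode_sentences(["این یک جمله آزمایشی است."])
print(embeddings.shape)  # expected: (1, 1024)
```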