facebookresearch / LASER

Language-Agnostic SEntence Representations

LASER v2: setting the language to compute embeddings on

celcof opened this issue

Thank you for your work.

From the README for the embed task, I understand that LASER v3 languages require the language to be passed explicitly, whereas for the 93 languages supported by LASER v2, ./embed.sh input_file output_file is sufficient. How does it know the language, then? Does it default to English? I looked at embed.py but could not work out which argument controls this (changing --spm-lang, for example, seems to have no effect on the output).

Hi @celcof! The command ./embed.sh input_file output_file will indeed default to the LASER v2 model. This model is multilingual and natively supports 93 languages, so no language needs to be specified. Hope this helps!
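
For anyone scripting this, here is a minimal sketch of the two call patterns discussed above. The helper build_embed_command is hypothetical (it is not part of LASER), and passing the language as a third positional argument for LASER v3 is an assumption based on my reading of the embed README, so please check it against your checkout. The code "wol_Latn" is only an illustrative value.

```python
from typing import List, Optional


def build_embed_command(
    input_file: str,
    output_file: str,
    language: Optional[str] = None,
) -> List[str]:
    """Build the argument list for ./embed.sh (hypothetical helper).

    With no language, the script falls back to the multilingual LASER v2
    model, as described in the reply above. Appending a language code as a
    third positional argument is an ASSUMPTION about how LASER v3 models
    are selected.
    """
    cmd = ["./embed.sh", input_file, output_file]
    if language is not None:
        cmd.append(language)
    return cmd


# LASER v2: no language argument.
print(build_embed_command("input_file", "output_file"))

# LASER v3 (assumed call pattern, illustrative language code).
print(build_embed_command("input_file", "output_file", "wol_Latn"))
```

The resulting list can be handed to subprocess.run(cmd, check=True) from the directory that contains embed.sh.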

Thank you for your answer.

My doubt comes from the fact that in Laserembeddings v1 it was also necessary to specify the language of the input. I guess that in the new version this is no longer the case, and that some sort of language detection runs in the backend?

In older versions of LASER, this was usually needed for Moses word tokenization, which is language-specific. LASER2 uses SentencePiece instead, so this is no longer needed :)
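
To make the point above concrete, here is a toy sketch of why a single shared subword vocabulary removes the need for a language argument. This is not SentencePiece's actual algorithm (SentencePiece learns its vocabulary from data and segments with a unigram language model or BPE); it is a simple greedy longest-match segmenter, and the vocabulary SHARED_VOCAB and the function names are made up for illustration. The only thing it is meant to show is that the same call handles text in different languages with no language code and no language detection.

```python
from typing import List, Set

# Hypothetical shared vocabulary covering pieces from several languages.
# "▁" marks a word boundary, following the SentencePiece convention.
SHARED_VOCAB: Set[str] = {"▁hello", "▁bon", "jour", "▁hal", "lo"}


def segment(text: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match segmentation over one shared vocabulary."""
    pieces: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches: emit the single character.
            pieces.append(text[i])
            i += 1
    return pieces


def tokenize(sentence: str) -> List[str]:
    """Tokenize a sentence; note there is no language parameter."""
    return segment("▁" + sentence.replace(" ", "▁"), SHARED_VOCAB)


for sentence in ["hello", "bonjour", "hallo"]:
    print(sentence, tokenize(sentence))
```

By contrast, a Moses-style pipeline picks language-specific tokenization rules up front, which is why the older tooling asked for the input language.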