Laser v2, set Language to compute Embeddings on

Question

celcof opened this issue a year ago

Thank you for your work.

In the README section for the embed task, I read that LASER v3 languages require the language to be selected manually, whereas for the 93 languages supported by LASER v2, running ./embed.sh input_file output_file is sufficient. How does the script know which language the input is in? Does it default to English? I looked at embed.py but could not work out which argument controls this (changing --spm-lang, for example, seems to have no effect on the output).

Kevin Heffernan · Answer 1 · Wed Jun 07 2023 03:29:24 GMT+0800 (China Standard Time)

Hi @celcof! The command: ./embed.sh input_file output_file will indeed default to the LASER v2 model. However, this model is multilingual and natively supports 93 languages. Hope this helps!

Francesco Cabras · Answer 2 · Wed Jun 07 2023 17:16:30 GMT+0800 (China Standard Time)

Thank you for your answer.

My doubt comes from the fact that with Laserembeddings v1 it was also necessary to specify the language of the input. I guess this is no longer the case in the new version, and some kind of language detection runs in the backend?

Kevin Heffernan · Answer 3 · Wed Jun 07 2023 20:06:08 GMT+0800 (China Standard Time)

In older versions of LASER, the language was usually needed for MOSES (language-specific) word tokenization. However, LASER2 uses SentencePiece, so it is no longer needed :)
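
For anyone who wants to check this themselves, here is a minimal sketch that calls the script with no language argument and then reads the result back. It rests on a few assumptions not stated in this thread: that the output is a raw float32 matrix with 1024 values per sentence, and that ./embed.sh is reachable from the working directory. The helper names run_embed and read_embeddings are made up for illustration and are not part of LASER.

```python
import array
import subprocess
from pathlib import Path

DIM = 1024  # assumed embedding size; adjust if your output differs


def run_embed(input_file, output_file, script="./embed.sh"):
    """Call the embed script as in the thread: input and output only, no language."""
    subprocess.run([script, str(input_file), str(output_file)], check=True)


def read_embeddings(path, dim=DIM):
    """Read a raw float32 file into a list of rows, one row per input sentence."""
    data = array.array("f")  # "f" = 4-byte float in native byte order
    data.frombytes(Path(path).read_bytes())
    if len(data) % dim != 0:
        raise ValueError(f"{len(data)} floats is not a multiple of dim={dim}")
    return [data[i:i + dim].tolist() for i in range(0, len(data), dim)]


# Hypothetical usage (file names are placeholders):
#   run_embed("sentences.txt", "sentences.raw")
#   rows = read_embeddings("sentences.raw")
#   print(len(rows), "sentences,", len(rows[0]), "dimensions each")
```

If the assumptions hold, the number of rows should match the number of lines in the input file, whatever language those lines are written in.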