facebookresearch / LASER

Language-Agnostic SEntence Representations

Text preprocessing before embedding

mrkcdl opened this issue

Hello there,

We are using LASER2 to embed sentences in multiple languages and compare them. Is any form of text preprocessing recommended before feeding the sentences to LASER? Is LASER trained to work with unprocessed sentences, or will raw input have a negative impact on the vectors and on our ability to compare them precisely? If it does, which preprocessing methods (removing stopwords, etc.) should we choose?

Thanks in advance.

Hi @mrkcdl! There are various preprocessing steps, such as lowercasing and punctuation normalization. However, we recommend using the provided embed script, which takes care of any necessary preprocessing steps for you.
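For readers wondering what lowercasing and punctuation normalization look like in practice, here is a minimal Python sketch using only the standard library. It is an illustration, not LASER's actual pipeline: the function name `normalize_sentence` and the character mapping `_PUNCT_MAP` are hypothetical, and the embed script may apply different or additional rules. Applying your own normalization on top of the embed script is probably unnecessary.

```python
import re
import unicodedata

# Hypothetical mapping chosen for illustration; NOT LASER's actual rule set.
_PUNCT_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"', "\u201e": '"',  # curly double quotes
    "\u2018": "'", "\u2019": "'",                 # curly single quotes
    "\u2013": "-", "\u2014": "-",                 # en and em dashes
})


def normalize_sentence(text: str) -> str:
    """Illustrative normalization: Unicode folding, punctuation mapping,
    lowercasing, and whitespace collapsing."""
    # NFKC folds compatibility characters (e.g. a no-break space becomes a space).
    text = unicodedata.normalize("NFKC", text)
    # Map typographic punctuation to plain ASCII equivalents.
    text = text.translate(_PUNCT_MAP)
    # Lowercase the whole sentence.
    text = text.lower()
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize_sentence("\u201cHello\u201d,  World\u00a0\u2013 OK!"))
    # prints: "hello", world - ok!
```

Note that this sketch does not remove stopwords: it only changes casing, punctuation, and whitespace, which are the steps mentioned in the reply.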

Closing due to inactivity (and hopefully the issue is solved). Please re-open if needed!