Text preprocessing before embedding

Question

mrkcdl opened this issue 2 years ago

Hello there,

we are using LASER 2 to embed sentences in multiple languages and compare them. I would like to ask whether any form of text preprocessing is recommended before feeding the sentences to LASER. Is LASER trained to work with unprocessed sentences, or will skipping preprocessing have a negative impact on the vectors and on how precisely we can compare them? If so, which preprocessing methods (removing stopwords, etc.) should we choose?

Thanks in advance.

Kevin Heffernan · Answer 1 · Fri Mar 03 2023 03:32:40 GMT+0800 (China Standard Time)

Hi @mrkcdl! There are various preprocessing steps, such as lowercasing and punctuation normalization. However, we recommend using the provided embed script, as it takes care of all necessary preprocessing steps for you.
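
For readers wondering what lowercasing and punctuation normalization look like in practice, here is a minimal sketch using only the Python standard library. It is an illustration, not LASER's own preprocessing: the function name `normalize_sentence` and the `PUNCT_MAP` table are hypothetical, and the embed script's actual steps may differ. The embed script is still the recommended way to prepare input.

```python
import unicodedata

# Hypothetical mapping, for illustration only; LASER's own
# punctuation normalization may use different rules.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes -> ASCII apostrophe
    "\u201c": '"', "\u201d": '"',   # curly double quotes -> ASCII quote
    "\u2013": "-", "\u2014": "-",   # en/em dashes -> hyphen
})

def normalize_sentence(text: str) -> str:
    """Normalize Unicode forms and punctuation, collapse whitespace, lowercase."""
    # NFKC folds compatibility characters (e.g. the ellipsis character becomes "...").
    text = unicodedata.normalize("NFKC", text)
    # Replace typographic punctuation with plain ASCII equivalents.
    text = text.translate(PUNCT_MAP)
    # Collapse runs of whitespace into single spaces and trim the ends.
    text = " ".join(text.split())
    return text.lower()

print(normalize_sentence("\u201cHello,  World\u201d \u2013 it\u2019s LASER\u2026"))
# prints: "hello, world" - it's laser...
```

Note that this sketch does not remove stopwords. Nothing in this thread suggests doing so.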

Kevin Heffernan · Answer 2 · Thu Mar 23 2023 02:19:28 GMT+0800 (China Standard Time)

Closing due to inactivity (and hopefully the issue is solved). Please re-open if needed!