facebookresearch / LASER

Language-Agnostic SEntence Representations

Similarity calculation takes too long with LASER3 model, compared to XLM-R/LaBSE

aloka-fernando opened this issue · comments

The CCMatrix Si-Ta language pair has 215k sentence pairs. I need to calculate the similarity score for each sentence pair using the latest LASER3 encoder, so I process them in batches of 100k and compute the similarity scores as described in the LASER3 documentation.

However, it takes 4.5 hours to calculate the similarity scores for the 215k sentence pairs on a machine with 16 CPUs and 64GB RAM. The GPU has only 24GB of memory, so the similarity computation is executed on the CPUs.

By contrast, with XLM-R and LaBSE I can score 2M sentence pairs in 6 hours.

Could you explain why the LASER3 calculation takes so much longer than XLM-R and LaBSE, and what measures I can take to reduce the LASER3 calculation time? My next step is to run the same code to score En-Si and En-Ta, which are close to 4M pairs each.

Highly appreciate your response.


# Imports: LaserEncoderPipeline from the laser_encoders package,
# cosine-similarity helper from sentence_transformers
from laser_encoders import LaserEncoderPipeline
from sentence_transformers import util

src_lang = "sin_Sinh"
tgt_lang = "tam_Taml"

# Initialize one LASER encoder pipeline per language
src_encoder = LaserEncoderPipeline(lang=src_lang)
tgt_encoder = LaserEncoderPipeline(lang=tgt_lang)


def get_laser3_scores(src_sents, tgt_sents):
    # Encode both sides and L2-normalize the embeddings
    src_embeddings = src_encoder.encode_sentences(src_sents, normalize_embeddings=True)
    tgt_embeddings = tgt_encoder.encode_sentences(tgt_sents, normalize_embeddings=True)
    # cos_sim builds the full N x N similarity matrix; only its diagonal
    # (the aligned pairs) is kept
    scores_matrix = util.cos_sim(src_embeddings, tgt_embeddings)
    return [scores_matrix[i][i].item() for i in range(len(scores_matrix))]

One reason for the slow computation is that for Sinhala and Tamil the LASER2 encoder is used, which is an LSTM model, whereas XLM-R and LaBSE are transformers, which benefit from better parallelization.

The GPU has only 24GB memory, therefore the similarity is executed on the CPUs.

LaserEncoderPipeline was tested on machines with 16GB of GPU memory, and with reasonably sized sentences it worked well.
Thus, I suggest you try porting your code to the GPU; it should work fine and will give a huge speed boost.
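For reference, a minimal sketch of what that could look like, assuming CUDA is visible to PyTorch and the wrapped SentenceEncoder picks up the GPU automatically when it is available. The chunk size is a placeholder to be tuned against the 24GB card, and encoding in bounded chunks keeps memory under control:

CHUNK = 10_000  # placeholder; tune against the available GPU memory

def get_laser3_scores_chunked(src_sents, tgt_sents, chunk=CHUNK):
    scores = []
    for start in range(0, len(src_sents), chunk):
        src_emb = src_encoder.encode_sentences(
            src_sents[start:start + chunk], normalize_embeddings=True)
        tgt_emb = tgt_encoder.encode_sentences(
            tgt_sents[start:start + chunk], normalize_embeddings=True)
        # With normalized embeddings, the cosine similarity of each aligned
        # pair is just the row-wise dot product, so there is no need to
        # build a full chunk-by-chunk similarity matrix.
        scores.extend((src_emb * tgt_emb).sum(axis=1).tolist())
    return scores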

In addition, please consider adjusting the max_sentences or max_tokens properties of the wrapped models (e.g. src_encoder.encoder.max_sentences). If you use a GPU, increasing them to the largest values that still do not cause out-of-memory errors may give you another speedup.
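For example (attribute names taken from the suggestion above; the values are placeholders to be raised until just below the out-of-memory point):

# Placeholder values; increase until just before out-of-memory errors appear
src_encoder.encoder.max_sentences = 256
src_encoder.encoder.max_tokens = 24_000
tgt_encoder.encoder.max_sentences = 256
tgt_encoder.encoder.max_tokens = 24_000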

If you want to speed up the computation even further, you might need to distill the LASER encoder into a smaller and more efficient model. But the distillation itself is going to take a long time.
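For completeness, here is a very rough sketch of what such a distillation could look like using sentence-transformers' MSE-based recipe. The student checkpoint, the projection to LASER's 1024-dimensional embedding space, and the training sentences are all placeholders, and a real distillation would need far more data and training:

# Rough distillation sketch: train a small student to reproduce LASER embeddings
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

word = models.Transformer("xlm-roberta-base")  # placeholder student backbone
pool = models.Pooling(word.get_word_embedding_dimension())
proj = models.Dense(pool.get_sentence_embedding_dimension(), 1024)  # match LASER's 1024-dim output
student = SentenceTransformer(modules=[word, pool, proj])

train_sents = ["..."]  # placeholder: sentences in the languages of interest
teacher_emb = src_encoder.encode_sentences(train_sents)  # LASER teacher embeddings

# MSE between student and teacher embeddings, as in the multilingual distillation recipe
examples = [InputExample(texts=[s], label=e) for s, e in zip(train_sents, teacher_emb)]
loader = DataLoader(examples, batch_size=64, shuffle=True)
student.fit(train_objectives=[(loader, losses.MSELoss(model=student))], epochs=1)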