facebookresearch / LASER

Language-Agnostic SEntence Representations

embed models

guangyuli-uoe opened this issue

Hi,

If I want to embed text in both Chinese and English, which model should I download?

It says that 'LASER2 and all LASER3 encoders are downloaded by default'. Where can I find them?

Hi @guangyuli-uoe, in order to embed both Chinese and English texts, you could use the laser2.pt model. Can you try the following:

  1. Download the LASER2 model. The command below also downloads the Wolof model, which you can disregard for now: ./LASER/nllb/download_models.sh wol_Latn
  2. Open LASER/tasks/embed/embed.sh and set the model directory (model_dir) to the directory where you ran the download script above (i.e. wherever the laser2 model was downloaded to).
  3. Embed both Chinese and English texts using the following: ./embed.sh [infile] [outfile]
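
Before running step 3, it may help to confirm that the model directory really contains the files embed.sh will look for. The sketch below is a hypothetical helper, not part of LASER: the function name `missing_model_files` is invented here, and the three file names are simply the ones that appear in the embed.sh log output later in this thread.

```python
from pathlib import Path

# File names as they appear in the embed.sh log output in this thread.
EXPECTED_FILES = ("laser2.pt", "laser2.spm", "laser2.cvocab")


def missing_model_files(model_dir):
    """Return the expected LASER2 files that are absent from model_dir."""
    root = Path(model_dir).expanduser()
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python check_models.py /path/to/LASER/nllb
    model_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_model_files(model_dir)
    if missing:
        print("missing from", model_dir, ":", ", ".join(missing))
    else:
        print("all LASER2 files found in", model_dir)
```

If the list comes back empty, model_dir in embed.sh can point at that directory.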

Hi @heffernankevin,

Thank you very much for your kind replies and suggestions!

However, I ran into a segmentation fault in both the bucc and embed tasks. I think this is the main error:

(laser2) liguangyu@liguangyudeMacBook-Pro embed % ./embed.sh './1/doc.zh.txt' './emd/doc.zh.emd'
2022-07-19 22:01:16,857 | INFO | embed | spm_model: /Users/liguangyu/LASER/nllb//laser2.spm
2022-07-19 22:01:16,857 | INFO | embed | spm_cvocab: /Users/liguangyu/LASER/nllb/laser2.cvocab
2022-07-19 22:01:16,857 | INFO | embed | loading encoder: /Users/liguangyu/LASER/nllb//laser2.pt
./embed.sh: line 80: 81173 Segmentation fault: 11 python3 ${LASER}/source/embed.py --input ${infile} --encoder ${model_file} --spm-model $spm --output ${outfile} --verbose

Processing BUCC data in .
 - extract from tar bucc2018-fr-en.sample-gold.tar.bz2
 - extract from tar bucc2018-fr-en.test.tar.bz2
 - extract from tar bucc2018-fr-en.training-gold.tar.bz2
 - extract files ./embed/bucc2018.fr-en.dev in en
 - extract files ./embed/bucc2018.fr-en.dev in fr
 - extract files ./embed/bucc2018.fr-en.train in en
 - extract files ./embed/bucc2018.fr-en.train in fr
 - extract files ./embed/bucc2018.fr-en.test in en
 - extract files ./embed/bucc2018.fr-en.test in fr
2022-07-19 21:54:32,562 | INFO | embed | loading encoder: /Users/liguangyu/LASER/models/bilstm.93langs.2018-12-26.pt
./bucc.sh: line 82: 81096 Broken pipe: 13 cat ${txt}
81097 Segmentation fault: 11 | python3 ${LASER}/source/embed.py --encoder ${encoder} --token-lang ${ll} --bpe-codes ${bpe_codes} --output ${enc} --verbose
2022-07-19 21:54:34,664 | INFO | embed | loading encoder: /Users/liguangyu/LASER/models/bilstm.93langs.2018-12-26.pt
./bucc.sh: line 82: 81100 Broken pipe: 13 cat ${txt}
81101 Segmentation fault: 11 | python3 ${LASER}/source/embed.py --encoder ${encoder} --token-lang ${ll} --bpe-codes ${bpe_codes} --output ${enc} --verbose
LASER: tool to search, score or mine bitexts
 - knn will run on CPU (slow)
 - loading texts ./embed/bucc2018.fr-en.train.txt.fr: 271874 lines, 270775 unique
 - loading texts ./embed/bucc2018.fr-en.train.txt.en: 369810 lines, 368033 unique
Traceback (most recent call last):
  File "/Users/liguangyu/LASER/source/mine_bitexts.py", line 215, in <module>
    x = EmbedLoad(args.src_embeddings, args.dim, verbose=args.verbose)
  File "/Users/liguangyu/LASER/source/embed.py", line 451, in EmbedLoad
    x = np.fromfile(fname, dtype=np.float32, count=-1)
FileNotFoundError: [Errno 2] No such file or directory: './embed/bucc2018.fr-en.train.enc.fr'

Hi @guangyuli-uoe, there seems to be a memory-related issue. Can you try the following command? It might help pinpoint the cause:

python $LASER/source/embed.py --input './1/doc.zh.txt' --output './emd/doc.zh.emd' --encoder /Users/liguangyu/LASER/nllb/laser2.pt --spm-model /Users/liguangyu/LASER/nllb/laser2.spm --verbose
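
When a command like this is launched from a script, a crash can also be detected programmatically. The following is a minimal sketch, assuming a POSIX system (Linux or macOS), where Python's `subprocess` reports a child killed by a signal as a negative return code. The helper name `run_and_report` is invented for illustration, and the demo child is a stand-in that crashes on purpose rather than the real embed.py call.

```python
import signal
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and describe how it ended, naming the signal if it was killed."""
    proc = subprocess.run(cmd)
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        try:
            name = signal.Signals(-proc.returncode).name
        except ValueError:
            name = f"signal {-proc.returncode}"
        return f"killed by {name}"
    return f"exited with code {proc.returncode}"


if __name__ == "__main__":
    # Stand-in child that sends itself SIGSEGV; swap in the embed.py command
    # (as a list of arguments) to check the real run.
    crash = [
        sys.executable,
        "-c",
        "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)",
    ]
    print(run_and_report(crash))
```

Starting the child with `python -X faulthandler ...` additionally asks the interpreter to dump a Python traceback when it receives SIGSEGV, which can show which call was active when the crash happened.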

Hi @heffernankevin,

Thanks for your reply. Here are the details:

2022-07-19 22:50:51,584 | INFO | embed | spm_model: /Users/liguangyu/LASER/nllb/laser2.spm
2022-07-19 22:50:51,584 | INFO | embed | spm_cvocab: /Users/liguangyu/LASER/nllb/laser2.cvocab
2022-07-19 22:50:51,584 | INFO | embed | loading encoder: /Users/liguangyu/LASER/nllb/laser2.pt
zsh: segmentation fault python /Users/liguangyu/LASER/source/embed.py --input './1/doc.zh.txt'

@guangyuli-uoe thanks for checking! This could be related to PyTorch. Which version are you currently running (pip show torch)? Can you try upgrading PyTorch and re-running? There were similar issues on other repos which seem to be tied to specific PyTorch versions.
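
The version can also be read from inside the same Python environment that runs embed.py, which rules out pip pointing at a different interpreter. This is a small sketch using the standard-library `importlib.metadata` module (Python 3.8 or later); the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("torch")
    if version is None:
        print("torch is not installed in this environment")
    else:
        print("torch version:", version)
```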

Hi @heffernankevin,

This is the current version:

Name: torch
Version: 1.12.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /Users/liguangyu/opt/anaconda3/envs/laser2/lib/python3.8/site-packages
Requires: typing-extensions
Required-by: fairseq, sentence-transformers, torchaudio, torchvision

Closing this issue, as the user reported no more segmentation faults and was able to run the embed script successfully after upgrading PyTorch.

Hi @heffernankevin,

I just want to confirm: can the wol_Latn model handle both Chinese and English?

The "laser2" model can handle both Chinese and English, and it is already downloaded in your model directory (/Users/liguangyu/LASER/nllb/laser2.pt). The embed.sh script uses it by default, e.g. embed.sh [infile] [outfile].
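
Once embed.sh has produced an output file for each language, the vectors can be read back and compared. The traceback earlier in this thread shows LASER's own loader reading the file as raw float32 values; the sketch below does the same with only the standard library. It rests on several assumptions: the embedding size of 1024 is assumed rather than confirmed in this thread, the two files are assumed to hold translations in the same line order, and the helper names are invented for illustration.

```python
import math
import os
import sys
from array import array

DIM = 1024  # assumed embedding size; adjust if your encoder differs


def load_embeddings(path, dim=DIM):
    """Read a raw float32 file and split it into one vector per sentence."""
    flat = array("f")  # "f" is a 4-byte C float, matching float32
    with open(path, "rb") as fh:
        flat.fromfile(fh, os.path.getsize(path) // flat.itemsize)
    if len(flat) % dim != 0:
        raise ValueError(f"{path}: {len(flat)} values is not a multiple of {dim}")
    return [flat[i:i + dim] for i in range(0, len(flat), dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python compare.py ./emd/doc.zh.emd ./emd/doc.en.emd
    zh = load_embeddings(sys.argv[1])
    en = load_embeddings(sys.argv[2])
    for i, (a, b) in enumerate(zip(zh, en)):
        print(i, round(cosine(a, b), 4))
```

Because both languages go through the same laser2 encoder, a Chinese sentence and its English translation would be expected to score higher than unrelated pairs.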