embed models

Question

guangyuli-uoe opened this issue 2 years ago · comments

guangyuli-uoe commented 2 years ago

Hi,

If I want to embed text in both Chinese and English, which model should I download?

guangyuli-uoe · Answer 1 · Tue Jul 19 2022 23:36:53 GMT+0800 (China Standard Time)

It says that 'LASER2 and all LASER3 encoders are downloaded by default'.

Where can I find them?

Kevin Heffernan · Answer 2 · Wed Jul 20 2022 00:27:21 GMT+0800 (China Standard Time)

Hi @guangyuli-uoe, in order to embed both Chinese and English texts, you could use the laser2.pt model. Can you try the following:

1. Download a LASER2 model (this command also downloads the Wolof model, which you can disregard for now): ./LASER/nllb/download_models.sh wol_Latn
2. Go to LASER/tasks/embed/embed.sh and set the model directory (model_dir) to the location where you ran the download script above (i.e. wherever the laser2 model was downloaded to).
3. Embed both Chinese and English texts using the following: ./embed.sh [infile] [outfile]
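
Before running embed.sh, it can help to confirm that the download step left the LASER2 files in the directory that model_dir points to. The following is a minimal sketch, assuming the three file names that show up in the embed.sh log further down this thread (laser2.pt, laser2.spm, laser2.cvocab). The helper name missing_laser2_files is hypothetical and not part of LASER.

```python
from pathlib import Path

# File names taken from the embed.sh log in this thread; treat them as an assumption.
LASER2_FILES = ("laser2.pt", "laser2.spm", "laser2.cvocab")


def missing_laser2_files(model_dir):
    """Return the expected LASER2 files that are absent from model_dir."""
    root = Path(model_dir).expanduser()
    return [name for name in LASER2_FILES if not (root / name).is_file()]
```

Calling missing_laser2_files with the directory used for model_dir returns an empty list when all three files are present, and otherwise names the ones that still need to be downloaded.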

guangyuli-uoe · Answer 3 · Wed Jul 20 2022 05:04:04 GMT+0800 (China Standard Time)

Hi @heffernankevin,

Thank you very much for your kind replies and suggestions!

However, I hit a segmentation fault in both the bucc and embed tasks, which I think is the main error:

(laser2) liguangyu@liguangyudeMacBook-Pro embed % ./embed.sh './1/doc.zh.txt' './emd/doc.zh.emd'
2022-07-19 22:01:16,857 | INFO | embed | spm_model: /Users/liguangyu/LASER/nllb//laser2.spm
2022-07-19 22:01:16,857 | INFO | embed | spm_cvocab: /Users/liguangyu/LASER/nllb/laser2.cvocab
2022-07-19 22:01:16,857 | INFO | embed | loading encoder: /Users/liguangyu/LASER/nllb//laser2.pt
./embed.sh: line 80: 81173 Segmentation fault: 11 python3 ${LASER}/source/embed.py --input ${infile} --encoder ${model_file} --spm-model $spm --output ${outfile} --verbose

Processing BUCC data in .

extract from tar bucc2018-fr-en.sample-gold.tar.bz2
extract from tar bucc2018-fr-en.test.tar.bz2
extract from tar bucc2018-fr-en.training-gold.tar.bz2
extract files ./embed/bucc2018.fr-en.dev in en
extract files ./embed/bucc2018.fr-en.dev in fr
extract files ./embed/bucc2018.fr-en.train in en
extract files ./embed/bucc2018.fr-en.train in fr
extract files ./embed/bucc2018.fr-en.test in en
extract files ./embed/bucc2018.fr-en.test in fr
2022-07-19 21:54:32,562 | INFO | embed | loading encoder: /Users/liguangyu/LASER/models/bilstm.93langs.2018-12-26.pt
./bucc.sh: line 82: 81096 Broken pipe: 13 cat ${txt}
81097 Segmentation fault: 11 | python3 ${LASER}/source/embed.py --encoder ${encoder} --token-lang ${ll} --bpe-codes ${bpe_codes} --output ${enc} --verbose
2022-07-19 21:54:34,664 | INFO | embed | loading encoder: /Users/liguangyu/LASER/models/bilstm.93langs.2018-12-26.pt
./bucc.sh: line 82: 81100 Broken pipe: 13 cat ${txt}
81101 Segmentation fault: 11 | python3 ${LASER}/source/embed.py --encoder ${encoder} --token-lang ${ll} --bpe-codes ${bpe_codes} --output ${enc} --verbose
LASER: tool to search, score or mine bitexts
knn will run on CPU (slow)
loading texts ./embed/bucc2018.fr-en.train.txt.fr: 271874 lines, 270775 unique
loading texts ./embed/bucc2018.fr-en.train.txt.en: 369810 lines, 368033 unique
Traceback (most recent call last):
  File "/Users/liguangyu/LASER/source/mine_bitexts.py", line 215, in <module>
    x = EmbedLoad(args.src_embeddings, args.dim, verbose=args.verbose)
  File "/Users/liguangyu/LASER/source/embed.py", line 451, in EmbedLoad
    x = np.fromfile(fname, dtype=np.float32, count=-1)
FileNotFoundError: [Errno 2] No such file or directory: './embed/bucc2018.fr-en.train.enc.fr'

Kevin Heffernan · Answer 4 · Wed Jul 20 2022 05:15:29 GMT+0800 (China Standard Time)

Hi @guangyuli-uoe, there seems to be a memory-related issue. Can you try the following command as it might help pinpoint the cause.

python $LASER/source/embed.py --input './1/doc.zh.txt' --output './emd/doc.zh.emd' --encoder  /Users/liguangyu/LASER/nllb/laser2.pt --spm-model /Users/liguangyu/LASER/nllb/laser2.spm --verbose
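
A segmentation fault kills the interpreter without a Python traceback, so the log alone does not show which call crashed. One option is the standard-library faulthandler module, which dumps the Python stack when the process receives a fatal signal such as SIGSEGV. The sketch below is not part of LASER: the helper name and the log path are hypothetical, and it would need to be called at the top of the script being debugged.

```python
import faulthandler


def enable_crash_traces(log_path):
    """Dump the Python stacks of all threads to log_path on a fatal signal."""
    # The file object must stay open for as long as faulthandler is enabled,
    # so it is returned to the caller instead of being closed here.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

The same handler can be turned on without editing any code by running python3 -X faulthandler in place of python3, in which case the trace goes to stderr.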

guangyuli-uoe · Answer 5 · Wed Jul 20 2022 05:53:22 GMT+0800 (China Standard Time)

Hi @heffernankevin,

Thanks for your reply. Here are the details:

Kevin Heffernan · Answer 6 · Wed Jul 20 2022 06:12:45 GMT+0800 (China Standard Time)

@guangyuli-uoe thanks for checking! This could be related to PyTorch. Can you try upgrading PyTorch and re-running? Which version are you currently running (pip show torch)? There were similar issues on other repos that seem to be tied to specific PyTorch versions.
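
When a crash may depend on package versions, it can be useful to collect them from inside the same environment that runs embed.py. The following is a minimal sketch using the standard-library importlib.metadata (Python 3.8+). The helper names and the default package list are illustrative assumptions, not something LASER provides.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_versions(packages=("torch", "fairseq", "numpy")):
    """Map each package name to its installed version (None when missing)."""
    return {name: installed_version(name) for name in packages}
```

Pasting the result of report_versions() into an issue gives the same information as running pip show for each package, in one step.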

guangyuli-uoe · Answer 7 · Wed Jul 20 2022 06:19:07 GMT+0800 (China Standard Time)

Hi @heffernankevin,

This is the current version:

Name: torch
Version: 1.12.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /Users/liguangyu/opt/anaconda3/envs/laser2/lib/python3.8/site-packages
Requires: typing-extensions
Required-by: fairseq, sentence-transformers, torchaudio, torchvision

Kevin Heffernan · Answer 8 · Wed Jul 20 2022 23:06:34 GMT+0800 (China Standard Time)

Closing the issue, as the user reported no more segmentation faults and was able to run the embed script successfully after upgrading the PyTorch version (see comment here).

guangyuli-uoe · Answer 9 · Wed Jul 20 2022 23:10:03 GMT+0800 (China Standard Time)

Hi @heffernankevin,

I just want to confirm: can the wol_Latn model handle both Chinese and English?

Kevin Heffernan · Answer 10 · Wed Jul 20 2022 23:12:46 GMT+0800 (China Standard Time)

The "laser2" model can handle both Chinese and English, and it is already downloaded in your model directory (/Users/liguangyu/LASER/nllb/laser2.pt). The embed.sh script uses it by default, e.g. embed.sh [infile] [outfile].
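
Once embed.sh has written an output file for each language, the vectors can be compared directly, because one encoder embeds both Chinese and English. The traceback earlier in this thread shows LASER reading embeddings with np.fromfile and dtype float32, so the sketch below assumes the output is a flat file of raw float32 values. The dimension of 1024 is an assumption about the laser2 encoder, the helper names are hypothetical, and only the standard library is used to keep the example dependency-free.

```python
import math
from array import array


def load_embeddings(path, dim=1024):
    """Read raw float32 values from path and split them into vectors of length dim."""
    values = array("f")  # "f" is a 4-byte C float in native byte order
    with open(path, "rb") as f:
        values.frombytes(f.read())
    if len(values) % dim != 0:
        raise ValueError(f"{len(values)} values is not a multiple of dim={dim}")
    return [values[i:i + dim] for i in range(0, len(values), dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

For example, loading './emd/doc.zh.emd' and a corresponding English output file, then calling cosine on row i of each, gives a similarity score for the i-th sentence pair, provided the two input files are aligned line by line.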