embed models
guangyuli-uoe opened this issue · comments
Hi,
If I want to embed text in both Chinese and English, which model should I download?
The documentation says that 'LASER2 and all LASER3 encoders are downloaded by default'. Where can I find them?
Hi @guangyuli-uoe, in order to embed both Chinese and English texts, you could use the laser2.pt model. Can you try the following:
- Download a LASER2 model (and in this instance Wolof, but you can disregard that model for now):
./LASER/nllb/download_models.sh wol_Latn
- Go to LASER/tasks/embed/embed.sh and set the model directory (model_dir) within embed.sh to the location where you ran the download script above (i.e. wherever the laser2 model was downloaded to).
- Embed both Chinese and English texts using the following:
./embed.sh [infile] [outfile]
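Before running embed.sh, it may help to confirm that the directory assigned to model_dir actually contains the LASER2 files. The sketch below is not part of LASER: the helper name `missing_laser2_files` is hypothetical, and the three file names are the ones embed.py reports in the log output later in this thread (laser2.pt, laser2.spm, laser2.cvocab).

```python
from pathlib import Path

# File names as they appear in the embed.py log output in this thread.
LASER2_FILES = ("laser2.pt", "laser2.spm", "laser2.cvocab")


def missing_laser2_files(model_dir):
    """Return the LASER2 file names absent from model_dir (hypothetical helper)."""
    root = Path(model_dir).expanduser()
    return [name for name in LASER2_FILES if not (root / name).is_file()]
```

For example, `missing_laser2_files("~/LASER/nllb")` would return an empty list if all three files are present in that directory, and the names of any missing files otherwise.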
Hi,
Thank you very much for your kind replies and suggestions! ^^
However, I ran into a segmentation fault in both the bucc and embed tasks; I think this is the main error:
(laser2) liguangyu@liguangyudeMacBook-Pro embed % ./embed.sh './1/doc.zh.txt' './emd/doc.zh.emd'
2022-07-19 22:01:16,857 | INFO | embed | spm_model: /Users/liguangyu/LASER/nllb//laser2.spm
2022-07-19 22:01:16,857 | INFO | embed | spm_cvocab: /Users/liguangyu/LASER/nllb/laser2.cvocab
2022-07-19 22:01:16,857 | INFO | embed | loading encoder: /Users/liguangyu/LASER/nllb//laser2.pt
./embed.sh: line 80: 81173 Segmentation fault: 11 python3 ${LASER}/source/embed.py --input ${infile} --encoder ${model_file} --spm-model
Processing BUCC data in .
- extract from tar bucc2018-fr-en.sample-gold.tar.bz2
- extract from tar bucc2018-fr-en.test.tar.bz2
- extract from tar bucc2018-fr-en.training-gold.tar.bz2
- extract files ./embed/bucc2018.fr-en.dev in en
- extract files ./embed/bucc2018.fr-en.dev in fr
- extract files ./embed/bucc2018.fr-en.train in en
- extract files ./embed/bucc2018.fr-en.train in fr
- extract files ./embed/bucc2018.fr-en.test in en
- extract files ./embed/bucc2018.fr-en.test in fr
2022-07-19 21:54:32,562 | INFO | embed | loading encoder: /Users/liguangyu/LASER/models/bilstm.93langs.2018-12-26.pt
./bucc.sh: line 82: 81096 Broken pipe: 13 cat ${txt}
81097 Segmentation fault: 11 | python3 ${LASER}/source/embed.py --encoder ${encoder} --token-lang ${ll} --bpe-codes ${bpe_codes} --output ${enc} --verbose
2022-07-19 21:54:34,664 | INFO | embed | loading encoder: /Users/liguangyu/LASER/models/bilstm.93langs.2018-12-26.pt
./bucc.sh: line 82: 81100 Broken pipe: 13 cat ${txt}
81101 Segmentation fault: 11 | python3 ${LASER}/source/embed.py --encoder ${encoder} --token-lang ${ll} --bpe-codes ${bpe_codes} --output ${enc} --verbose
LASER: tool to search, score or mine bitexts - knn will run on CPU (slow)
- loading texts ./embed/bucc2018.fr-en.train.txt.fr: 271874 lines, 270775 unique
- loading texts ./embed/bucc2018.fr-en.train.txt.en: 369810 lines, 368033 unique
Traceback (most recent call last):
  File "/Users/liguangyu/LASER/source/mine_bitexts.py", line 215, in <module>
    x = EmbedLoad(args.src_embeddings, args.dim, verbose=args.verbose)
  File "/Users/liguangyu/LASER/source/embed.py", line 451, in EmbedLoad
    x = np.fromfile(fname, dtype=np.float32, count=-1)
FileNotFoundError: [Errno 2] No such file or directory: './embed/bucc2018.fr-en.train.enc.fr'
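The final FileNotFoundError is probably a consequence of the earlier segmentation faults rather than a separate problem: if embed.py crashes, the .enc files that mine_bitexts.py tries to load are never written. A minimal sketch of a pre-flight check follows; the helper name `missing_or_empty` is hypothetical, the first path is copied from the traceback above, and the English counterpart is only an assumption based on the same naming pattern.

```python
import os


def missing_or_empty(paths):
    """Return the paths that do not exist or have zero size (hypothetical helper)."""
    return [p for p in paths if not os.path.isfile(p) or os.path.getsize(p) == 0]


expected = [
    "./embed/bucc2018.fr-en.train.enc.fr",  # path from the traceback
    "./embed/bucc2018.fr-en.train.enc.en",  # assumed counterpart, same pattern
]
for path in missing_or_empty(expected):
    print(f"embedding not written: {path}")
```

If this reports any file, the embedding step is the one to debug; the mining step cannot succeed until those files exist.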
Hi @guangyuli-uoe, there seems to be a memory-related issue. Can you try the following command? It might help pinpoint the cause:
python $LASER/source/embed.py --input './1/doc.zh.txt' --output './emd/doc.zh.emd' --encoder /Users/liguangyu/LASER/nllb/laser2.pt --spm-model /Users/liguangyu/LASER/nllb/laser2.spm --verbose
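Python's standard-library faulthandler module may also help narrow this down: when enabled, it prints the Python-level traceback at the moment of a segmentation fault, which could show whether the crash happens while the encoder is being loaded. This is a general Python facility, not something LASER-specific. A minimal sketch:

```python
import faulthandler

# Dump the Python traceback to stderr if the process receives a fatal
# signal such as SIGSEGV.
faulthandler.enable()

print("faulthandler enabled:", faulthandler.is_enabled())
```

The same effect is available without editing any code by starting the interpreter with `python -X faulthandler` followed by the script and its arguments, or by setting the environment variable PYTHONFAULTHANDLER=1.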
thanks for your reply
here are the details:
2022-07-19 22:50:51,584 | INFO | embed | spm_model: /Users/liguangyu/LASER/nllb/laser2.spm
2022-07-19 22:50:51,584 | INFO | embed | spm_cvocab: /Users/liguangyu/LASER/nllb/laser2.cvocab
2022-07-19 22:50:51,584 | INFO | embed | loading encoder: /Users/liguangyu/LASER/nllb/laser2.pt
zsh: segmentation fault python /Users/liguangyu/LASER/source/embed.py --input './1/doc.zh.txt'
@guangyuli-uoe thanks for checking! This could be related to pytorch. Can you try upgrading pytorch and re-running? Which version of pytorch are you currently running (pip show torch)? There were similar issues on other repos which seem to be related to specific pytorch versions.
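The installed version can also be read from within Python without importing torch itself, which is convenient if the import is what triggers the crash. A minimal sketch using the standard-library importlib.metadata module (available from Python 3.8); the helper name `installed_version` is hypothetical:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("torch:", installed_version("torch"))
```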
hi,
@heffernankevin
this is the current version:
Name: torch
Version: 1.12.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /Users/liguangyu/opt/anaconda3/envs/laser2/lib/python3.8/site-packages
Requires: typing-extensions
Required-by: fairseq, sentence-transformers, torchaudio, torchvision
Closing issue as user reported no more segmentation faults and was able to run embed script successfully after upgrading the pytorch version (see comment here).
Hi @heffernankevin,
I just want to make sure: can the wol_Latn model handle both Chinese and English?
The "laser2" model can handle both Chinese and English (it is already downloaded in your model directory: /Users/liguangyu/LASER/nllb/laser2.pt). It will be used by default by the embed.sh script, e.g. embed.sh [infile] [outfile].
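Once embed.sh finishes, a quick sanity check is to compare the size of the output file with the number of input lines. The sketch below assumes that the output is a raw float32 array (consistent with the np.fromfile call in the traceback earlier in this thread) and that each sentence embedding has 1024 dimensions. The dimension and both helper names are assumptions, not taken from this thread.

```python
import os

FLOAT32_BYTES = 4


def embedding_rows(path, dim=1024):
    """Number of dim-sized float32 vectors stored in a raw binary file.

    dim=1024 is an assumption; pass the dimension your encoder produces.
    """
    size = os.path.getsize(path)
    row_bytes = FLOAT32_BYTES * dim
    if size % row_bytes != 0:
        raise ValueError(f"{path}: {size} bytes is not a multiple of {row_bytes}")
    return size // row_bytes


def count_lines(path):
    """Count the lines in a text file."""
    with open(path, "rb") as fh:
        return sum(1 for _ in fh)
```

If every input line received one embedding, `embedding_rows('./emd/doc.zh.emd')` would be expected to equal `count_lines('./1/doc.zh.txt')`.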