facebookresearch / LASER

Language-Agnostic SEntence Representations

Embedding task stops at the pre-processing phase.

Vietdung113 opened this issue

I'm currently using LASER to embed my documents.
My command is:

./embed.sh ~/source/samsung/vecalign/bleualign_data/overlaps.vi ~/source/samsung/vecalign/bleualign_data/overlaps.vi.emb [vi] 

But I got an error:
[screenshot of the error]

Has anyone else faced this problem? How can I fix it?

I've also tried this command:

python embed.py --input ~/source/samsung/vecalign/bleualign_data/overlaps.vi --output ~/source/samsung/vecalign/bleualign_data/overlaps.vi.emb --encoder ../models/laser2.pt --spm-model ../models/laser2.spm --verbose

And I still got this error.
The format of my input file is:
[screenshot of the input file]

Hi @Vietdung113, as the error seems to be coming up during the spm encode step, can you try running the following command (from the same directory where you ran embed.py) to pinpoint the error:

cat ~/source/samsung/vecalign/bleualign_data/overlaps.vi | ../tools-external/sentencepiece-master/build/src/spm_encode --output_format=piece --model ../models/laser2.spm
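
If that command also fails, it may help to check whether a few unusually long lines are the culprit. A minimal sketch for listing the longest lines (assuming the same input path as above):

# Print the length and line number of the ten longest lines in the input.
awk '{ print length($0), NR }' ~/source/samsung/vecalign/bleualign_data/overlaps.vi | sort -rn | head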

Hi @heffernankevin,
I tried the command, but I still got this error:
[screenshot of the error]
Furthermore, it seems to work if I run it on a GPU instead of the default CPU.

@Vietdung113 I wonder if this is a memory-related issue. From your example input file, it looks like you're concatenating sentences together. As the sentence length grows, perhaps this quickly runs out of memory on the non-GPU machine you're using? Can you try either filtering out very long sentences or splitting the input file into multiple shards? For example:
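
A minimal sketch of both workarounds, run from the directory containing the input file (the 1000-character cutoff and the 100k-line shard size are arbitrary illustrative values, not anything LASER requires):

# Keep only lines of at most 1000 characters, to limit per-sentence memory use.
awk 'length($0) <= 1000' overlaps.vi > overlaps.filtered.vi

# Or split the input into 100k-line shards (overlaps.shard_aa, overlaps.shard_ab, ...),
# then embed each shard separately and join the outputs.
split -l 100000 overlaps.vi overlaps.shard_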

If the issue persists, I'd recommend getting in touch with the sentencepiece maintainers, since they would be best placed to help solve this problem.

Closing due to inactivity (and hopefully the issue has been solved with the sentencepiece maintainers). Please re-open if it persists.