facebookresearch / LASER

Language-Agnostic SEntence Representations

Embedding task stops at the pre-processing phase.

Vietdung113 opened this issue

I'm currently using LASER to embed my documents.
My command is:

./embed.sh ~/source/samsung/vecalign/bleualign_data/overlaps.vi ~/source/samsung/vecalign/bleualign_data/overlaps.vi.emb [vi] 

But I got an error:
[screenshot of the error]

Has anyone else faced this problem? How can I fix it?

I've also tried this command:

python embed.py --input ~/source/samsung/vecalign/bleualign_data/overlaps.vi --output ~/source/samsung/vecalign/bleualign_data/overlaps.vi.emb --encoder ../models/laser2.pt --spm-model ../models/laser2.spm --verbose

And I still got this error.
The format of my input file is:
[screenshot of the input file]

Hi @Vietdung113, as the error seems to be coming up during the spm encode step, can you try running the following command (from the same directory where you ran embed.py) to pinpoint the error:

cat ~/source/samsung/vecalign/bleualign_data/overlaps.vi | ../tools-external/sentencepiece-master/build/src/spm_encode --output_format=piece --model ../models/laser2.spm
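
If that command also fails, it may help to check whether a few unusually long lines are the culprit. A minimal sketch for listing the longest lines (assuming the same input path as above):

# Print the length and line number of the ten longest lines in the input.
awk '{ print length($0), NR }' ~/source/samsung/vecalign/bleualign_data/overlaps.vi | sort -rn | head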

Hi @heffernankevin,
I tried the command, but I still got this error:
[screenshot of the error]
Furthermore, it seems to work if I run it on a GPU instead of the default CPU.

@Vietdung113 I wonder if this is a memory-related issue. From your example input file, it looks like you're concatenating sentences together. As the sentence length grows, perhaps this quickly runs out of memory on the non-GPU machine you're using? Can you try either filtering out very long sentences or splitting the input file into multiple shards? For example:
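
A minimal sketch of both workarounds, run from the directory containing the input file (the 1000-character cutoff and the 100k-line shard size are arbitrary illustrative values, not anything LASER requires):

# Keep only lines of at most 1000 characters, to limit per-sentence memory use.
awk 'length($0) <= 1000' overlaps.vi > overlaps.filtered.vi

# Or split the input into 100k-line shards (overlaps.shard_aa, overlaps.shard_ab, ...),
# then embed each shard separately and join the outputs.
split -l 100000 overlaps.vi overlaps.shard_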

If the issue persists, I'd recommend getting in touch with the sentencepiece maintainers, since they would be best placed to help solve this problem.

Closing due to inactivity (and hopefully the issue has been solved with the sentencepiece maintainers). Please re-open if it persists.