moses-smt / mgiza

A word alignment tool based on the famous GIZA++, extended to support multi-threading, resumed training, and incremental training.

Problem with building probabilistic dictionaries

Syrkovski opened this issue

Hello, I tried to build probabilistic dictionaries (I need them to train a Bicleaner model), but the result looks like this:

afterwards NULL 0.0000124
pension NULL 0.0000372
truss NULL 0.0000124
birthday NULL 0.0000744
commemorate NULL 0.0000248

The entire second column is "NULL".
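
To confirm the symptom on the whole file rather than on a few lines, one can count how many entries have NULL in the second field. This is a minimal sketch that assumes the whitespace-separated, three-column layout shown above; the function name `null_fraction` is hypothetical and not part of Moses or MGIZA.

```python
def null_fraction(lines):
    """Return the share of entries whose second column is the token NULL."""
    total = 0
    nulls = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        total += 1
        if fields[1] == "NULL":
            nulls += 1
    return nulls / total if total else 0.0


# Sample entries copied from the output above.
sample = [
    "afterwards NULL 0.0000124",
    "pension NULL 0.0000372",
    "truss NULL 0.0000124",
]
print(null_fraction(sample))  # 1.0
```

A value of 1.0 means that no word was paired with anything other than NULL, which points at the alignment step rather than at the dictionary extraction itself.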

The command I used is:
mosesdecoder/scripts/training/train-model.perl --alignment grow-diag-final-and --root-dir bicleaner_inf/ --corpus bicleaner_inf/corpus.clean --e en --f zh --mgiza -mgiza-cpus 8 --parallel --first-step 1 --last-step 4 --external-bin-dir mgiza/mgizapp/bin/

It looks like the main error occurs in mgiza:

Merging A3.final.part* tables
Executing: enchmodels/mgiza/mgizapp/bin/merge_alignment.py enchmodels/bicleaner_inf/giza.zh-en/zh-en.A3.final.part*> enchmodels/bicleaner_inf/giza.zh-en/zh-en.A3.final
Traceback (most recent call last):
  File "enchmodels/mgiza/mgizapp/bin/merge_alignment.py", line 32, in <module>
    st1 = files[i].readline();
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 84: ordinal not in range(128)
Exit code: 1
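
For context, this kind of traceback can be reproduced outside MGIZA. The sketch below is not the code of merge_alignment.py. It only assumes that the script opens the A3.final.part* files in text mode without an explicit encoding, in which case Python 3.5 uses the locale's preferred encoding (ASCII under a C/POSIX locale). The byte 0xe5 starts the UTF-8 encoding of many Chinese characters, and ASCII cannot decode it. The helper `read_first_line` and the sample file contents are hypothetical.

```python
import os
import tempfile


def read_first_line(path, encoding):
    """Return the first line of `path`, decoded with the given encoding."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.readline()


# Hypothetical stand-in for an A3.final.part file that contains Chinese text.
sample = "# Sentence pair (1)\n大 家 好\n"
with tempfile.NamedTemporaryFile("wb", suffix=".part", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    sample_path = tmp.name

# In UTF-8, the character "大" is stored as the bytes e5 a4 a7.
first_bytes = "大".encode("utf-8")

# Decoding the file as ASCII fails in the same way as in the traceback.
try:
    with open(sample_path, "r", encoding="ascii") as handle:
        handle.read()
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding the same file as UTF-8 succeeds.
header = read_first_line(sample_path, "utf-8")
os.remove(sample_path)
```

Under that assumption, passing `encoding="utf-8"` when the files are opened, or running the script under a UTF-8 locale, may avoid the fallback to ASCII.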

After that, it prints a whole batch of errors like:

Use of uninitialized value $a in scalar chomp at enchmodels/mosesdecoder/scripts/training/LexicalTranslationModel.pm line 105
Use of uninitialized value in substitution (s///) at enchmodels/mosesdecoder/scripts/training/LexicalTranslationModel.pm line 40.

Solved this problem

It seems the best fix is to recompile MGIZA.
I used the instructions here:
https://hovinh.github.io/blog/2016-04-29-install-mgiza-ubuntu/