stefan-it / nmt-mk-en

Neural Machine Translation system for Macedonian to English

This repository contains all data and documentation for building a neural machine translation system for Macedonian to English. This work was done during the M.Sc. course Machine Translation (summer term) held by Prof. Dr. Alex Fraser.

Dataset

The SETimes corpus consists of 207,777 parallel sentences for the Macedonian-English language pair.

For all experiments the corpus was split into training, development and test sets:

| Data set    | Sentences | Download |
| ----------- | --------- | -------- |
| Training    | 205,777   | via GitHub or located in data/setimes.mk-en.train.tgz |
| Development | 1,000     | via GitHub or located in data/setimes.mk-en.dev.tgz |
| Test        | 1,000     | via GitHub or located in data/setimes.mk-en.test.tgz |
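
To sanity-check a downloaded split, the sentence counts can be verified after extraction. A minimal sketch, assuming each archive contains plain one-sentence-per-line files named after the split (e.g. train.mk and train.en; the actual file names may differ):

tar -xzf setimes.mk-en.train.tgz
wc -l train.mk train.en    # both files should contain 205,777 lines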

fairseq - Facebook AI Research Sequence-to-Sequence Toolkit

The first NMT system for Macedonian to English is built with fairseq. We trained three systems with different architectures:

  • Standard Bi-LSTM
  • CNN as encoder, LSTM as decoder
  • Fully convolutional

Preprocessing

All necessary scripts can be found in the scripts folder of this repository.

In the first step, we need to download and extract the parallel SETimes corpus for Macedonian to English:

wget http://nlp.ffzg.hr/data/corpora/setimes/setimes.en-mk.txt.tgz
tar -xf setimes.en-mk.txt.tgz

The data_preparation.sh script performs the following steps on the corpus:

  • download of the Moses tokenizer script; tokenization of the whole corpus
  • download of the BPE scripts; learning and applying BPE on the corpus

./data_preparation.sh setimes.en-mk.mk.txt setimes.en-mk.en.txt
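
The script roughly encapsulates the following pipeline. This is only a sketch, assuming the Moses tokenizer.perl and the subword-nmt scripts (learn_bpe.py / apply_bpe.py) are available in the current directory and 32,000 merge operations are used; the corpus cleaning step implied by the "clean" in the file names is omitted:

# tokenize both sides of the corpus with the Moses tokenizer
perl tokenizer.perl -l mk < setimes.en-mk.mk.txt > corpus.tok.mk
perl tokenizer.perl -l en < setimes.en-mk.en.txt > corpus.tok.en

# learn a joint BPE model with 32,000 merge operations and apply it to both sides
cat corpus.tok.mk corpus.tok.en | python learn_bpe.py -s 32000 > bpe.32000
python apply_bpe.py -c bpe.32000 < corpus.tok.mk > corpus.clean.bpe.32000.mk
python apply_bpe.py -c bpe.32000 < corpus.tok.en > corpus.clean.bpe.32000.en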

After that, the corpus is split into training, development and test sets:

./split_dataset corpus.clean.bpe.32000.mk corpus.clean.bpe.32000.en
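
A minimal sketch of what such a split can look like, assuming the last 2,000 sentence pairs are held out (1,000 for development, 1,000 for test) and the remaining 205,777 are used for training; the actual script may select the held-out sentences differently:

head -n 205777 corpus.clean.bpe.32000.mk > train.mk
head -n 205777 corpus.clean.bpe.32000.en > train.en
tail -n 2000 corpus.clean.bpe.32000.mk | head -n 1000 > dev.mk
tail -n 2000 corpus.clean.bpe.32000.en | head -n 1000 > dev.en
tail -n 1000 corpus.clean.bpe.32000.mk > test.mk
tail -n 1000 corpus.clean.bpe.32000.en > test.en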

The following folder structure needs to be created:

mkdir {train,dev,test}

mv dev.* dev
mv train.* train
mv test.* test

mkdir model-data

After that the fairseq tool can be invoked to preprocess the corpus:

fairseq preprocess -sourcelang mk -targetlang en -trainpref train/train \
                   -validpref dev/dev -testpref test/test -thresholdsrc 3 \
                   -thresholdtgt 3 -destdir model-data

Training

After the preprocessing steps, the three models can be trained.

Standard Bi-LSTM

With the following command the Bi-LSTM model can be trained:

fairseq train -sourcelang mk -targetlang en -datadir model-data -model blstm \
              -nhid 512 -dropout 0.2 -dropout_hid 0 -optim adam -lr 0.0003125 \
              -savedir model-blstm

CNN as encoder, LSTM as decoder

With the following command the CNN as encoder, LSTM as decoder model can be trained:

fairseq train -sourcelang mk -targetlang en -datadir model-data -model conv \
              -nenclayer 6 -dropout 0.2 -dropout_hid 0 -savedir model-conv

Fully convolutional

With the following command the fully convolutional model can be trained:

fairseq train -sourcelang mk -targetlang en -datadir model-data -model fconv \
              -nenclayer 4 -nlayer 3 -dropout 0.2 -optim nag -lr 0.25 \
              -clip 0.1 -momentum 0.99 -timeavg -bptt 0 -savedir model-fconv

Decoding

Standard Bi-LSTM

With the following command the Bi-LSTM model can decode the test set:

fairseq generate -sourcelang mk -targetlang en \
                 -path model-blstm/model_best.th7 -datadir model-data -beam 10 \
                 -nbest 1 -dataset test > model-blstm/system.output

CNN as encoder, LSTM as decoder

With the following command the CNN as encoder, LSTM as decoder model can decode the test set:

fairseq generate -sourcelang mk -targetlang en -path model-conv/model_best.th7 \
                 -datadir model-data -beam 10 -nbest 1 \
                 -dataset test > model-conv/system.output

Fully convolutional

With the following command the fully convolutional model can decode the test set:

fairseq generate -sourcelang mk -targetlang en -path model-fconv/model_best.th7 \
                 -datadir model-data -beam 10 -nbest 1 \
                 -dataset test > model-fconv/system.output

Calculating the BLEU-score

With the helper script fairseq_bleu.sh the BLEU score of all models can be calculated easily. The script expects the system output file as a command-line argument:

./fairseq_bleu.sh model-blstm/system.output
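
For reference, a rough sketch of what such a helper can do, assuming the hypothesis lines in the fairseq generate output are tab-separated and prefixed with H, that BPE is undone before scoring, and that Moses' multi-bleu.perl is used (all of these are assumptions, not the actual script):

# extract hypotheses, undo BPE and score against the (de-BPEd) reference
grep ^H model-blstm/system.output | cut -f3 | sed 's/@@ //g' > hyp.txt
sed 's/@@ //g' test/test.en > ref.txt
perl multi-bleu.perl ref.txt < hyp.txt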

Results

We used two different numbers of BPE merge operations: 16,000 and 32,000. Here are the results on the final test set:

| Model                     | BPE merge operations | BLEU score |
| ------------------------- | -------------------- | ---------- |
| Bi-LSTM                   | 32,000               | 46.84      |
| Bi-LSTM                   | 16,000               | 47.57      |
| CNN encoder, LSTM decoder | 32,000               | 19.83      |
| CNN encoder, LSTM decoder | 16,000               | 9.59       |
| Fully convolutional       | 32,000               | 48.81      |
| Fully convolutional       | 16,000               | 49.03      |

The best BLEU score was obtained with the fully convolutional model using 16,000 merge operations.

tensor2tensor - Transformer

The second NMT system for Macedonian to English is built with the tensor2tensor library. We trained two systems: one subword-based system and one character-based NMT system.

Notice: The problem description for this task can be found in translate_enmk.py in the root of the repository. This problem was once directly included and available in tensor2tensor, but I decided to replace the integrated tensor2tensor problem for Macedonian to English with a more challenging one. To replicate all experiments in this repository, the translate_enmk.py problem is now a user-defined problem and must be included in the following way:

cp translate_enmk.py /tmp
echo "from . import my_submodule" > /tmp/__init__.py

To use this problem, the --t2t_usr_dir command-line option must point to the appropriate folder (in this example /tmp). For more information about user-defined problems, see the official documentation.

Training (Transformer base)

The following training steps were tested with tensor2tensor version 1.5.1.
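
If tensor2tensor is not installed yet, pinning it to this version avoids incompatibilities with the commands below:

pip install tensor2tensor==1.5.1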

First, we create the initial directory structure:

mkdir -p t2t_data t2t_datagen t2t_train t2t_output

In the next step, the training and development datasets are downloaded and prepared:

t2t-datagen --data_dir=t2t_data --tmp_dir=t2t_datagen/ \
  --problem=translate_enmk_setimes32k --t2t_usr_dir /tmp

Then the training step can be started:

t2t-trainer --data_dir=t2t_data --problems=translate_enmk_setimes32k_rev \
  --model=transformer --hparams_set=transformer_base --output_dir=t2t_output \
  --t2t_usr_dir /tmp

The number of GPUs used for training can be specified with the --worker_gpu option.
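
For example, training on four GPUs (the GPU count is only an illustration):

t2t-trainer --data_dir=t2t_data --problems=translate_enmk_setimes32k_rev \
  --model=transformer --hparams_set=transformer_base --output_dir=t2t_output \
  --t2t_usr_dir /tmp --worker_gpu=4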

Decoding

In the next step, the test dataset is downloaded and extracted:

wget "https://github.com/stefan-it/nmt-mk-en/raw/master/data/setimes.mk-en.test.tgz"
tar -xzf setimes.mk-en.test.tgz

Then the decoding step for the test dataset can be started:

t2t-decoder --data_dir=t2t_data --problems=translate_enmk_setimes32k_rev \
  --model=transformer --decode_hparams="beam_size=4,alpha=0.6" \
  --decode_from_file=test.mk --decode_to_file=system.output \
  --hparams_set=transformer_base --output_dir=t2t_output/ \
  --t2t_usr_dir /tmp

Calculating the BLEU-score

The BLEU-score can be calculated with the built-in t2t-bleu tool:

t2t-bleu --translation=system.output --reference=test.en

Results

The following results can be achieved using the Transformer model. A character-based model was also trained and evaluated. A big Transformer model was also trained, using tensor2tensor version 1.2.9 (the latest version has a bug, see this issue).

| Model                    | BLEU score      |
| ------------------------ | --------------- |
| Transformer              | 54.00 (uncased) |
| Transformer (big)        | 43.74 (uncased) |
| Transformer (char-based) | 37.43 (uncased) |

Further work

We want to train a char-based NMT system with the dl4mt-c2c library in the near future.

Acknowledgments

We would like to thank the Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften (LRZ) for giving us access to the NVIDIA DGX-1 supercomputer.

Presentations
