MTQE

Machine translation quality estimation in PyTorch

Overview

Task: Machine translation quality estimation

Sentence-level Quality Estimation Shared Task of WMT20

Datasets

  1. QE data from WMT20: https://github.com/facebookresearch/mlqe
  2. English-German and English-Chinese parallel data from News-Commentary

To run the code

  1. Set up configurations in config.py
  2. Type in the following command:
python main.py -m <option> \
               -d <dataset type> \
               -f <data type>

Five options (-m): train, validate, predict, evaluate, ensemble
Dataset types (-d): train, valid, test
Data types (-f): a glob such as */*.tsv, selecting files in the dataset folder specified above.
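For example, a training run over all TSV files in the training split might look like this (the exact option values depend on your config.py settings):

```shell
python main.py -m train -d train -f "*/*.tsv"
```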

  3. Or run the provided scripts directly:
**Evaluation**
bash run_evaluate.sh
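The WMT20 sentence-level QE task scores systems primarily by Pearson correlation between predicted and gold quality scores. The repo's evaluation code is not shown here, but the metric itself can be sketched in plain Python:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between predicted and gold sentence scores."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)
```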

**Ensemble**
bash run_ensemble.sh
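The ensemble step is not documented in detail; a minimal sketch of what ensembling sentence-level predictions can look like is simple score averaging (the function below is illustrative, not the repo's actual API):

```python
def ensemble_scores(model_outputs):
    """Average sentence-level QE scores across models.

    model_outputs: list of per-model score lists, all aligned to the
    same test sentences.
    """
    n_models = len(model_outputs)
    n_sents = len(model_outputs[0])
    assert all(len(scores) == n_sents for scores in model_outputs)
    return [sum(scores[i] for scores in model_outputs) / n_models
            for i in range(n_sents)]
```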

Baseline model

  1. OpenKiwi tool + mBERT pretrained vectors

    a) API page

    b) 2019 QE + BERT/XLM

    c) 2018 QEbrain transformer (code, paper, and original transformer model)

    d) 2018 Automatic Post-editing

    e) 2017 QE + Bilstm (including the WMT17 task version)

Submission Link:

https://competitions.codalab.org/competitions/24207

Experiments:

  1. Transformer-based predictor
  2. NCE and NEG loss
  3. Fine-tuned pretrained models provided by WMT20
  4. Additional parallel data for en-de and en-zh pairs
  5. Ensembles
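The NEG (negative sampling) loss listed above, like NCE, avoids computing a full softmax over the target vocabulary by scoring the observed token against a few sampled negatives. A self-contained sketch in plain Python (the repo's actual implementation presumably works on PyTorch tensors):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neg_loss(pos_score, neg_scores):
    """Negative-sampling loss for one target token.

    pos_score: model score for the observed target token.
    neg_scores: scores for k sampled negative tokens.
    Loss = -log sigmoid(s_pos) - sum_k log sigmoid(-s_neg).
    """
    loss = -math.log(sigmoid(pos_score))
    for s in neg_scores:
        loss -= math.log(sigmoid(-s))
    return loss
```

A confident model (high positive score, low negative scores) drives this loss toward zero.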

Usage:

  1. All configurations are set in config.py. Important options include:

    1. Trained model (Bilstmpredictor, Estimator, ...)
    2. Paths for saving and loading checkpoints
    3. Language pairs to use
    4. Hyper-parameters (epochs, batch size, learning rate, ...)
  2. The whole pipeline is run from main.py, which supports:

    1. Train
    2. Predict
    3. Evaluate
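A sketch of what the configuration in config.py might contain, based on the options listed above (all field names here are guesses for illustration; check the actual file for the real ones):

```python
# Hypothetical configuration values; the real names live in config.py.
CONFIG = {
    "model": "Estimator",              # e.g. Bilstmpredictor, Estimator, ...
    "checkpoint_dir": "checkpoints/",  # where checkpoints are saved/loaded
    "language_pair": "en-de",          # en-de or en-zh
    "epochs": 10,                      # hyper-parameters
    "batch_size": 32,
    "learning_rate": 1e-4,
}
```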


Languages

Jupyter Notebook 96.1%, Python 3.9%, Shell 0.1%