Overview
Task: Machine translation quality estimation (Sentence-Level Quality Estimation Shared Task of WMT20)
Datasets
- QE data from WMT20: https://github.com/facebookresearch/mlqe
- English-German and English-Chinese parallel data from News-Commentary
To run the code
- Set up configurations in `config.py`.
- Run the following command:

      python main.py -m <option> \
                     -d <dataset type> \
                     -f <data type>

  Five options for `-m`: `train`, `validate`, `predict`, `evaluate`, `ensemble`.
  Dataset types for `-d`: `train`, `valid`, `test`.
  Data types for `-f`: `*/*.tsv` for all files in the dataset folder specified above.
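Internally, these flags could be parsed with `argparse` along the following lines. This is a sketch based on the usage above; the long option names and the parser structure are assumptions, not the repository's actual code.

```python
import argparse

def parse_args(argv=None):
    # Sketch of the CLI described above; long flag names are illustrative.
    parser = argparse.ArgumentParser(description="WMT20 sentence-level QE")
    parser.add_argument("-m", "--mode", required=True,
                        choices=["train", "validate", "predict", "evaluate", "ensemble"],
                        help="pipeline stage to run")
    parser.add_argument("-d", "--dataset", required=True,
                        choices=["train", "valid", "test"],
                        help="dataset split to use")
    parser.add_argument("-f", "--files", default="*/*.tsv",
                        help="glob pattern for data files in the dataset folder")
    return parser.parse_args(argv)
```

With `choices` set, an invalid mode or split fails fast with a usage message instead of reaching the pipeline.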
- Or run the scripts directly as follows:

  **Evaluation**

      bash run_evaluate.sh

  **Ensemble**

      bash run_ensemble.sh
Baseline model
- OpenKiwi tool + mBERT pretrained vectors (see the OpenKiwi API page)
- 2018 QEbrain original transformer model and code
Submission Link:
https://competitions.codalab.org/competitions/24207
Experiments:
- Transformer-based predictor
- NCE (noise-contrastive estimation) and NEG (negative sampling) losses
- Fine-tuned pretrained models provided by WMT20
- Additional parallel data for en-de and en-zh pairs
- Ensembles
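As a reminder of the NEG objective used in these experiments, the loss for one target token maximizes the score of the observed token while pushing sampled negatives down. A minimal NumPy sketch (the function name and interface are illustrative, not the repository's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_loss(pos_score, neg_scores):
    """Negative-sampling (NEG) loss for one target token.

    pos_score: model score for the observed token.
    neg_scores: scores for the sampled negative tokens.
    """
    # -log sigma(s_pos) - sum over negatives of log sigma(-s_neg)
    neg_scores = np.asarray(neg_scores, dtype=float)
    return -np.log(sigmoid(pos_score)) - np.sum(np.log(sigmoid(-neg_scores)))
```

The loss shrinks toward zero when the positive score is large and every negative score is strongly negative.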
Usage:
- All configurations can be set in the config file (`config.py`). Important configurations include:
  - Model to train (Bilstmpredictor, Estimator, ...)
  - Paths for saving and loading checkpoints
  - Language pairs to use
  - Hyper-parameters (epochs, batch size, learning rate, ...)
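A config file covering the items above might look like the following sketch. All field names and default values here are illustrative assumptions, not the repository's actual `config.py` attributes.

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    # Model selection, e.g. a BiLSTM predictor or an estimator (names illustrative).
    model: str = "estimator"
    # Checkpoint I/O paths.
    save_checkpoint_dir: str = "checkpoints/"
    load_checkpoint_path: str = ""
    # Language pairs to train and evaluate on.
    language_pairs: list = field(default_factory=lambda: ["en-de", "en-zh"])
    # Optimization hyper-parameters.
    epochs: int = 10
    batch_size: int = 32
    learning_rate: float = 1e-4
```

A dataclass keeps every setting in one typed place and lets individual runs override only the fields they need.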
- The whole pipeline can be run from the main file (`main.py`), which includes:
  - Train
  - Predict
  - Evaluate
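The entry point can then dispatch on the chosen mode; a minimal sketch with placeholder stage functions (all names here are assumptions, not the repository's actual functions):

```python
def train(cfg):
    # Placeholder for the training stage.
    return "trained"

def predict(cfg):
    # Placeholder for the prediction stage.
    return "predicted"

def evaluate(cfg):
    # Placeholder for the evaluation stage.
    return "evaluated"

# Map each CLI mode to its stage function.
PIPELINE = {"train": train, "predict": predict, "evaluate": evaluate}

def run(mode, cfg=None):
    # Look up the requested stage and run it with the shared config.
    if mode not in PIPELINE:
        raise ValueError(f"unknown mode: {mode}")
    return PIPELINE[mode](cfg)
```

A dictionary dispatch keeps adding a new stage to a one-line change and rejects unknown modes explicitly.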