Firefox Translations Evaluation

Calculates BLEU and COMET scores for Firefox Translations models using bergamot-translator and compares them to other translation systems.

Running

We recommend running this on a Linux machine with at least one GPU, inside a Docker container. If you intend to run it on macOS, run the eval/evaluate.py script standalone inside a virtualenv and skip the "Start docker" section below; you might need to manually install the packages listed in the Dockerfile on your system and in your virtual environment.
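
A minimal sketch of such a standalone setup on macOS, after cloning the repository (see below); the requirements file path is an assumption, so adjust it to your checkout or install the packages from the Dockerfile manually:

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt  # assumed path; see the Dockerfile for the actual dependencies
python3 eval/evaluate.py --help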

Clone repo

git clone https://github.com/mozilla/firefox-translations-evaluation.git
cd firefox-translations-evaluation

Download models

Use install/download-models.sh to download the Firefox Translations models (make sure git-lfs is installed and enabled), or use your own models.
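
For example, with git-lfs available, the download might look like this (where the models end up depends on the script, so check its output before setting MODELS in the next steps):

git lfs install
bash install/download-models.sh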

Install NVIDIA Container Toolkit

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
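
As a quick sanity check that Docker can access your GPU, you can run something like the following (the CUDA image tag is only an example):

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi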

Start docker

The recommended memory size for Docker is 16 GB.

export MODELS=<absolute path to a local directory with models>

# Specify Azure key and location if you want to add Azure Translator API for comparison
export AZURE_TRANSLATOR_KEY=<Azure translator resource API key>
# optional, specify if it's different from the default 'global'
export AZURE_LOCATION=<location>

# Specify GCP credentials json path if you want to add Google Translator API for comparison
export GCP_CREDS_PATH=<absolute path to .json>

# Build and run docker container
bash start_docker.sh

On completion, your terminal should be attached to the launched container.

Run evaluation

From inside the Docker container, run:

python3 eval/evaluate.py --translators=bergamot,microsoft,google --pairs=all --skip-existing --gpus=1 --evaluation-engine=comet,bleu --models-dir=/models/models/prod --results-dir=/models/evaluation/prod

If you don't have a GPU, set the --gpus argument to 0.
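
For instance, a CPU-only run might look like the following (restricted to BLEU here, since COMET is a neural metric and is slow without a GPU):

python3 eval/evaluate.py --translators=bergamot,microsoft,google --pairs=all --skip-existing --gpus=0 --evaluation-engine=bleu --models-dir=/models/models/prod --results-dir=/models/evaluation/prod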

More options:

python3 eval/evaluate.py --help

Details

Installation scripts

install/install-bergamot-translator.sh - clones and compiles bergamot-translator and marian (launched in the Docker image).

install/download-models.sh - downloads the current Mozilla production models.

Docker & CUDA

The COMET evaluation framework supports CUDA. Enable it by setting the --gpus argument of the eval/evaluate.py script to the number of GPUs you wish to utilize (0 disables it). If you use it, make sure the NVIDIA Container Toolkit is enabled in your Docker setup.

Translators

  1. bergamot - uses the compiled bergamot-translator in WASM mode
  2. marian - uses the compiled marian
  3. google - uses the Google Translation API
  4. microsoft - uses the Azure Cognitive Services Translator API

Reuse already calculated scores

Use the --skip-existing option to reuse already calculated scores saved as results/xx-xx/*.bleu files. This is useful for continuing an evaluation that was interrupted, or for rebuilding a full report while re-evaluating only selected translators.
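
For example, if a run was interrupted, repeating the same command with --skip-existing should pick up where it left off, recomputing only the score files that are missing (this behavior is inferred from the option description above):

python3 eval/evaluate.py --translators=bergamot,microsoft,google --pairs=all --skip-existing --gpus=1 --evaluation-engine=comet,bleu --models-dir=/models/models/prod --results-dir=/models/evaluation/prod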

Datasets

SacreBLEU - all available datasets for a language pair are used for evaluation.

Flores - a parallel evaluation dataset for 101 languages.

Language pairs

With the --pairs=all option, language pairs are discovered in the specified models folder (option --models-dir) and the evaluation runs for all of them.

Results

Results will be written to the specified directory (option --results-dir).

Evaluation results for the models used in Firefox Translations can be found in firefox-translations-models/evaluation.

About

Translation quality evaluation for Firefox Translations models

License: Mozilla Public License 2.0

