Easy-Translate is a script for translating large text files on your machine using the M2M100 and NLLB200 models from Facebook/Meta AI. We also provide a script for Easy-Evaluation of your translations 🥳
Easy-Translate is built on top of 🤗HuggingFace's Transformers and Accelerate libraries.
We currently support:
- CPU / multi-CPU / GPU / multi-GPU / TPU acceleration
- BF16 / FP16 / FP32 precision.
- Automatic batch size finder: forget CUDA OOM errors. Set an initial batch size; if it doesn't fit, we will automatically reduce it until it does.
- Sharded Data Parallel to load huge models sharded on multiple GPUs (See: https://huggingface.co/docs/accelerate/fsdp).
- Greedy decoding / Beam Search decoding / Multinomial Sampling / Beam-Search Multinomial Sampling
Test the 🔌 Online Demo here: https://huggingface.co/spaces/Iker/Translate-100-languages
See the Supported Languages table for the list of supported languages and their IDs.
M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation introduced in this paper and first released in this repository.
M2M100 can translate directly between any pair of its 100 languages, covering 9,900 translation directions.
- Facebook/m2m100_418M: https://huggingface.co/facebook/m2m100_418M
- Facebook/m2m100_1.2B: https://huggingface.co/facebook/m2m100_1.2B
- Facebook/m2m100_12B: https://huggingface.co/facebook/m2m100-12B-avg-5-ckpt
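For reference, this is roughly what translation with an M2M100 checkpoint looks like in plain Transformers code (a minimal sketch, not the actual implementation in translate.py; the checkpoint and example sentence are arbitrary choices):

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Load the smallest M2M100 checkpoint (418M parameters)
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Tokenize an English sentence and force the decoder to produce Spanish
tokenizer.src_lang = "en"
encoded = tokenizer("Hello, how are you?", return_tensors="pt")
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("es"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))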
No Language Left Behind (NLLB) open-sources models capable of delivering high-quality translations directly between any pair of 200+ languages — including low-resource languages like Asturian, Luganda, Urdu and more. It aims to help people communicate with anyone, anywhere, regardless of their language preferences. It was introduced in this paper and first released in this repository.
NLLB can translate directly between more than 40,000 language pairs across its 200+ languages.
- facebook/nllb-200-3.3B: https://huggingface.co/facebook/nllb-200-3.3B
- facebook/nllb-200-1.3B: https://huggingface.co/facebook/nllb-200-1.3B
- facebook/nllb-200-distilled-1.3B: https://huggingface.co/facebook/nllb-200-distilled-1.3B
- facebook/nllb-200-distilled-600M: https://huggingface.co/facebook/nllb-200-distilled-600M
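NLLB-200 checkpoints work the same way but use BCP-47-style language codes such as eng_Latn or spa_Latn (again a minimal sketch using the standard Transformers API, not the code in translate.py):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn"
)

encoded = tokenizer("Hello, how are you?", return_tensors="pt")
# convert_tokens_to_ids maps the target language code to its token id
generated = model.generate(
    **encoded, forced_bos_token_id=tokenizer.convert_tokens_to_ids("spa_Latn")
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))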
Any other ModelForSeq2SeqLM from HuggingFace's Hub should work with this library: https://huggingface.co/models?pipeline_tag=text2text-generation
If you use this software, please cite:
@inproceedings{garcia-ferrero-etal-2022-model,
title = "Model and Data Transfer for Cross-Lingual Sequence Labelling in Zero-Resource Settings",
author = "Garc{\'\i}a-Ferrero, Iker and
Agerri, Rodrigo and
Rigau, German",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-emnlp.478",
pages = "6403--6416",
}
PyTorch >= 1.10.0
See: https://pytorch.org/get-started/locally/
Accelerate >= 0.12.0
pip install --upgrade accelerate
HuggingFace Transformers
pip install --upgrade transformers
If you find errors using NLLB200, try installing transformers from source:
pip install git+https://github.com/huggingface/transformers.git
Run python translate.py -h for more information.
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B
See Accelerate documentation for more information (multi-node, TPU, Sharded model...): https://huggingface.co/docs/accelerate/index
You can use the Accelerate CLI to configure the Accelerate environment (run accelerate config in your terminal) instead of using the --multi_gpu and --num_processes flags.
# Use 2 GPUs
accelerate launch --multi_gpu --num_processes 2 --num_machines 1 translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B
We will automatically find a batch size that fits in your GPU memory. The default initial batch size is 128 (you can set it with the --starting_batch_size flag). If we find an Out Of Memory error, we will automatically decrease the batch size until we find one that works.
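This matches the pattern of Accelerate's find_executable_batch_size utility, which retries with a smaller batch size whenever a CUDA OOM error is raised (a sketch of the idea; chunk_sentences and translate_batch are hypothetical helpers, not functions from this repo):

from accelerate.utils import find_executable_batch_size

@find_executable_batch_size(starting_batch_size=128)
def translate_all(batch_size):
    # Re-invoked with a smaller batch_size each time a CUDA OOM error is caught
    for batch in chunk_sentences(sentences, batch_size):  # hypothetical helper
        translate_batch(batch)                            # hypothetical helper

translate_all()  # batch_size is injected by the decorator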
Use the --precision flag to choose the precision of the model. You can choose between bf16, fp16 and 32.
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--precision fp16
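Internally, precision selection amounts to loading the model with the matching torch dtype (an assumed sketch of the mapping, not the literal code in translate.py):

import torch
from transformers import AutoModelForSeq2SeqLM

# Assumed mapping from --precision values to torch dtypes
DTYPES = {"bf16": torch.bfloat16, "fp16": torch.float16, "32": torch.float32}
model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/m2m100_1.2B", torch_dtype=DTYPES["fp16"]
)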
You can choose the decoding/sampling strategy and the number of candidate translations to output for each input sentence. By default we use beam search with num_beams set to 5 and output the most likely candidate translation, but you can change this behavior:
# Greedy decoding
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--num_beams 1
# Multinomial sampling
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--num_beams 1 \
--do_sample \
--temperature 0.5 \
--top_k 100 \
--top_p 0.8 \
--num_return_sequences 1
# Beam search (the default)
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--num_beams 5 \
--num_return_sequences 1
# Beam search with multinomial sampling
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--num_beams 5 \
--num_return_sequences 1 \
--do_sample \
--temperature 0.5 \
--top_k 100 \
--top_p 0.8
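All of these flags map onto arguments of Transformers' generate method; a rough sketch of the correspondence (model and encoded are placeholders from the earlier sketches):

# How the CLI flags line up with transformers' generate() arguments
generated = model.generate(
    **encoded,
    num_beams=5,             # --num_beams
    do_sample=True,          # --do_sample
    temperature=0.5,         # --temperature
    top_k=100,               # --top_k
    top_p=0.8,               # --top_p
    num_return_sequences=1,  # --num_return_sequences
)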
To run the evaluation script you need to install bert_score (pip install bert_score) and 🤗HuggingFace's Datasets library (pip install datasets).
The evaluation script will calculate automatic translation metrics for your output, including BERTScore.
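For instance, BERTScore can also be computed standalone with the bert_score package (an illustrative sketch, independent of eval.py):

from bert_score import score

with open("sample_text/en2es.translation.m2m100_1.2B.txt") as f:
    predictions = [line.strip() for line in f]
with open("sample_text/es.txt") as f:
    references = [line.strip() for line in f]

# Per-sentence precision, recall, and F1 tensors
P, R, F1 = score(predictions, references, lang="es")
print(f"BERTScore F1: {F1.mean().item():.4f}")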
Run the following command to evaluate the translations:
accelerate launch eval.py \
--pred_path sample_text/en2es.translation.m2m100_1.2B.txt \
--gold_path sample_text/es.txt
If you want to save the results to a file, use the --output_path flag.
See sample_text/en2es.m2m100_1.2B.json for a sample output.