Example application for fine-tuning pretrained machine translation models on highly domain-specific sentence translations.

Likely translation pairs are first extracted from the original English versions and the German translations of the Harry Potter fantasy novel series using a translated sentence mining approach. The extracted sentence pairs are then used to fine-tune two baseline machine translation models: the pre-trained MarianMT model for English-to-German translation and Google's Text-To-Text Transfer Transformer (T5).
Afterwards, several metrics are calculated to evaluate the performance gain from fine-tuning the models.
- Split the unaligned txt files for each book and its translation into sentences using Lingtrain Aligner's splitter and preprocessor
- Calculate language-independent sentence-level embeddings for the split sentences using Google AI's Language-agnostic BERT Sentence Embedding model (LaBSE) in the Sentence Transformers framework
- Match the best-fitting translation pair for each sentence using k-nearest-neighbors search, mostly following the Sentence Transformers example application for translated sentence mining
- Filter the sentence pairs by a minimum similarity score
- Remove sentence pairs containing sentences shorter than 20 or longer than 200 characters
- Split the resulting corpus of ~54,000 likely parallel sentences randomly into train, validation, and test sets (80%, 10%, 10%)
- Load the pre-trained models `Helsinki-NLP/opus-mt-en-de` (MarianMTModel) and `t5-base` (T5ForConditionalGeneration) from Hugging Face
- Fine-tune the models on the extracted parallel sentences using the train and validation sets for 10 epochs each (training time: 03h-04m-45s for MarianMT and 09h-20m-10s for T5 on an NVIDIA GeForce GTX 1660 Ti)
- Use the non-fine-tuned MarianMT and T5 models to get machine translations for a sample from the test set
- Use the fine-tuned models to get machine translations for a sample from the test set
- Calculate BLEU, METEOR, and BERTScore between the references and the target-language translations for both the non-fine-tuned and the fine-tuned models
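The mining and filtering steps above can be sketched as follows. This is a minimal illustration, not the repository's code: it runs an exact 1-nearest-neighbor search with cosine similarity over precomputed embedding matrices, whereas the actual pipeline obtains the embeddings from LaBSE via Sentence Transformers (and a faiss index can replace the brute-force matrix product for large corpora). The `mine_pairs` helper name and the threshold defaults are assumptions chosen to mirror the filters described above.

```python
import numpy as np

def mine_pairs(src_emb, tgt_emb, src_sents, tgt_sents,
               min_score=0.7, min_len=20, max_len=200):
    """Match every source sentence to its nearest target sentence by
    cosine similarity, then filter pairs by score and sentence length."""
    # Normalize rows so the plain dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T            # (n_src, n_tgt) similarity matrix
    best = sims.argmax(axis=1)    # index of each source's nearest target
    pairs = []
    for i, j in enumerate(best):
        s, t = src_sents[i], tgt_sents[j]
        if (sims[i, j] >= min_score
                and min_len <= len(s) <= max_len
                and min_len <= len(t) <= max_len):
            pairs.append((s, t, float(sims[i, j])))
    return pairs

# Toy example with hand-made 2-d "embeddings"; real LaBSE vectors are 768-d.
src_sents = ["Harry looked up at the castle.", "Hi."]
tgt_sents = ["Harry blickte hinauf zum Schloss.", "Hallo."]
src_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt_emb = np.array([[0.9, 0.1], [0.1, 0.9]])
pairs = mine_pairs(src_emb, tgt_emb, src_sents, tgt_sents)
# The second match ("Hi." / "Hallo.") is dropped by the 20-character
# minimum length filter, so only the first pair survives.
```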
Model | BLEU | METEOR | BERTScore¹ |
---|---|---|---|
MarianMT (baseline) | 0.256 | 0.433 | 0.597 |
MarianMT (fine-tuned) | 0.388 | 0.552 | 0.717 |
T5-base (baseline) | 0.166 | 0.307 | 0.309 |
T5-base (fine-tuned) | 0.340 | 0.492 | 0.662 |

¹ computed with the parameter `rescale_with_baseline` set to `True`
pytorch==1.7.1
cudatoolkit=10.1
pywin32
transformers
sentence_transformers
faiss-gpu
sacrebleu
datasets
bert-score
lingtrain-aligner
razdel
dateparser
python-dateutil
numpy
openpyxl
All files in this repository that contain text from the books are cut off after the first 50 rows.
The trained model files `pytorch_model.bin` and `optimizer.pt` for each model are omitted from this repository.