This repository contains code samples to help you train and optimise a custom translation model on Azure. See the Custom Translator documentation for more information.
This repo contains the following:
Stage | Scenario | Description |
---|---|---|
Analysis | Creating datasets | Clean translation memory files and generate train/test/tune datasets |
Analysis | Generate Phrase Dictionary | Illustrates unsupervised approaches to building a Phrase Dictionary |
Evaluation | Translator pipeline | A full multi-model evaluation pipeline, from source language to target language |
Evaluation | Evaluate and compare model results | Aggregate and compare all model results |
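
As a rough illustration of the "Creating datasets" scenario, the sketch below cleans a list of sentence pairs and splits it into train/test/tune sets. It is not the code from this repo: the function names `clean_pairs` and `split_pairs`, the split sizes, and the assumption that the translation memory has already been parsed into `(source, target)` tuples are all hypothetical.

```python
import random


def clean_pairs(pairs):
    """Drop empty and duplicate (source, target) pairs, keeping first-seen order."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        # Skip pairs where either side is empty after trimming whitespace.
        if not key[0] or not key[1]:
            continue
        # Skip exact duplicates so the same pair cannot land in two splits.
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(key)
    return cleaned


def split_pairs(pairs, test_size, tune_size, seed=0):
    """Shuffle deterministically, then split into (train, test, tune) lists."""
    if test_size + tune_size >= len(pairs):
        raise ValueError("test_size + tune_size must be smaller than the dataset")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    tune = shuffled[test_size:test_size + tune_size]
    train = shuffled[test_size + tune_size:]
    return train, test, tune


# Hypothetical sample data (assumption): 20 unique pairs, one duplicate, one empty target.
raw = [(f"source {i}", f"target {i}") for i in range(20)]
raw += [("source 0 ", "target 0"), ("orphan", "  ")]

pairs = clean_pairs(raw)
train, test, tune = split_pairs(pairs, test_size=4, tune_size=4)
```

Deduplicating before splitting matters here: a pair that appears in both the training and test sets would inflate the evaluation scores.
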
The order in which to run these scripts is as follows:
- Start by creating your datasets
- Train your models on these datasets using different projects per language
- Evaluate all models against your test document set
- Select the best model - include human evaluation
- Generate a Phrase Dictionary using your best model
- Consider creating a stylistically accurate tuning dataset
- Retrain the model using the Phrase Dictionary and optimised tuning set
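
The "evaluate all models" and "select the best model" steps above could be sketched as follows, assuming each model has already been scored per test document with some automatic metric where higher is better. The function name `rank_models`, the model names, and the score values are hypothetical placeholders, not results from this repo.

```python
from statistics import mean


def rank_models(scores_by_model):
    """Return (model, mean score) tuples sorted from best to worst.

    Assumes higher scores are better; models with no scores are skipped.
    """
    summary = {
        model: mean(scores)
        for model, scores in scores_by_model.items()
        if scores
    }
    return sorted(summary.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-document scores for three models (placeholder values).
scores_by_model = {
    "baseline": [30, 34, 32],
    "custom_v1": [40, 38, 42],
    "custom_v2": [36, 35, 37],
}

ranking = rank_models(scores_by_model)
best_model, best_score = ranking[0]
```

An automatic ranking like this is only a shortlist; the final choice should still include human evaluation, as the steps above note.
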
The EAC_FORMS and EAC_REFRENCE sample data used in this repo is drawn from the EAC-Translation Memory Language Technology Resources released courtesy of the European Union's (EU) Directorate General for Education and Culture. It is © European Union and is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) licence.