Custom machine translation recipes

This repository contains code samples to help train and optimise a custom translation model on Azure. See the Custom Translator documentation for more information.

Overview

This repo contains the following:

Stage	Scenario	Description
Analysis	Creating datasets	Clean translation memory files and generate train/test/tune datasets
Analysis	Generate Phrase Dictionary	Illustrates unsupervised approaches to building a Phrase Dictionary
Evaluation	Translator pipeline	A full source-to-target-language, multi-model evaluation pipeline
Evaluation	Evaluate and compare model results	Aggregate and compare all model results
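As a rough illustration of the "Creating datasets" scenario, the sketch below cleans a list of (source, target) segment pairs and splits it into train, tune and test sets. It is not the code shipped in this repo: the function names clean_pairs and split_pairs, the cleaning rules (strip whitespace, drop empty segments, drop exact duplicates) and the fixed-size split are all assumptions made for the example.

```python
import random


def clean_pairs(pairs):
    """Strip whitespace, drop pairs with an empty side, and drop exact duplicates.

    Hypothetical cleaning rules; first occurrences are kept in their original order.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        if not source or not target:
            continue
        if (source, target) in seen:
            continue
        seen.add((source, target))
        cleaned.append((source, target))
    return cleaned


def split_pairs(pairs, tune_size, test_size, seed=0):
    """Shuffle deterministically, carve off test and tune sets, keep the rest as train."""
    if tune_size + test_size >= len(pairs):
        raise ValueError("not enough segment pairs for the requested tune/test sizes")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    tune = shuffled[test_size:test_size + tune_size]
    train = shuffled[test_size + tune_size:]
    return train, tune, test


if __name__ == "__main__":
    # Made-up segments, only to show the call shape.
    memory = [
        ("Hello", "Bonjour"),
        ("Hello ", "Bonjour"),  # duplicate once whitespace is stripped
        ("", "Vide"),           # empty source side, dropped
        ("Thank you", "Merci"),
        ("Good night", "Bonne nuit"),
        ("Yes", "Oui"),
    ]
    train, tune, test = split_pairs(clean_pairs(memory), tune_size=1, test_size=1)
    print(len(train), len(tune), len(test))
```

Fixing the seed keeps the test set stable between runs, so that scores from different models remain comparable.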

Getting Started

The order in which to run these scripts is as follows:

1. Start by creating your datasets.
2. Train your models on these datasets, using a separate project per language.
3. Evaluate all models against your test document set.
4. Select the best model, including human evaluation in the decision.
5. Generate a Phrase Dictionary using your best model.
6. Consider creating a stylistically accurate tuning dataset.
7. Retrain the model using the Phrase Dictionary and the optimised tuning set.
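The evaluation and selection steps above can be sketched as a small ranking helper. This is a minimal example under stated assumptions, not the repo's evaluation pipeline: rank_models is a hypothetical name, the per-document scores are assumed to have been computed already with a higher-is-better metric (BLEU, for instance), and the model names and numbers in the usage block are invented. An automatic ranking like this is only a shortlist; the human evaluation step still applies.

```python
def rank_models(scores_by_model):
    """Return (model, mean score) pairs, best first.

    scores_by_model maps a model name to its per-document scores.
    Assumes a higher-is-better metric; ties are broken by model name.
    """
    ranked = []
    for model, scores in scores_by_model.items():
        if not scores:
            raise ValueError(f"no scores recorded for model {model!r}")
        ranked.append((model, sum(scores) / len(scores)))
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


if __name__ == "__main__":
    # Invented scores, only to show the call shape.
    results = {
        "baseline": [31.0, 29.0, 33.0],
        "custom-v1": [36.0, 34.0, 38.0],
    }
    for model, mean_score in rank_models(results):
        print(f"{model}: {mean_score:.1f}")
```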

Acknowledgements

The EAC_FORMS and EAC_REFRENCE sample data used in this repo is drawn from the EAC-Translation Memory Language Technology Resources released courtesy of the European Union's (EU) Directorate General for Education and Culture. It is © European Union and is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) licence.

About

Custom Translation Accelerators & Examples

MIT License
