shanepeckham / custom-machine-translation-recipes

Custom Translation Accelerators & Examples

Custom machine translation recipes

This repository contains code samples to help you train and optimise a custom translation model on Azure. See the Custom Translator documentation for more information.

Overview

This repo contains the following:

| Stage | Scenario | Description |
| --- | --- | --- |
| Analysis | Creating datasets | Clean translation memory files and generate train/test/tune datasets |
| Analysis | Generate Phrase Dictionary | Illustrates unsupervised approaches to building a Phrase Dictionary |
| Evaluation | Translator pipeline | A full source-to-target-language, multi-model evaluation pipeline |
| Evaluation | Evaluate and compare model results | Aggregate and compare all model results |
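
As a rough illustration of the "Creating datasets" scenario, the sketch below cleans a list of (source, target) segment pairs and splits them into train/test/tune sets. It is not the code from this repository: the function names `clean_pairs` and `split_pairs`, the `max_len_ratio` threshold, and the fixed-size split are all assumptions made for the example, and it presumes the translation memory has already been parsed into Python tuples.

```python
import random
import re


def clean_pairs(pairs, max_len_ratio=3.0):
    """Drop empty, duplicate, and badly length-mismatched segment pairs.

    `max_len_ratio` is a hypothetical threshold chosen for illustration.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        # Collapse runs of whitespace and trim both sides.
        source = re.sub(r"\s+", " ", source).strip()
        target = re.sub(r"\s+", " ", target).strip()
        if not source or not target:
            continue
        # A large character-length ratio often signals a misaligned pair.
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_len_ratio:
            continue
        if (source, target) in seen:
            continue
        seen.add((source, target))
        cleaned.append((source, target))
    return cleaned


def split_pairs(pairs, test_size, tune_size, seed=0):
    """Shuffle deterministically, then split into train/test/tune lists."""
    if test_size + tune_size > len(pairs):
        raise ValueError("test_size + tune_size exceeds the number of pairs")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    tune = shuffled[test_size:test_size + tune_size]
    train = shuffled[test_size + tune_size:]
    return train, test, tune
```

Deduplicating before the split keeps an identical segment pair from landing in both the training and test sets, which would otherwise inflate evaluation scores.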

Getting Started

The order in which to run these scripts is as follows:

  1. Start by creating your datasets
  2. Train your models on these datasets using different projects per language
  3. Evaluate all models against your test document set
  4. Select the best model, including human evaluation in the decision
  5. Generate a Phrase Dictionary using your best model
  6. Consider creating a stylistically accurate tuning dataset
  7. Retrain the model using the Phrase Dictionary and optimised tuning set
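
Steps 3 and 4 above amount to aggregating per-document scores for each model and shortlisting candidates for human review. The sketch below shows one way this could look. It is an assumption-laden example rather than the repository's evaluation code: `rank_models` and `shortlist` are hypothetical helpers, and the scores are assumed to be on a scale where higher is better (for example, a BLEU-style metric).

```python
from statistics import mean


def rank_models(scores_by_model):
    """Return (model, mean_score) tuples sorted from best to worst.

    `scores_by_model` maps a model name to its per-document scores;
    higher scores are assumed to be better.
    """
    ranked = [
        (model, mean(scores))
        for model, scores in scores_by_model.items()
        if scores  # skip models that produced no results
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


def shortlist(scores_by_model, top_n=2):
    """Names of the top-scoring models to pass on to human evaluation."""
    return [model for model, _ in rank_models(scores_by_model)[:top_n]]
```

An automatic metric only narrows the field here. The final choice in step 4 still rests on human evaluation of the shortlisted models.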

Acknowledgements

The EAC_FORMS and EAC_REFRENCE sample data used in this repo is drawn from the EAC-Translation Memory Language Technology Resources released courtesy of the European Union's (EU) Directorate General for Education and Culture. It is © European Union and is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) licence.

About

License: MIT License

