jerinphilip / UCE4BT

Uncertainty-based Confidence Estimation for Back-Translation

Improving Back-Translation with Uncertainty-based Confidence Estimation

Introduction

This is the implementation of our paper 'Improving Back-Translation with Uncertainty-based Confidence Estimation' (EMNLP 2019). It is built on top of THUMT.

Prerequisites

This repository runs in the same environment as THUMT. Please refer to the THUMT user manual to configure the environment.

Usage

Note: The usage is not yet user-friendly. We may improve it later.
Suppose the local path to this repository is CODE_DIR.

  1. Standard training:
python [CODE_DIR]/thumt/bin/trainer.py \
	--input [source corpus] [target corpus] \
	--side none \
	--vocabulary [source vocabulary] [target vocabulary] \
	--model transformer \
	--parameters=train_steps=60000,constant_batch_size=false,batch_size=6250,device_list=[0,1,2,3]

To train a target-to-source translation model, swap the source and target corpora and the source and target vocabularies in the command above.
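The swap can be made less error-prone by building the command in one place. The following is a minimal sketch, not part of this repository: `build_trainer_args` and the file names are hypothetical, and it uses only the flags shown in the standard training command above.

```python
import os


def build_trainer_args(code_dir, src_corpus, tgt_corpus, src_vocab, tgt_vocab,
                       reverse=False,
                       parameters="train_steps=60000,constant_batch_size=false,"
                                  "batch_size=6250,device_list=[0,1,2,3]"):
    """Return the argument list for the standard training command.

    Hypothetical helper: with reverse=True the corpora and vocabularies are
    exchanged, which yields the target-to-source training command.
    """
    if reverse:
        src_corpus, tgt_corpus = tgt_corpus, src_corpus
        src_vocab, tgt_vocab = tgt_vocab, src_vocab
    return [
        "python", os.path.join(code_dir, "thumt", "bin", "trainer.py"),
        "--input", src_corpus, tgt_corpus,
        "--side", "none",
        "--vocabulary", src_vocab, tgt_vocab,
        "--model", "transformer",
        "--parameters=" + parameters,
    ]
```

The returned list can be passed to `subprocess.call(args)`. Because no shell is involved, the square brackets in `device_list=[0,1,2,3]` are passed through unchanged.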

  2. Translate the target-side monolingual corpus:
python [CODE_DIR]/thumt/bin/translator.py \
	--input [monolingual corpus] \
	--output [translated corpus] \
	--vocabulary [target vocabulary] [source vocabulary] \
	--model transformer \
	--checkpoint [path to the target-source model] \
	--parameters=device_list=[0]

We recommend splitting the monolingual corpus into smaller files before translation if it is very large.
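One way to do the split is sketched below. This is an illustrative helper rather than a script shipped with this repository: the function names and the `.partNNNN` suffix are assumptions. Each shard can be translated separately, and the translated shards merged back in the same order so that line alignment with the original monolingual corpus is preserved.

```python
import io


def split_corpus(path, lines_per_shard, out_prefix=None):
    """Split a text file into shards of at most lines_per_shard lines.

    Returns the shard paths in order. The shard naming scheme is hypothetical.
    """
    if lines_per_shard <= 0:
        raise ValueError("lines_per_shard must be positive")
    prefix = out_prefix or path
    shard_paths = []
    out = None
    with io.open(path, "r", encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i % lines_per_shard == 0:
                # Start a new shard every lines_per_shard lines.
                if out is not None:
                    out.close()
                shard_path = "{}.part{:04d}".format(prefix, len(shard_paths))
                shard_paths.append(shard_path)
                out = io.open(shard_path, "w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return shard_paths


def merge_shards(shard_paths, merged_path):
    """Concatenate shards (for example, translated ones) in the given order."""
    with io.open(merged_path, "w", encoding="utf-8") as dst:
        for shard_path in shard_paths:
            with io.open(shard_path, "r", encoding="utf-8") as shard:
                for line in shard:
                    dst.write(line)
```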

  3. Confidence-aware training:
python [CODE_DIR]/thumt/bin/trainer.py \
	--input [source corpus] [target corpus] \
	--word_confidence [word-level uncertainty file] \
	--sen_confidence [sentence-level uncertainty file] \
	--side source_sentence_source_word \
	--vocabulary [source vocabulary] [target vocabulary] \
	--model transformer \
	--checkpoint [path to the source-target checkpoint] \
	--parameters=train_steps=60000,constant_batch_size=false,batch_size=6250,device_list=[0,1,2,3]
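Before launching confidence-aware training, it may help to check that the input files line up. The sketch below is a hypothetical helper, not part of this repository. The source and target corpora must have the same number of lines because they are parallel. That the two uncertainty files also contain one line per sentence pair is an assumption, since their format is not documented here.

```python
import io


def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with io.open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_line_aligned(paths):
    """Raise ValueError unless all files have the same number of lines.

    Returns the shared line count (0 for an empty list of paths).
    Assumption: every file passed in stores one line per sentence pair.
    """
    counts = {path: count_lines(path) for path in paths}
    if len(set(counts.values())) > 1:
        raise ValueError("Line counts differ: {}".format(counts))
    return next(iter(counts.values())) if counts else 0
```

For example, `check_line_aligned([source_corpus, target_corpus])` verifies the parallel corpora. Add the uncertainty files to the list only if they follow the one-line-per-sentence assumption.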

Contact

If you have questions, suggestions, or bug reports, please email wangshuo18@mails.tsinghua.edu.cn.

License:BSD 3-Clause "New" or "Revised" License

