Wen Zhang, Yang Feng, Fandong Meng, Di You and Qun Liu. Bridging the Gap between Training and Inference for Neural Machine Translation. In Proceedings of ACL, 2019. [paper][code]
The code in the two directories implements the OR-NMT systems based on the RNNsearch and Transformer models, respectively:
- OR-RNNsearch: based on the RNNsearch system, which we implemented from scratch
- OR-Transformer: based on the Transformer implementation in fairseq, developed by Facebook
This system has been tested in the following environment.
- OS: Ubuntu 16.04.1 LTS 64 bits
- Python version >= 3.6
- PyTorch version >= 1.2
For OR-Transformer:
First, go into the OR-Transformer directory.
Then the training script is the same as for fairseq, except for the following arguments:
- Add `--use-word-level-oracles` to train the Transformer with word-level oracles.
- Add `--use-sentence-level-oracles` to train the Transformer with sentence-level oracles.
- By default, the probability of sampling golden words is decayed based on the update index; add `--use-epoch-numbers-decay` to decay it based on the epoch index instead.
- The hyperparameter `--decay-k` controls the speed of the inverse sigmoid decay in Eq.(15) of the paper (a sketch of the decay function follows the note below):
  - set it to roughly 8~15 when decaying based on the epoch index
  - set it to roughly 3000~8000 when decaying based on the update index
  - the larger the value, the slower the decay, and vice versa

NOTE: For a new data set, the hyperparameter `--decay-k` needs to be adjusted manually according to the maximum number of training updates (the default) or epochs (with `--use-epoch-numbers-decay`), so that the probability of sampling golden words does not decay too quickly.
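For intuition, here is a minimal sketch of the inverse sigmoid decay, assuming Eq.(15) has the form p = k / (k + exp(i / k)) with i the update or epoch index and k the value of `--decay-k`; it only illustrates how `--decay-k` controls the decay speed and is not the repository's actual implementation.

```python
import math

def golden_sampling_prob(i: int, decay_k: float) -> float:
    """Inverse sigmoid decay of the probability of sampling golden (ground-truth) words.

    i       : epoch index (--use-epoch-numbers-decay) or update index (the default)
    decay_k : the --decay-k hyperparameter; larger values give a slower decay
    """
    return decay_k / (decay_k + math.exp(i / decay_k))

# Epoch-based decay with --decay-k 10 vs. update-based decay with --decay-k 5800
print([round(golden_sampling_prob(e, 10.0), 3) for e in range(0, 31, 5)])
print([round(golden_sampling_prob(u, 5800.0), 3) for u in range(0, 40001, 10000)])
```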
For Eq.(11~13) in the paper, the argmax of the Gumbel-perturbed logits is the same as the argmax of the resulting normalized distribution, so the softmax (and temperature) operation is not needed in the code implementation.
Gumbel noise:
- Add `--use-greed-gumbel-noise` to sample the word-level oracle with Gumbel noise (see the sketch after this list).
- Add `--use-bleu-gumbel-noise` to sample the sentence-level oracle with Gumbel noise.
- `--gumbel-noise` is the hyperparameter used in the calculation of the Gumbel noise.
- `--oracle-search-beam-size` sets the beam size used in the length-constrained decoding (for generating sentence-level oracles).
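Below is a minimal sketch of how a word-level oracle can be sampled with Gumbel noise under the Gumbel-Max formulation of Eq.(11~13); the tensor shapes and the way the scale maps onto `--gumbel-noise` are assumptions for illustration, not the repository's exact code.

```python
import torch

def sample_word_level_oracle(logits: torch.Tensor, gumbel_noise: float = 0.5) -> torch.Tensor:
    """Pick oracle words from the previous step's logits perturbed by Gumbel noise.

    logits       : (batch, vocab) unnormalized scores
    gumbel_noise : illustrative stand-in for the --gumbel-noise hyperparameter
    """
    u = torch.rand_like(logits)                       # Uniform(0, 1)
    eta = -torch.log(-torch.log(u + 1e-10) + 1e-10)   # Gumbel(0, 1) noise, as in Eq.(11)
    # argmax is unaffected by the softmax and temperature of Eq.(12~13),
    # so the perturbed logits can be used directly.
    return torch.argmax(logits + gumbel_noise * eta, dim=-1)

oracle_words = sample_word_level_oracle(torch.randn(4, 32000), gumbel_noise=0.5)
print(oracle_words.shape)  # torch.Size([4])
```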
For the `--arch` and `--criterion` arguments, the prefix `oracle_` should be used for OR-NMT training, e.g.:
- `--arch transformer_vaswani_wmt_en_de_big` -> `--arch oracle_transformer_vaswani_wmt_en_de_big`
- `--criterion label_smoothed_cross_entropy` -> `--criterion oracle_label_smoothed_cross_entropy`
Example script for word-level training, with the sampling probability decayed based on the epoch index:
export CUDA_VISIBLE_DEVICES=0,1,2,3
batch_size=4096
accum=2
data_dir=directory_of_data_bin
model_dir=./ckpt
python train.py $data_dir \
--arch oracle_transformer_vaswani_wmt_en_de_big --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr-scheduler inverse_sqrt \
--warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.0005 --min-lr 1e-09 \
--weight-decay 0.0 --criterion oracle_label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens $batch_size --update-freq $accum --no-progress-bar --log-format json --max-update 200000 \
--log-interval 10 --save-interval-updates 10000 --keep-interval-updates 10 --save-interval 10000 \
--seed 1111 --skip-invalid-size-inputs-valid-test \
--distributed-port 28888 --distributed-world-size 4 --ddp-backend=no_c10d \
--source-lang en --target-lang de --save-dir $model_dir \
--use-word-level-oracles --use-epoch-numbers-decay --decay-k 10 \
--use-greed-gumbel-noise --gumbel-noise 0.5 | tee -a $model_dir/training.log
Models | Translation Task | #GPUs | #Toks. | #Freq. | Max |
---|---|---|---|---|---|
Transformer-big | Zh->En | 8 | 4096 | 3 | 30 epochs |
+Word-level Oracle | Zh->En | 8 | 4096 | 3 | 30 epochs |
Transformer-base | En->De | 8 | 6144 | 2 | 80000 updates (62 epochs) |
+Word-level Oracle | En->De | 8 | 12288 | 1 | 80000 updates (62 epochs) |
+Sentence-level Oracle | En->De | 8 | 12288 | 1 | 40000 updates (62nd epoch -> 93rd epoch) |
#Toks. is the batch size in tokens on a single GPU (`--max-tokens`).
#Freq. is the number of gradient accumulation steps (`--update-freq`).
Max is the maximum amount of training, measured in epochs (30) or updates (80k).
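As a rough worked example of how these columns combine (ignoring padding and skipped batches), the number of tokens per update is approximately #Toks. × #GPUs × #Freq.; the settings above keep this roughly constant:

```python
# Approximate tokens per update = #Toks. (per GPU) x #GPUs x #Freq. (gradient accumulation)
settings = {
    "Transformer-big (Zh->En)":     4096 * 8 * 3,
    "Transformer-base (En->De)":    6144 * 8 * 2,
    "+Word-level Oracle (En->De)": 12288 * 8 * 1,
}
for name, tokens in settings.items():
    print(f"{name}: ~{tokens} tokens per update")  # all ~98k tokens
```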
We compute the case-insensitive 4-gram tokenized BLEU with the multibleu.perl script.
We also evaluate with the case-insensitive 4-gram detokenized BLEU using SacreBLEU, computed by the score.py script provided by fairseq: BLEU+case.mixed+lang.en-{de,fr}+numrefs.4+smooth.exp+tok.13a+version.1.4.4
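If you would rather call SacreBLEU from Python than through score.py, something like the following sketch should give a comparable score; the file names are placeholders, and it assumes the sacrebleu package (as in the version above) is installed.

```python
import sacrebleu

# Placeholder paths: detokenized hypotheses and references, one sentence per line
hypotheses = open("hyp.detok.txt", encoding="utf-8").read().splitlines()
references = open("ref.detok.txt", encoding="utf-8").read().splitlines()

# Defaults correspond to the signature above (13a tokenization, exp smoothing, mixed case);
# pass lowercase=True for a case-insensitive score, and additional reference lists for multi-reference sets.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)
```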
The setting for the NIST Chinese->English task:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
data_bin_dir=directory_of_data_bin
model_dir=./ckpt
python train.py $data_bin_dir \
--arch oracle_transformer_vaswani_wmt_en_de_big --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr-scheduler inverse_sqrt \
--warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.0007 --min-lr 1e-09 \
--weight-decay 0.0 --criterion oracle_label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 4096 --update-freq 3 --no-progress-bar --log-format json --max-epoch 30 \
--log-interval 10 --save-interval 2 --keep-last-epochs 10 \
--seed 1111 --use-epoch-numbers-decay \
--use-word-level-oracles --decay-k 15 --use-greed-gumbel-noise --gumbel-noise 0.5 \
--distributed-port 32222 --distributed-world-size 8 --ddp-backend=no_c10d \
--source-lang zh --target-lang en --save-dir $model_dir | tee -a $model_dir/training.log
Following Eq.(15) in the paper, the probability of sampling golden words decays with the epoch index (here with `--decay-k 15`).
We compute the case-sensitive 4-gram tokenized BLEU with the multibleu.perl script.
Models | newstest2014 | #update |
---|---|---|
Transformer-base | 27.54 | 80000 |
+Word-level Oracle (decay_k=50, gumbel_noise=0.8) | 28.01 | 80000 |
+Sentence-level Oracle (decay_k=5800, gumbel_noise=0.5, beam_size=4) | 28.45 | 40000 |
We also evaluate with the case-sensitive 4-gram detokenized BLEU using SacreBLEU, computed by the score.py script provided by fairseq: BLEU+case.mixed+lang.en-{de,fr}+numrefs.1+smooth.exp+tok.13a+version.1.4.4
Models | newstest2014 | #update |
---|---|---|
Transformer-base | 26.45 | 80000 |
+Word-level Oracle (decay_k=50, gumbel_noise=0.8) | 26.86 | 80000 |
+Sentence-level Oracle (decay_k=5800, gumbel_noise=0.5, beam_size=4) | 27.24 | 40000 |
Setting of the word-level oracle for the WMT'14 English->German dataset:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
data_bin_dir=directory_of_data_bin
model_dir=./ckpt
python train.py $data_bin_dir \
--arch oracle_transformer_wmt_en_de --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr-scheduler inverse_sqrt \
--warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.0007 --min-lr 1e-09 \
--weight-decay 0.0 --criterion oracle_label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 12288 --update-freq 1 --no-progress-bar --log-format json --max-update 80000 \
--log-interval 10 --save-interval-updates 4000 --keep-interval-updates 10 --save-interval 10000 \
--seed 1111 --use-epoch-numbers-decay \
--use-word-level-oracles --decay-k 50 --use-greed-gumbel-noise --gumbel-noise 0.8 \
--distributed-port 31111 --distributed-world-size 8 --ddp-backend=no_c10d \
--source-lang en --target-lang de --save-dir $model_dir | tee -a $model_dir/training.log
Following Eq.(15) in the paper, the probability of sampling golden words decays with the epoch index (here with `--decay-k 50`).
To save training time, we use the sentence-level oracle method to fine-tune the best base model.
Setting of the sentence-level oracle for the WMT'14 English->German dataset:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
data_bin_dir=directory_of_data_bin
model_dir=./ckpt
python train.py $data_bin_dir \
--arch oracle_transformer_wmt_en_de --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr-scheduler inverse_sqrt \
--warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.0007 --min-lr 1e-09 \
--weight-decay 0.0 --criterion oracle_label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 12288 --update-freq 1 --no-progress-bar --log-format json --max-update 40000 \
--log-interval 10 --save-interval-updates 2000 --keep-interval-updates 10 --save-interval 10000 \
--seed 1111 --reset-optimizer --reset-meters \
--use-sentence-level-oracles --decay-k 5800 --use-bleu-gumbel-noise --gumbel-noise 0.5 --oracle-search-beam-size 4 \
--distributed-port 31111 --distributed-world-size 8 --ddp-backend=no_c10d \
--source-lang en --target-lang de --save-dir $model_dir | tee -a $model_dir/training.log
Following Eq.(15) in the paper, the probability of sampling golden words decays with the update index (here with `--decay-k 5800`).
- The speed of word-level training is almost the same as that of the original Transformer.
- Sentence-level training is slower than word-level training.
- `--use-epoch-numbers-decay` and `--decay-k` need to be adapted to different training data.
- The `prob` field in the training log is the current (decayed) probability of sampling golden words.
Training speed and GPU memory usage measured on the IWSLT De->En training set:
Model Name | Memory Usage (GB) | Training Speed (upd/s) |
---|---|---|
Transformer | 4.39 | 2.65 |
Word-level training | 4.57 | 2.25 |
Sentence-level training (decay_prob=1, beam_size=4) | 4.75 | 0.59 |
Please cite as:
@inproceedings{zhang2019bridging,
title = "Bridging the Gap between Training and Inference for Neural Machine Translation",
author = "Zhang, Wen and Feng, Yang and Meng, Fandong and You, Di and Liu, Qun",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/P19-1426",
doi = "10.18653/v1/P19-1426",
pages = "4334--4343",
}
Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.
- March 2020: Byte-level BPE code released
- February 2020: mBART model and code released
- February 2020: Added tutorial for back-translation
- December 2019: fairseq 0.9.0 released
- November 2019: VizSeq released (a visual analysis toolkit for evaluating fairseq models)
- November 2019: CamemBERT model and code released
- November 2019: BART model and code released
- November 2019: XLM-R models and code released
- September 2019: Nonautoregressive translation code released
- August 2019: WMT'19 models released
- July 2019: fairseq relicensed under MIT license
- July 2019: RoBERTa models and code released
- June 2019: wav2vec models and code released
Fairseq provides reference implementations of various sequence-to-sequence models, including:
- Convolutional Neural Networks (CNN)
- Language Modeling with Gated Convolutional Networks (Dauphin et al., 2017)
- Convolutional Sequence to Sequence Learning (Gehring et al., 2017)
- Classical Structured Prediction Losses for Sequence to Sequence Learning (Edunov et al., 2018)
- Hierarchical Neural Story Generation (Fan et al., 2018)
- wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019)
- LightConv and DynamicConv models
- Long Short-Term Memory (LSTM) networks
- Effective Approaches to Attention-based Neural Machine Translation (Luong et al., 2015)
- Transformer (self-attention) networks
- Attention Is All You Need (Vaswani et al., 2017)
- Scaling Neural Machine Translation (Ott et al., 2018)
- Understanding Back-Translation at Scale (Edunov et al., 2018)
- Adaptive Input Representations for Neural Language Modeling (Baevski and Auli, 2018)
- Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)
- RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)
- Facebook FAIR's WMT19 News Translation Task Submission (Ng et al., 2019)
- Jointly Learning to Align and Translate with Transformer Models (Garg et al., 2019)
- Multilingual Denoising Pre-training for Neural Machine Translation (Liu et al., 2020)
- Neural Machine Translation with Byte-Level Subwords (Wang et al., 2020)
- Non-autoregressive Transformers
- Non-Autoregressive Neural Machine Translation (Gu et al., 2017)
- Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement (Lee et al. 2018)
- Insertion Transformer: Flexible Sequence Generation via Insertion Operations (Stern et al. 2019)
- Mask-Predict: Parallel Decoding of Conditional Masked Language Models (Ghazvininejad et al., 2019)
- Levenshtein Transformer (Gu et al., 2019)
Additionally:
- multi-GPU (distributed) training on one machine or across multiple machines
- fast generation on both CPU and GPU with multiple search algorithms implemented:
- beam search
- Diverse Beam Search (Vijayakumar et al., 2016)
- sampling (unconstrained, top-k and top-p/nucleus)
- large mini-batch training even on a single GPU via delayed updates
- mixed precision training (trains faster with less GPU memory on NVIDIA tensor cores)
- extensible: easily register new models, criterions, tasks, optimizers and learning rate schedulers
We also provide pre-trained models for translation and language modeling with a convenient `torch.hub` interface:
import torch

en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de.single_model')
en2de.translate('Hello world', beam=5)
# 'Hallo Welt'
See the PyTorch Hub tutorials for translation and RoBERTa for more examples.
- PyTorch version >= 1.4.0
- Python version >= 3.6
- For training new models, you'll also need an NVIDIA GPU and NCCL
- For faster training install NVIDIA's apex library with the `--cuda_ext` and `--deprecated_fused_adam` options
To install fairseq:
pip install fairseq
On MacOS:
CFLAGS="-stdlib=libc++" pip install fairseq
If you use Docker make sure to increase the shared memory size either with `--ipc=host` or `--shm-size` as command line options to `nvidia-docker run`.
Installing from source
To install fairseq from source and develop locally:
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable .
The full documentation contains instructions for getting started, training new models and extending fairseq with new model types and tasks.
We provide pre-trained models and pre-processed, binarized test sets for several tasks listed below, as well as example training and evaluation commands.
- Translation: convolutional and transformer models are available
- Language Modeling: convolutional and transformer models are available
- wav2vec: wav2vec large model is available
We also have more detailed READMEs to reproduce results from specific papers:
- Neural Machine Translation with Byte-Level Subwords (Wang et al., 2020)
- Jointly Learning to Align and Translate with Transformer Models (Garg et al., 2019)
- Levenshtein Transformer (Gu et al., 2019)
- Facebook FAIR's WMT19 News Translation Task Submission (Ng et al., 2019)
- RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)
- wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019)
- Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)
- Pay Less Attention with Lightweight and Dynamic Convolutions (Wu et al., 2019)
- Understanding Back-Translation at Scale (Edunov et al., 2018)
- Classical Structured Prediction Losses for Sequence to Sequence Learning (Edunov et al., 2018)
- Hierarchical Neural Story Generation (Fan et al., 2018)
- Scaling Neural Machine Translation (Ott et al., 2018)
- Convolutional Sequence to Sequence Learning (Gehring et al., 2017)
- Language Modeling with Gated Convolutional Networks (Dauphin et al., 2017)
- Facebook page: https://www.facebook.com/groups/fairseq.users
- Google group: https://groups.google.com/forum/#!forum/fairseq-users
fairseq(-py) is MIT-licensed. The license applies to the pre-trained models as well.
Please cite as:
@inproceedings{ott2019fairseq,
title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
year = {2019},
}