This is the PyTorch implementation of FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow, accepted at EMNLP 2019.
We propose an efficient and effective model for non-autoregressive sequence generation using latent variable models. We model the complex distributions with generative flows, and design several layers of flow tailored for modeling the conditional density of sequential latent variables. On several machine translation benchmark datasets (wmt14-ende, wmt16-enro), we achieve performance comparable to state-of-the-art non-autoregressive NMT models, with decoding time that is almost constant with respect to the sequence length.
- Python version >= 3.6
- PyTorch version >= 1.1
- apex
- Perl
- Install NVIDIA-apex.
- Install PyTorch and torchvision.
- WMT'14 English to German (EN-DE) can be obtained with scripts provided in fairseq.
- WMT'16 English to Romanian (EN-RO) can be obtained from here.
The MT datasets should be named train.{language code}, dev.{language code}, and test.{language code}, e.g. "train.de".
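As a quick sanity check before training, the naming convention above can be verified with a short script. This is a minimal sketch, not part of FlowSeq: the helper name `missing_split_files` is hypothetical, and it assumes that the source and target language files for every split sit directly in the data directory.

```python
from pathlib import Path

# The three splits named in the convention above.
SPLITS = ("train", "dev", "test")


def missing_split_files(data_dir, src, tgt):
    """Return the expected files (e.g. train.en, train.de) that are absent from data_dir.

    Hypothetical helper for illustration; FlowSeq itself does not provide it.
    """
    root = Path(data_dir)
    # Assumption: one file per split and per language, all in the same directory.
    expected = [f"{split}.{lang}" for split in SPLITS for lang in (src, tgt)]
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_split_files("data/wmt14-ende/real-bpe/", "en", "de")
    print("all files present" if not missing else f"missing: {missing}")
```

An empty list means every split has both language files in place.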
Suppose we put the WMT14-ENDE data sets under data/wmt14-ende/real-bpe/. We can then train FlowSeq on this data on one node with the following script:
cd experiments
python -u distributed.py \
--nnodes 1 --node_rank 0 --nproc_per_node <num of gpus per node> --master_addr <address of master node> \
--master_port <port ID> \
--config configs/wmt14/config-transformer-base.json --model_path <path to the saved model> \
--data_path data/wmt14-ende/real-bpe/ \
--batch_size 2048 --batch_steps 1 --init_batch_size 512 --eval_batch_size 32 \
--src en --tgt de \
--lr 0.0005 --beta1 0.9 --beta2 0.999 --eps 1e-8 --grad_clip 1.0 --amsgrad \
--lr_decay 'expo' --weight_decay 0.001 \
--init_steps 30000 --kl_warmup_steps 10000 \
--subword 'joint-bpe' --bucket_batch 1 --create_vocab
After training, the directory given by --model_path will contain the saved checkpoints, model.pt, config.json, log.txt, a vocab directory, and intermediate translation results under the translations directory.
- The argument --batch_steps enables gradient accumulation, trading speed for memory. Each segment of a data batch has size batch_size / (num_gpus * batch_steps).
- To train FlowSeq on multiple nodes, use the script we provide for the Slurm cluster environment, /experiments/slurm.py, or refer to the PyTorch distributed parallel training tutorial.
- To create a distillation dataset, use fairseq to train a Transformer model and translate the source data set.
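The segment size from the note on --batch_steps can be worked out with a small helper. This is an illustrative sketch of the stated formula only: the function name `segment_size` is hypothetical, the use of integer division is an assumption, and the GPU count of 8 is an example value rather than a recommendation.

```python
def segment_size(batch_size, num_gpus, batch_steps):
    """Segment size per GPU and per accumulation step: batch_size / (num_gpus * batch_steps)."""
    if num_gpus < 1 or batch_steps < 1:
        raise ValueError("num_gpus and batch_steps must be positive")
    # Assumption: sizes are whole numbers, so integer division is used here.
    return batch_size // (num_gpus * batch_steps)


# With --batch_size 2048 on a hypothetical 8-GPU node:
print(segment_size(2048, 8, 1))  # 256
print(segment_size(2048, 8, 2))  # 128: smaller segments, two accumulation steps per update
```

Raising --batch_steps shrinks each segment, which lowers peak memory at the cost of more forward and backward passes per update.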
cd experiments
python -u translate.py \
--model_path <path to the saved model> \
--data_path data/wmt14-ende/real-bpe/ \
--batch_size 32 --bucket_batch 1 \
--decode {'argmax', 'iw', 'sample'} \
--tau 0.0 --nlen 3 --ntr 1
Please check details of arguments here.
To keep the output translations in the original order of the input test data, use --bucket_batch 0.
@inproceedings{flowseq2019,
title = {FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow},
author = {Ma, Xuezhe and Zhou, Chunting and Li, Xian and Neubig, Graham and Hovy, Eduard},
booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
address = {Hong Kong},
month = {November},
year = {2019}
}