FlowSeq: a Generative Flow-based Sequence-to-Sequence Toolkit.

This is the PyTorch implementation of FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow, accepted by EMNLP 2019.

We propose an efficient and effective model for non-autoregressive sequence generation using latent variable models. We model the complex distributions with generative flows, and design several layers of flow tailored for modeling the conditional density of sequential latent variables. On several machine translation benchmark datasets (wmt14-ende, wmt16-enro), we achieve performance comparable to state-of-the-art non-autoregressive NMT models, with decoding time that is almost constant w.r.t. the sequence length.

Requirements

Python version >= 3.6
PyTorch version >= 1.1
apex
Perl
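
The version requirements above can be checked with a short script. This is a minimal sketch, not part of the toolkit: the helper names version_tuple and meets_minimum are hypothetical, and the PyTorch check only runs if torch is already installed.

```python
import re
import sys


def version_tuple(version):
    """Turn a version string such as '1.13.1+cu117' into a tuple of ints, e.g. (1, 13, 1)."""
    parts = []
    # Drop any local build suffix after '+', then read the leading digits of each piece.
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(version, minimum):
    """Return True if the version string is at least the given minimum tuple."""
    return version_tuple(version) >= minimum


# Python >= 3.6 is listed in the requirements.
print("Python >= 3.6:", sys.version_info[:2] >= (3, 6))

try:
    import torch  # only checked when it is installed

    # PyTorch >= 1.1 is listed in the requirements.
    print("PyTorch >= 1.1:", meets_minimum(torch.__version__, (1, 1)))
except ImportError:
    print("PyTorch is not installed")
```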

Installation

Install NVIDIA-apex.
Install PyTorch and torchvision.

Data

WMT'14 English to German (EN-DE) can be obtained with scripts provided in fairseq.
WMT'16 English to Romanian (EN-RO) can be obtained from here.

Training a new model

The MT datasets should be named train.{language code}, dev.{language code}, and test.{language code}, e.g., "train.de". Assuming the WMT14-ENDE datasets are placed under data/wmt14-ende/real-bpe/, FlowSeq can be trained on this data on one node with the following script:

cd experiments

python -u distributed.py  \
    --nnodes 1 --node_rank 0 --nproc_per_node <num of gpus per node> --master_addr <address of master node> \
    --master_port <port ID> \
    --config configs/wmt14/config-transformer-base.json --model_path <path to the saved model> \
    --data_path data/wmt14-ende/real-bpe/ \
    --batch_size 2048 --batch_steps 1 --init_batch_size 512 --eval_batch_size 32 \
    --src en --tgt de \
    --lr 0.0005 --beta1 0.9 --beta2 0.999 --eps 1e-8 --grad_clip 1.0 --amsgrad \
    --lr_decay 'expo' --weight_decay 0.001 \
    --init_steps 30000 --kl_warmup_steps 10000 \
    --subword 'joint-bpe' --bucket_batch 1 --create_vocab
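
Before launching a training run, it can help to confirm that the data directory follows the train/dev/test naming convention described above. The following is a minimal sketch under that assumption; missing_data_files is a hypothetical helper, not a function provided by FlowSeq, and the demonstration uses a temporary directory rather than real data.

```python
import os
import tempfile


def missing_data_files(data_path, src, tgt, splits=("train", "dev", "test")):
    """List the expected <split>.<language code> files that are absent from data_path."""
    expected = [f"{split}.{lang}" for split in splits for lang in (src, tgt)]
    return [name for name in expected
            if not os.path.isfile(os.path.join(data_path, name))]


# Demonstration on a throwaway directory that deliberately lacks test.de.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("train.en", "train.de", "dev.en", "dev.de", "test.en"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_missing = missing_data_files(demo_dir, "en", "de")

print(demo_missing)  # ['test.de']
```

For the setup in this README, the call would be missing_data_files("data/wmt14-ende/real-bpe/", "en", "de"); an empty list means all six files are present.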

After training, under <path to the saved model> there will be saved checkpoints, model.pt, config.json, log.txt, a vocab directory, and intermediate translation results under the translations directory.

Note:

The argument --batch_steps accumulates gradients to trade speed for memory. The size of each segment of a data batch is batch_size / (num_gpus * batch_steps).
To train FlowSeq on multiple nodes, we provide a script for the Slurm cluster environment, /experiments/slurm.py; alternatively, please refer to the PyTorch distributed parallel training tutorial.
To create a distillation dataset, please use fairseq to train a Transformer model and translate the source dataset.
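
The segment-size formula given for --batch_steps can be worked through numerically. The sketch below uses --batch_size 2048 from the training command and a hypothetical node with 8 GPUs; segment_size is an illustrative helper, and the divisibility check is an assumption of this sketch rather than documented toolkit behavior.

```python
def segment_size(batch_size, num_gpus, batch_steps):
    """Size of each data segment: batch_size / (num_gpus * batch_steps)."""
    denominator = num_gpus * batch_steps
    # Assumption of this sketch: require an even split so every segment has the same size.
    if batch_size % denominator != 0:
        raise ValueError("batch_size is not evenly divisible by num_gpus * batch_steps")
    return batch_size // denominator


# More accumulation steps give smaller segments (less memory per step, more steps per update).
for steps in (1, 2, 4):
    print(steps, segment_size(2048, num_gpus=8, batch_steps=steps))
# 1 256
# 2 128
# 4 64
```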

Translation and evaluation

cd experiments

python -u translate.py \
    --model_path <path to the saved model> \
    --data_path data/wmt14-ende/real-bpe/ \
    --batch_size 32 --bucket_batch 1 \
    --decode {'argmax', 'iw', 'sample'} \
    --tau 0.0 --nlen 3 --ntr 1

Please check the details of the arguments here.

To keep the output translations in the original order of the input test data, use --bucket_batch 0.
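
To illustrate why bucketing can change the output order, here is a minimal sketch that assumes bucketing groups sentences of similar length into the same batch. This is an assumption about the general technique, not a description of how FlowSeq implements --bucket_batch; bucket_by_length and restore_order are hypothetical helpers, and upper-casing stands in for translation.

```python
def bucket_by_length(sentences, batch_size):
    """Group sentences of similar length; yields (original_indices, batch) pairs."""
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i].split()))
    for start in range(0, len(order), batch_size):
        indices = order[start:start + batch_size]
        yield indices, [sentences[i] for i in indices]


def restore_order(bucketed_outputs, total):
    """Put per-batch outputs back into the original input order."""
    restored = [None] * total
    for indices, outputs in bucketed_outputs:
        for index, output in zip(indices, outputs):
            restored[index] = output
    return restored


sentences = ["a b c d", "a", "a b c", "a b"]
# Stand-in for translation: upper-case each sentence (for illustration only).
translated = [(indices, [s.upper() for s in batch])
              for indices, batch in bucket_by_length(sentences, batch_size=2)]

print([out for _, outs in translated for out in outs])
# ['A', 'A B', 'A B C', 'A B C D']  (bucketed order)
print(restore_order(translated, len(sentences)))
# ['A B C D', 'A', 'A B C', 'A B']  (original input order)
```

Tracking the original indices alongside each batch is one way to recover the input order when bucketing is enabled; using --bucket_batch 0 avoids the need for it.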

References

@inproceedings{flowseq2019,
    title = {FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow},
    author = {Ma, Xuezhe and Zhou, Chunting and Li, Xian and Neubig, Graham and Hovy, Eduard},
    booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
    address = {Hong Kong},
    month = {November},
    year = {2019}
}
