seq2seq

Attention-based sequence to sequence learning

Dependencies

TensorFlow 1.2+ for Python 3
YAML and Matplotlib modules for Python 3: sudo apt-get install python3-yaml python3-matplotlib
A recent NVIDIA GPU

How to use

Train a model (CONFIG is a YAML configuration file, such as config/default.yaml):

./seq2seq.sh CONFIG --train -v

Translate text using an existing model:

./seq2seq.sh CONFIG --decode FILE_TO_TRANSLATE --output OUTPUT_FILE

or for interactive decoding:

./seq2seq.sh CONFIG --decode

Example English→French model

This is the same model and dataset as Bahdanau et al. 2015.

config/WMT14/download.sh    # download WMT14 data into raw_data/WMT14
config/WMT14/prepare.sh     # preprocess the data, and copy the files to data/WMT14
./seq2seq.sh config/WMT14/baseline.yaml --train -v   # train a baseline model on this data

You should get similar BLEU scores as these (our model was trained on a single Titan X I for about 4 days).

Dev	Test	+beam	Steps	Time
25.04	28.64	29.22	240k	60h
25.25	28.67	29.28	330k	80h

Download this model here. To use this model, just extract the archive into the seq2seq/models folder, and run:

 ./seq2seq.sh models/WMT14/config.yaml --decode -v

Example German→English model

This is the same dataset as Ranzato et al. 2015.

config/IWSLT14/prepare.sh
./seq2seq.sh config/IWSLT14/baseline.yaml --train -v

Dev	Test	+beam	Steps
28.32	25.33	26.74	44k

The model is available for download here.

Audio pre-processing

If you want to use the toolkit for Automatic Speech Recognition (ASR) or Automatic Speech Translation (AST), then you'll need to pre-process your audio files accordingly. This README details how it can be done. You'll need to install the Yaafe library, and use scripts/speech/extract-audio-features.py to extract MFCCs from a set of wav files.

Features

YAML configuration files
Beam-search decoder
Ensemble decoding
Multiple encoders
Hierarchical encoder
Bidirectional encoder
Local attention model
Convolutional attention model
Detailed logging
Periodic BLEU evaluation
Periodic checkpoints
Multi-task training: train on several tasks at once (e.g. French->English and German->English MT)
Subwords training and decoding
Input binary features instead of text
Pre-processing script: we provide a fully-featured Python script for data pre-processing (vocabulary creation, lowercasing, tokenizing, splitting, etc.)
Dynamic RNNs: we use symbolic loops instead of statically unrolled RNNs. This means that we don't mean to manually configure bucket sizes, and that model creation is much faster.

Credits

This project is based on TensorFlow's reference implementation
We include some of the pre-processing scripts from Moses
The scripts for subword units come from github.com/rsennrich/subword-nmt

alex-berard / seq2seq