TFCudnnLSTM

A simple template for TensorFlow's highly efficient CudnnLSTM module

Dependencies

  • TensorFlow v1.8+
  • CUDA v9.0+
  • cuDNN v7.0+
  • scikit-learn
  • tqdm

How to check your CUDA and cuDNN versions
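As a quick sanity check from within Python, the sketch below (assuming a TF 1.x installation) prints the TensorFlow version and whether it was built with, and can see, CUDA. The CUDA and cuDNN versions themselves are easiest to read off outside Python, e.g. via `nvcc --version` and the `CUDNN_MAJOR`/`CUDNN_MINOR` macros in `cudnn.h`.

```python
import tensorflow as tf

# Basic environment check for a TF 1.x + CUDA setup.
print("TensorFlow:", tf.__version__)                           # should be 1.8+
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPU available:", tf.test.is_gpu_available(cuda_only=True))
```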

Computational Performance

TensorFlow's performance guide includes a section on RNN performance, which states:

On NVIDIA GPUs, the use of tf.contrib.cudnn_rnn should always be preferred unless you want layer normalization, which it doesn't support.

According to this benchmark result by RETURNN, CudnnLSTM achieves significant speedups compared to TensorFlow's other LSTM implementations (~2x faster than LSTMBlockFused and ~5x faster than BasicLSTM).
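For context, here is a minimal sketch of what using `tf.contrib.cudnn_rnn.CudnnLSTM` looks like as of TF v1.8. The sizes are made up for illustration, and the time-major input layout and `(h, c)` return structure reflect our reading of the contrib API, so double-check them against your TF version.

```python
import tensorflow as tf

# Hypothetical sizes, for illustration only.
time_steps, batch_size, input_size = 35, 20, 650
num_layers, num_units = 2, 650

# CudnnLSTM expects time-major inputs: [time, batch, input_size].
inputs = tf.placeholder(tf.float32, [time_steps, batch_size, input_size])

# A single fused op runs all layers and time steps inside cuDNN.
lstm = tf.contrib.cudnn_rnn.CudnnLSTM(num_layers=num_layers,
                                      num_units=num_units,
                                      dropout=0.0)

# outputs: [time, batch, num_units]; (h, c) is the final state.
outputs, (h, c) = lstm(inputs, training=True)
```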

Language Modeling Experiments

We also took the tutorial code for PTB language modeling and ran the three LSTM implementations it provides: BasicLSTMCell, LSTMBlockCell, and CudnnLSTM. The CudnnLSTM example does not run under TF v1.8 due to API changes, but after fixing minor issues we were able to run it on a single GPU. The benchmark results for the "large" model are as follows:

| Module        | Average wps* | Speedup w.r.t. BasicLSTMCell |
|---------------|--------------|------------------------------|
| BasicLSTMCell | 15k          | 1x                           |
| LSTMBlockCell | 17k          | 1.1x                         |
| CudnnLSTM     | 32k          | 2.1x                         |

*wps refers to the number of processed words per second.

In all three cases, we used a single NVIDIA Tesla P40 GPU, which was 80-85% utilized (with 100% memory usage) during training. The tutorial code only supports multi-GPU training with BasicLSTMCell; using two P40 GPUs, we got approximately 25k wps (a 1.7x speedup over single-GPU BasicLSTMCell, but still 22% slower than single-GPU CudnnLSTM).

Caveats

We did not test the handling of variable-length sequences per batch for CudnnLSTM, but there seem to be some issues (e.g., see #6633). Bucketing could be a useful (but not perfect) workaround for this problem.
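As an illustration of the bucketing workaround, the sketch below uses `tf.contrib.data.bucket_by_sequence_length` (available in recent TF 1.x releases; the boundaries and batch sizes here are arbitrary) to batch sequences of similar lengths together, which keeps per-batch padding small.

```python
import tensorflow as tf

# Hypothetical variable-length integer sequences.
sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10], [11]]
dataset = tf.data.Dataset.from_generator(
    lambda: iter(sequences), tf.int32, tf.TensorShape([None]))

# Group sequences of similar length, then pad within each bucket.
dataset = dataset.apply(tf.contrib.data.bucket_by_sequence_length(
    element_length_func=lambda seq: tf.shape(seq)[0],
    bucket_boundaries=[2, 4],         # buckets: [0, 2), [2, 4), [4, inf)
    bucket_batch_sizes=[2, 2, 2]))    # one batch size per bucket
```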

CudnnLSTM does not support layer normalization, because cuDNN itself does not support it.

Comparisons with PyTorch and Keras

PyTorch's built-in nn.LSTM module already supports cuDNN integration out of the box (!). Notably, nn.LSTM is a fully supported core module, not a sparsely documented contrib module.
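For comparison, a minimal PyTorch sketch (sizes are again illustrative) looks like this; on a CUDA device, `nn.LSTM` dispatches to cuDNN automatically, with no special module required:

```python
import torch
import torch.nn as nn

# nn.LSTM uses the cuDNN kernels whenever it runs on a CUDA device.
lstm = nn.LSTM(input_size=650, hidden_size=650, num_layers=2).cuda()

# Inputs are time-major by default: [seq_len, batch, input_size].
x = torch.randn(35, 20, 650).cuda()
output, (h_n, c_n) = lstm(x)
```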

While we leave a rigorous comparison between PyTorch's nn.LSTM and TensorFlow's cudnn_rnn.CudnnLSTM as future work, PyTorch's version appears to be as efficient as, and more stable than, TensorFlow's counterpart. When we ran PyTorch's own LSTM language modeling example with nearly the same parameters (2 layers, hidden size 1.5k, vocab size 35k, batch size 20, and 35 timesteps), we got around 100 milliseconds per batch on a single P40 GPU (95+% utilization). With the aforementioned TensorFlow tutorial code, we got around 120 milliseconds per batch on the same machine (95+% utilization).

So, if you're already a PyTorch user and your system is built on PyTorch, there's little reason to switch to using TF's CudnnLSTM for performance, at least for now.

Keras also has a similar module that was introduced last year. We did not test it, but it appears to be well documented.
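For completeness, here is a rough sketch of the Keras module (CuDNNLSTM, if we recall correctly) in use. We did not run this, so treat the layer name, import path, and shapes as assumptions to verify against the Keras documentation.

```python
from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dense

# CuDNNLSTM is a GPU-only, cuDNN-backed drop-in for the LSTM layer.
model = Sequential()
model.add(CuDNNLSTM(256, input_shape=(35, 650)))  # (timesteps, features)
model.add(Dense(10000, activation='softmax'))     # e.g., vocabulary logits
model.compile(optimizer='adam', loss='categorical_crossentropy')
```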

Authors

YJ Choe


License

GNU General Public License v3.0

