μ-cuDNN

Accelerating DNN Convolutional Layers with Micro-batches

μ-cuDNN is a transparent wrapper for the NVIDIA cuDNN library that splits a mini-batch into multiple micro-batches, so that faster convolution algorithms can be used within a given workspace limit. μ-cuDNN is intended to be combined with deep learning frameworks written in C++, such as Caffe and TensorFlow.
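The core idea can be sketched in a few lines of plain cuDNN code. The sketch below is illustrative only, not μ-cuDNN's internals: the function name, the sample-stride parameters, and the assumption that the mini-batch size is a multiple of the micro-batch size are all hypothetical. It shows a mini-batch convolution executed as several micro-batch calls that share one workspace, so each call can use an algorithm that would not fit the workspace limit at the full batch size.

#include <cudnn.h>

// Illustrative sketch: forward convolution over a mini-batch, executed as
// several micro-batch cuDNN calls that reuse the same workspace.
// Assumes miniBatchSize is a multiple of microBatchSize.
void convolutionForwardMicroBatched(
    cudnnHandle_t handle, int miniBatchSize, int microBatchSize,
    const cudnnTensorDescriptor_t xDesc,   // describes a microBatchSize-sized input tensor
    const float *x, size_t xSampleStride,  // elements per input sample (hypothetical parameter)
    const cudnnFilterDescriptor_t wDesc, const float *w,
    const cudnnConvolutionDescriptor_t convDesc,
    cudnnConvolutionFwdAlgo_t algo, void *workspace, size_t workspaceSize,
    const cudnnTensorDescriptor_t yDesc,   // describes a microBatchSize-sized output tensor
    float *y, size_t ySampleStride) {      // elements per output sample (hypothetical parameter)
  const float alpha = 1.0f, beta = 0.0f;
  for (int i = 0; i < miniBatchSize; i += microBatchSize) {
    // Each call convolves only microBatchSize samples, so `algo` can be one
    // that fits the workspace limit at this smaller batch size.
    cudnnConvolutionForward(handle, &alpha,
                            xDesc, x + (size_t)i * xSampleStride,
                            wDesc, w, convDesc, algo,
                            workspace, workspaceSize,
                            &beta, yDesc, y + (size_t)i * ySampleStride);
  }
}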

Reference

This repository contains the code used in

Yosuke Oyama, Tal Ben-Nun, Torsten Hoefler, and Satoshi Matsuoka, μ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching, arXiv e-prints, arXiv:1804.04806, 2018. http://arxiv.org/abs/1804.04806

Please cite as:

@article{ucudnn,
  author    = {Yosuke Oyama and Tal Ben-Nun and Torsten Hoefler and Satoshi Matsuoka},
  title     = {{{\(\mu\)}-cuDNN}: Accelerating Deep Learning Frameworks with Micro-Batching},
  journal   = {CoRR},
  volume    = {abs/1804.04806},
  year      = {2018},
  url       = {http://arxiv.org/abs/1804.04806},
  archivePrefix = {arXiv},
  eprint    = {1804.04806},
}

Requirements

  • GCC >= 4.8.5 (should support -std=c++11)
  • CMake >= 3.9.2
  • CUDA >= 8
  • cuDNN >= 6
  • GLPK >= 4.63 (optional)
  • SQLite >= 3.21 (optional)

Performance

DeepBench

This figure shows the relative speedups of DeepBench's 3x3 and 5x5 convolution layers on an NVIDIA Tesla P100-SXM2 GPU, with a mini-batch size of 256 and workspace limits of 128, 256, and 512 MiB. μ-cuDNN achieves speedups of up to 2.31x for the 3x3 layers and 3.85x for the 5x5 layers.

CIFAR-10 Training

This figure shows the learning curves of a CIFAR-10 CNN defined in Caffe under three different micro-batch policies, with a mini-batch size of 1024 and a workspace limit of 64 MiB. The CNN achieves ~80% test accuracy, similar to the officially reported result.

Installation

  1. Compile μ-cuDNN with CMake:

mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX:PATH="/path/to/ucudnn"
make
make install

  2. Add μ-cuDNN to your include and library paths:

export CPLUS_INCLUDE_PATH=/path/to/ucudnn/include:$CPLUS_INCLUDE_PATH
export LD_LIBRARY_PATH=/path/to/ucudnn/lib:$LD_LIBRARY_PATH

  3. Modify your deep learning framework (see the sketch after this list):

    • Add #include <ucudnn/ucudnn.h> to the *.cpp, *.cu, and *.h files that contain cudnnHandle_t.
    • Replace cudnnHandle_t with UcudnnHandle_t.

  4. Compile the framework:

    • You need to link libucudnn.so to the framework explicitly by adding the -lucudnn flag.
    • In some frameworks you also need to specify the -std=c++11 flag.
    • For example, the following CMake flags are needed to compile μ-cuDNN-enabled Caffe:
      • -DCMAKE_SHARED_LINKER_FLAGS="-lucudnn"
      • -DCMAKE_EXE_LINKER_FLAGS="-lucudnn"
      • -DCMAKE_CXX_FLAGS="-std=c++11"
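For example, the handle replacement in step 3 usually amounts to a one-line type change. The class below is a hypothetical framework layer, not code from Caffe or TensorFlow:

#include <ucudnn/ucudnn.h>  // brings in the UcudnnHandle_t wrapper type

// Hypothetical convolution layer from a C++ framework.
class ConvolutionLayer {
 private:
  // Before the change: cudnnHandle_t handle_;
  // After the change: the wrapper handle is passed to the same cudnnCreate /
  // cudnnConvolution* calls as before, and μ-cuDNN intercepts them to apply
  // micro-batching transparently.
  UcudnnHandle_t handle_;
};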

Alternatively, μ-cuDNN-enabled forks of Caffe and TensorFlow are available here, so that you can skip steps 3 and 4.

CMake options

  • UCUDNN_USE_GLPK (default OFF): Use the GNU Linear Programming Kit (GLPK) to run ILP-based optimization.
  • UCUDNN_USE_SQLITE (default OFF): Use SQLite to cache benchmark results on the file system.
  • UCUDNN_DEBUG_GLPK_PRINT (default OFF): Output ILP information after solving (as glpk_{sol,mip,prob}_(UNIX time)) to the current directory. The output formats are based on GLPK's glp_print_sol, glp_print_mip, and glp_write_prob functions, respectively.
  • UCUDNN_DEBUG_OUTPUT (default ON): Output optimization results to stderr.
  • UCUDNN_DEBUG_OUTPUT_ENV (default OFF): Output the environment variables used to stderr.
  • UCUDNN_DEBUG_EQUIVALENCE_TEST (default OFF): Compute the normalized L2 distance between the output tensors of cuDNN and μ-cuDNN to check whether the convolutions are equivalent.
  • UCUDNN_TEST (default ON): Build tests.
  • Notes on UCUDNN_DEBUG_EQUIVALENCE_TEST:
    • The normalized distance may be larger than zero due to numerical error and the nondeterministic behavior of some algorithms.
    • In practice the normalized distance should be less than 1e-6.
    • Since this option computes the distance on every call, it considerably slows down computation. Turn it off for practical training or inference.
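For reference, the quantity the equivalence test reports can be sketched as follows. The exact normalization μ-cuDNN uses is an assumption here (difference norm divided by the reference norm):

#include <cmath>
#include <cstddef>

// Sketch of a normalized L2 distance between a reference output (cuDNN) and a
// test output (μ-cuDNN). Assumption: normalized by the reference tensor's norm.
double normalizedL2Distance(const float *ref, const float *test, size_t n) {
  double diffSq = 0.0, refSq = 0.0;
  for (size_t i = 0; i < n; ++i) {
    const double d = static_cast<double>(test[i]) - static_cast<double>(ref[i]);
    diffSq += d * d;
    refSq  += static_cast<double>(ref[i]) * static_cast<double>(ref[i]);
  }
  return std::sqrt(diffSq) / std::sqrt(refSq);  // expected < 1e-6 in practice
}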

Runtime options (environment variables)

  • UCUDNN_BATCH_SIZE_POLICY (one of undivided, powerOfTwo, all; default powerOfTwo): The micro-batch size policy. This can also be set via UcudnnHandle_t::setOptimizerBatchSizePolicy.
  • UCUDNN_BENCHMARK_DEVICES (comma-separated integers, e.g. 0,1,2,3, or all; default: the current device of the process): GPU device ID(s) used for benchmarking in parallel. Note that the optimization will be incorrect if different kinds of GPUs are used at the same time.
  • UCUDNN_DEFAULT_WORKSPACE_LIMIT (integer; default 67108864, i.e. 64 MiB): The default workspace limit in bytes for convolutional layers. This is used only if the framework tries to get the workspace size before the workspace limit is provided.
  • UCUDNN_TOTAL_WORKSPACE_SIZE (integer; no default): If set, μ-cuDNN creates a static workspace of the specified size in bytes for ILP-based optimization. Workspace limits passed via conventional cuDNN functions will be ignored.
  • UCUDNN_DATABASE (path; no default): If set, μ-cuDNN uses a SQLite 3 database at the specified path to cache the benchmark results.
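As a usage sketch, these variables can be exported from the shell before launching training, or set programmatically before the first convolution is configured. The snippet below is illustrative; in particular, passing a UcudnnHandle_t to cudnnCreate/cudnnDestroy is assumed to work exactly like a cudnnHandle_t, per the Installation section, and the chosen values are examples only:

#include <cstdlib>           // setenv (POSIX)
#include <ucudnn/ucudnn.h>

int main() {
  // Illustrative settings, taken from the table above. They must be set
  // before μ-cuDNN benchmarks the first convolution.
  setenv("UCUDNN_BATCH_SIZE_POLICY", "all", 1);        // consider every micro-batch size
  setenv("UCUDNN_BENCHMARK_DEVICES", "0,1,2,3", 1);    // benchmark on four identical GPUs
  setenv("UCUDNN_DATABASE", "/path/to/ucudnn.db", 1);  // cache results (requires UCUDNN_USE_SQLITE)

  UcudnnHandle_t handle;
  cudnnCreate(&handle);      // assumption: drop-in replacement for cudnnHandle_t
  // ... build descriptors and run convolutions as with plain cuDNN ...
  cudnnDestroy(handle);
  return 0;
}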


License

BSD 3-Clause "New" or "Revised" License

