torchbraid

XBraid interface to PyTorch

Build torchbraid (pip):

Optional: create a new virtual environment

python -m venv pip-test
source pip-test/bin/activate
Install using pip. From inside the torchbraid directory, run
pip install .

If a development environment is desired, do
pip install -e .
Changes to .py files then take effect in the installation immediately. Changes to .pyx files require a reinstallation.

You can also install directly from github using
pip install git+ssh://git@github.com/Multilevel-NN/torchbraid.git
or the HTTPS equivalent.
Run unit tests (may need to install tox)
tox

The tox package is used for continuous-integration testing; it automatically creates and populates a new Python environment. If you already have an environment that satisfies the dependency requirements, you can run the test commands directly in it using tox-direct.
1. Install tox-direct: `pip install tox-direct`
2. Run the test commands: `tox --direct`
Test run
cd examples/mnist/
mpirun -n 2 python mnist_script.py --percent-data 0.01

Build torchbraid (Makefile):

Requirements:

python libs: cython mpi4py pytorch
build of xbraid
MPI compiler

Conda environments can be found in the 'torchbraid/env' directories. These can be used to get a consistent conda environment for using torchbraid. The one caveat is that mpi4py should be installed consistently with the MPI compiler. In some cases, running 'pip install mpi4py' is preferable to installing it through conda: conda installs an alternate MPI compiler and library, and you might want mpi4py to use the native one on your platform.

Note, virtual environments can be used instead of Conda.

Note, the Cython version matters, particularly if torch layers are shipped directly by braid.

Setup for Conda (with native MPI support)

conda env create -f ${TORCHBRAID_DIR}/env/py37.env
conda activate py37
MPICC=path/to/mpicc pip install mpi4py
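
After the install, it can help to confirm that mpi4py was built against the MPI you intended. The following is a minimal sketch, not part of torchbraid: the function name `describe_mpi_setup` is hypothetical, and the "mpicc" key may be absent from mpi4py's recorded build configuration depending on how it was installed.

```python
import shutil


def describe_mpi_setup():
    """Collect what the shell and mpi4py each report about the MPI toolchain."""
    info = {
        # Compiler wrapper found first on PATH (None if there is none).
        "mpicc_on_path": shutil.which("mpicc"),
        "mpi4py_mpicc": None,
        "mpi_library": None,
    }
    try:
        import mpi4py
        from mpi4py import MPI
    except ImportError:
        # mpi4py is not installed (or its MPI library could not be loaded).
        return info
    # Build-time configuration recorded by mpi4py; may not contain "mpicc".
    info["mpi4py_mpicc"] = mpi4py.get_config().get("mpicc")
    # Version string of the MPI library actually loaded at run time.
    info["mpi_library"] = MPI.Get_library_version().strip()
    return info


if __name__ == "__main__":
    for key, value in describe_mpi_setup().items():
        print(f"{key}: {value}")
```

If the library version printed here names a different MPI than the one your platform provides, mpi4py was probably built against the conda-supplied MPI rather than the native one.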

Build xbraid:

Download from git@github.com:XBraid/xbraid.git
The master branch should work fine
From the xbraid directory run make debug=no braid

Build torchbraid

Copy makefile.inc.example to makefile.inc
Modify makefile.inc to include your build specifics
Type make
You will need to add the torchbraid directory to your Python path, e.g.:
1. export PYTHONPATH=${PYTHONPATH}:/path/to/torchbraid/src This ensures that the Python module search path is set up.
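
To check that the path setting took effect, you can ask Python whether it can locate the module without importing it. This is a small sketch; the helper `module_visible` is hypothetical, and it assumes the importable package is named `torchbraid`.

```python
import importlib.util
import sys


def module_visible(name):
    """Return True if `name` can be located on the current module search path."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Assumption: the package under /path/to/torchbraid/src is named "torchbraid".
    if module_visible("torchbraid"):
        print("torchbraid found")
    else:
        print("torchbraid not found; the search path is:")
        for entry in sys.path:
            print("  ", entry)
```

Run it in the same shell where PYTHONPATH was exported, since the variable is only visible to processes started from that shell.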

Take a look at the code in the examples directory.

To test:

make tests
make tests-serial

To clean the directory:

make clean

GPU direct communication

Torchbraid uses direct GPU communication when running simulations on GPUs. For this, Torchbraid requires a CUDA-aware MPI version (see here or here for more information). A simple first test to determine if your system supports CUDA-aware MPI is to execute the command

ompi_info --parsable -l 9 --all | grep mpi_built_with_cuda_support:value

This command returns a string with true or false at the end. However, in our experiments, it was not always sufficient to check that this value is true. One way to test whether direct GPU communication works on your system is to run:

make tests-direct-gpu

If the test works, your MPI version supports direct GPU communication. If the test throws an error (typically a segmentation fault), your MPI version does not support direct GPU communication.
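
The ompi_info check described above can also be scripted, for example in a job-setup script. The sketch below is illustrative only: the function names are hypothetical, and the parser relies solely on the fact stated above that the matching line ends in true or false.

```python
import shutil
import subprocess


def parse_cuda_support(text):
    """Return True/False from ompi_info output, or None if the key is absent."""
    for line in text.splitlines():
        if "mpi_built_with_cuda_support:value" in line:
            # The value is the last colon-separated field on the line.
            return line.strip().rsplit(":", 1)[-1].lower() == "true"
    return None


def query_cuda_support():
    """Run ompi_info if it is installed; return None when it is unavailable."""
    if shutil.which("ompi_info") is None:
        return None
    result = subprocess.run(
        ["ompi_info", "--parsable", "-l", "9", "--all"],
        capture_output=True, text=True, check=False,
    )
    return parse_cuda_support(result.stdout)


if __name__ == "__main__":
    print("CUDA support reported by ompi_info:", query_cuda_support())
```

A result of True is only a first indication; as noted above, running make tests-direct-gpu remains the more reliable check.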

Publications

Moon, Gordon Euhyun, and Eric C. Cyr. "Parallel Training of GRU Networks with a Multi-Grid Solver for Long Sequences." ICLR, 2022.
Cyr, Eric C., Stefanie Günther, and Jacob B. Schroder. "Multilevel Initialization for Layer-Parallel Deep Neural Network Training." arXiv preprint arXiv:1912.08974 (2019).
Günther, Stefanie, Lars Ruthotto, Jacob B. Schroder, Eric C. Cyr, and Nicolas R. Gauger. "Layer-parallel training of deep residual neural networks." SIAM Journal on Mathematics of Data Science 2, no. 1 (2020): 1-23.
