This project aims to provide a lightweight framework for distributed deep-learning training, built on TensorFlow and PyTorch.
It supports two distribution acceleration approaches:
- Parameter Server (PS)
- Collective communications
Take a look at tips/core/ps and tips/core/collective for more details.
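To illustrate the difference between the two approaches, here is a toy Python sketch (not this project's actual API) that contrasts them on gradients from four simulated workers. In the PS approach, workers push gradients to a central server that aggregates and redistributes them; in the collective approach, workers average gradients among themselves (a real ring allreduce pipelines chunks between neighbors, which is collapsed here into a plain sum and divide):

```python
def parameter_server(grads):
    """PS approach: a central server averages the pushed gradients,
    then every worker pulls the same aggregated copy."""
    avg = [sum(g) / len(grads) for g in zip(*grads)]
    return [avg for _ in grads]

def ring_allreduce(grads):
    """Collective approach: workers cooperatively reduce gradients
    among themselves; shown here as a simple sum followed by divide."""
    n = len(grads)
    total = [sum(g) for g in zip(*grads)]
    return [[x / n for x in total] for _ in range(n)]

# Four workers, each holding a gradient vector of two elements.
workers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
assert parameter_server(workers) == ring_allreduce(workers)
```

Both approaches yield the same averaged gradient on every worker; they differ in communication topology (star vs. ring) and therefore in bandwidth and scalability characteristics.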
This was a part-time project I worked on while on sick leave for a fracture. The PS and Collective modules themselves are well developed and tested, but only the distributed-training part has been evaluated with a real TensorFlow ResNet-50 model.
Currently, this project is on hold due to lack of time and further motivation.
- openmpi
Download from https://www.open-mpi.org/software/ompi/v4.1/
- ZeroMQ
apt-get install libzmq3-dev
I read and learned from the following projects:
- Horovod, for its collective modules and its TensorFlow and PyTorch support,
- SwiftSnails, my own project that implements a naive PS without MPI.