This project aims to provide a lightweight framework for distributed deep-learning training, built on TensorFlow and PyTorch.
It supports two distribution acceleration approaches:
- Parameter Server (PS)
- Collective communications
Take a look at tips/core/ps and tips/core/collective for more details.
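To illustrate the difference between the two approaches, here is a toy Python sketch (not this project's actual API) that contrasts them on gradients from four simulated workers. In the PS approach, workers push gradients to a central server that aggregates and redistributes them; in the collective approach, workers average gradients among themselves (a real ring allreduce pipelines chunks between neighbors, which is collapsed here into a plain sum and divide):

```python
def parameter_server(grads):
    """PS approach: a central server averages the pushed gradients,
    then every worker pulls the same aggregated copy."""
    avg = [sum(g) / len(grads) for g in zip(*grads)]
    return [avg for _ in grads]

def ring_allreduce(grads):
    """Collective approach: workers cooperatively reduce gradients
    among themselves; shown here as a simple sum followed by divide."""
    n = len(grads)
    total = [sum(g) for g in zip(*grads)]
    return [[x / n for x in total] for _ in range(n)]

# Four workers, each holding a gradient vector of two elements.
workers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
assert parameter_server(workers) == ring_allreduce(workers)
```

Both approaches yield the same averaged gradient on every worker; they differ in communication topology (star vs. ring) and therefore in bandwidth and scalability characteristics.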
This was a part-time project I worked on while on sick leave for a fracture. The PS and Collective modules themselves are well developed and tested, but only the distributed-training part has been evaluated with a real TensorFlow ResNet-50 model.
Currently, this project is on hold due to lack of time and further motivation.
- openmpi
Download from https://www.open-mpi.org/software/ompi/v4.1/
- ZeroMQ
apt-get install libzmq3-dev
I read and learned from the following projects:
- Horovod, for its collective modules and its TensorFlow and PyTorch support,
- SwiftSnails, my own project that implements a naive PS without MPI.