Awesome Distributed Deep Learning

A curated list of awesome Distributed Deep Learning resources.

Frameworks

Blogs

Papers

Frameworks

MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, JavaScript and more.
go-mxnet-predictor - Go binding for the MXNet c_predict_api, for running inference with pre-trained models.
deeplearning4j - Distributed Deep Learning Platform for Java, Clojure, Scala.
Distributed Machine Learning Toolkit (DMTK) - A distributed machine learning (parameter server) framework by Microsoft. Enables training models on large data sets across multiple machines. Tools currently bundled with it include LightLDA and Distributed (Multisense) Word Embedding.
Elephas - An extension of Keras that lets you run distributed deep learning models at scale with Spark.
Horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Blogs

Papers

General:

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis: Discusses the different types of concurrency in DNNs, synchronous and asynchronous stochastic gradient descent, distributed system architectures, communication schemes, and performance modeling. Based on these approaches, it also extrapolates potential directions for parallelism in deep learning.
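The distinction between synchronous and asynchronous SGD mentioned above can be illustrated with a toy, single-process sketch. Everything here is an assumption made for illustration (the one-parameter model, the function names, the learning rate); it is not taken from the paper. The "asynchronous" variant is simulated sequentially with zero staleness, whereas real asynchronous systems let workers compute gradients on parameters that may already be outdated.

```python
# Toy illustration only: all names and values are hypothetical.

def gradient(w, shard):
    """Mean gradient of 0.5 * (w * x - y)**2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def sync_step(w, shards, lr):
    """Synchronous step: every worker reads the same w, then gradients are averaged."""
    grads = [gradient(w, shard) for shard in shards]  # conceptually computed in parallel
    avg_grad = sum(grads) / len(grads)                # stands in for an all-reduce
    return w - lr * avg_grad


def async_step(w, shards, lr):
    """Asynchronous-style step: each worker's update is applied as soon as it arrives.

    Simulated sequentially with zero staleness, so later workers see earlier updates.
    """
    for shard in shards:
        w = w - lr * gradient(w, shard)
    return w


data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]  # targets follow y = 3x
shards = [data[:2], data[2:]]                         # two equally sized shards

w_sync = 0.0
w_async = 0.0
for _ in range(200):
    w_sync = sync_step(w_sync, shards, lr=0.05)
    w_async = async_step(w_async, shards, lr=0.05)
```

With equally sized shards, the averaged gradient in `sync_step` equals the full-batch gradient, which is why synchronous data parallelism behaves like large-batch SGD. Both variants reach `w` close to 3 on this toy problem.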

Model Consistency:

Synchronization:

Synchronous techniques:

Deep learning with COTS HPC systems: Trains large networks with Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with InfiniBand interconnects and MPI.
FireCaffe: near-linear acceleration of deep neural network training on compute clusters: The speed and scalability of distributed algorithms are almost always limited by the overhead of communicating between servers, and DNN training is no exception. The paper's key consideration is therefore to reduce communication overhead wherever possible without degrading the accuracy of the trained DNN models.
SparkNet: Training Deep Networks in Spark. In Proceedings of the International Conference on Learning Representations (ICLR).
1-Bit SGD: 1-Bit Stochastic Gradient Descent and Its Application to Data-Parallel Distributed Training of Speech DNNs. In Interspeech 2014.
Scalable Distributed DNN Training Using Commodity GPU Cloud Computing: Introduces a new method for scaling up distributed Stochastic Gradient Descent (SGD) training of Deep Neural Networks (DNN). The method addresses the well-known communication bottleneck that arises in data-parallel SGD because compute nodes frequently need to synchronize a replica of the model.
Multi-GPU Training of ConvNets: Training of ConvNets on multiple GPUs.
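Several entries above reduce the cost of synchronizing gradients. As one illustration, the idea behind the 1-Bit SGD entry (quantize each gradient element to a single bit and carry the quantization error forward into the next step) might be sketched as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, and using the mean of each sign group as the reconstruction value is an illustrative choice, not necessarily the paper's exact scheme.

```python
def one_bit_quantize(grad, residual):
    """Quantize a gradient to one bit per element with error feedback.

    grad and residual are equal-length lists of floats. Returns the sign bits,
    the two reconstruction values, the decoded gradient, and the new residual.
    """
    # Add the error left over from the previous step before quantizing.
    corrected = [g + r for g, r in zip(grad, residual)]
    pos = [c for c in corrected if c >= 0.0]
    neg = [c for c in corrected if c < 0.0]
    # Reconstruction values: mean of each sign group (illustrative assumption).
    pos_val = sum(pos) / len(pos) if pos else 0.0
    neg_val = sum(neg) / len(neg) if neg else 0.0
    bits = [c >= 0.0 for c in corrected]              # what would be transmitted
    decoded = [pos_val if b else neg_val for b in bits]
    # Whatever the decoded gradient misses is remembered for the next step.
    new_residual = [c - d for c, d in zip(corrected, decoded)]
    return bits, (pos_val, neg_val), decoded, new_residual


grads = [[0.5, -0.25, 1.5, -0.75], [0.1, 0.2, -0.3, 0.4]]
residual = [0.0] * 4
applied = [0.0] * 4
for grad in grads:
    bits, values, decoded, residual = one_bit_quantize(grad, residual)
    applied = [a + d for a, d in zip(applied, decoded)]
```

Because the residual is fed back, the sum of the decoded gradients plus the final residual equals the sum of the original gradients, so no gradient information is permanently dropped; it is only delayed.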

Stale-Synchronous techniques:

Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study.
A Fast Learning Algorithm for Deep Belief Nets.
Heterogeneity-aware Distributed Parameter Servers: J. Jiang, B. Cui, C. Zhang, and L. Yu. 2017. Heterogeneity-aware Distributed Parameter Servers. In Proc. 2017 ACM International Conference on Management of Data (SIGMOD ’17). 463–478.
Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization: X. Lian, Y. Huang, Y. Li, and J. Liu. 2015. Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization. In Proc. 28th Int’l Conf. on NIPS - Volume 2. 2737–2745.
Staleness-Aware Async-SGD for Distributed Deep Learning: W. Zhang, S. Gupta, X. Lian, and J. Liu. 2016. Staleness-aware async-SGD for Distributed Deep Learning. In Proc. Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI’16). 2350–2356.

Asynchronous techniques:

A Unified Analysis of HOGWILD!-style Algorithms: C. De Sa, C. Zhang, K. Olukotun, and C. Ré. 2015. Taming the Wild: A Unified Analysis of HOGWILD!-style Algorithms. In Proc. 28th Int’l Conf. on NIPS - Volume 2. 2674–2682.
Large Scale Distributed Deep Networks: J. Dean et al. 2012. Large Scale Distributed Deep Networks. In Proc. 25th International Conference on Neural Information Processing Systems - Volume 1 (NIPS’12). 1223–1231.
Asynchronous Parallel Stochastic Gradient Descent: J. Keuper and F. Pfreundt. 2015. Asynchronous Parallel Stochastic Gradient Descent: A Numeric Core for Scalable Distributed Machine Learning Algorithms. In Proc. Workshop on MLHPC. 1:1–1:11.
Dogwild!-Distributed Hogwild for CPU & GPU: C. Noel and S. Osindero. 2014. Dogwild!-Distributed Hogwild for CPU & GPU. In NIPS Workshop on Distributed Machine Learning and Matrix Computations.
GPU Asynchronous Stochastic Gradient Descent to Speed Up Neural Network Training: T. Paine, H. Jin, J. Yang, Z. Lin, and T. S. Huang. 2013. GPU Asynchronous Stochastic Gradient Descent to Speed Up Neural Network Training. arXiv:1312.6186
HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent: B. Recht, C. Ré, S. Wright, and F. Niu. 2011. HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In Advances in Neural Information Processing Systems 24. 693–701.
Asynchronous stochastic gradient descent for DNN training: S. Zhang, C. Zhang, Z. You, R. Zheng, and B. Xu. 2013. Asynchronous stochastic gradient descent for DNN training. In IEEE International Conference on Acoustics, Speech and Signal Processing. 6660–6663.

Non-Deterministic Communication:

GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent.
How to scale distributed deep learning?
Heterogeneity-aware Distributed Parameter Servers: A study of distributed machine learning in heterogeneous environments.

Parameter Distribution and Communication:

Centralization:

Parameter Server (PS):

GeePS: Scalable Deep Learning on Distributed GPUs with a GPU-specialized Parameter Server.
FireCaffe: F. N. Iandola, M. W. Moskewicz, K. Ashraf, and K. Keutzer. 2016. FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
DeepSpark: H. Kim et al. 2016. Spark-Based Deep Learning Supporting Asynchronous Updates and Caffe Compatibility. (2016).
Scaling Distributed Machine Learning with the Parameter Server: M. Li et al. 2014. Scaling Distributed Machine Learning with the Parameter Server. In Proc. 11th USENIX Conference on Operating Systems Design and Implementation (OSDI’14). 583–598.
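The centralized parameter-server pattern shared by the entries above can be sketched as a pull/push loop. The sketch below is a single-process toy under stated assumptions: the `ParameterServer` class, the `worker_gradients` helper, the linear model, and the learning rate are all hypothetical and do not reflect the API of any system listed here.

```python
class ParameterServer:
    """Hypothetical in-process stand-in for a central parameter server."""

    def __init__(self, params, lr):
        self.params = dict(params)
        self.lr = lr

    def pull(self):
        """Hand a copy of the current parameters to a worker."""
        return dict(self.params)

    def push(self, grads):
        """Apply one worker's gradients to the central parameters."""
        for name, g in grads.items():
            self.params[name] -= self.lr * g


def worker_gradients(params, shard):
    """Mean gradient of 0.5 * (w * x + b - y)**2 over one worker's shard."""
    n = len(shard)
    errs = [(params["w"] * x + params["b"] - y, x) for x, y in shard]
    return {
        "w": sum(e * x for e, x in errs) / n,
        "b": sum(e for e, _ in errs) / n,
    }


server = ParameterServer({"w": 0.0, "b": 0.0}, lr=0.05)
# Each inner list is the data held by one worker; targets follow y = 2x + 1.
shards = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]

for _ in range(2000):
    for shard in shards:
        # Worker loop: pull parameters, compute local gradients, push them back.
        server.push(worker_gradients(server.pull(), shard))
```

On this toy problem the server's parameters approach `w = 2` and `b = 1`. Real parameter servers add what this sketch omits: network transport, sharding of parameters across server nodes, fault tolerance, and consistency controls.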

Sharded PS:

Project Adam: T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. 2014. Building an Efficient and Scalable Deep Learning Training System. In 11th USENIX Symposium on Operating Systems Design and Implementation. 571–582.
Large Scale Distributed Deep Networks: J. Dean et al. 2012. Large Scale Distributed Deep Networks. In Proc. 25th International Conference on Neural Information Processing Systems - Volume 1 (NIPS’12). 1223–1231.
Heterogeneity-aware Distributed Parameter Servers: J. Jiang, B. Cui, C. Zhang, and L. Yu. 2017. Heterogeneity-aware Distributed Parameter Servers. In Proc. 2017 ACM International Conference on Management of Data (SIGMOD ’17). 463–478.
Building High-level Features Using Large Scale Unsupervised Learning: Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. 2012. Building High-level Features Using Large Scale Unsupervised Learning. In Proc. 29th Int’l Conf. on Machine Learning (ICML’12). 507–514.
Deep Learning at 15PF: Supervised and Semi-Supervised Classification for Scientific Data: T. Kurth et al. 2017. Deep Learning at 15PF: Supervised and Semi-supervised Classification for Scientific Data. In Proc. Int’l Conf. for High Performance Computing, Networking, Storage and Analysis (SC ’17). 7:1–7:11.
Petuum: E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu. 2015. Petuum: A New Platform for Distributed Machine Learning on Big Data. IEEE Transactions on Big Data 1, 2 (2015), 49–67.
Poseidon: H. Zhang, Z. Hu, J. Wei, P. Xie, G. Kim, Q. Ho, and E. P. Xing. 2015. Poseidon: A System Architecture for Efficient GPU-based Deep Learning on Multiple Machines. (2015). arXiv:1512.06216

Hierarchical PS:

Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study: S. Gupta, W. Zhang, and F. Wang. 2016. Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study. In IEEE 16th International Conference on Data Mining (ICDM). 171–180.
Gaia: K. Hsieh, A. Harlap, N. Vijaykumar, D. Konomis, G. R. Ganger, P. B. Gibbons, and O. Mutlu. 2017. Gaia: Geo-distributed Machine Learning Approaching LAN Speeds. In Proc. 14th USENIX Conf. on NSDI. 629–647.
Using Supercomputer to Speed up Neural Network Training: Y. Yu, J. Jiang, and X. Chi. 2016. Using Supercomputer to Speed up Neural Network Training. In IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS). 942–947.

Decentralized:

Compression:

Quantization:

Sparsification:

Other Methods:

Training Distribution:

Model Consolidation:

Ensemble Learning:

Knowledge Distillation:

Model Averaging:

Direct:

Elastic:

Natural Gradient:

Optimization Algorithms:

First-Order:

Second-Order:

Evolutionary:

Hyper-Parameter Search:

Architecture Search:

Reinforcement:

Evolutionary:

SMBO:

Feedback: If you have any ideas or you want any other content to be added to this list, feel free to contribute.
