vineeths96 / Heterogeneous-Systems

We present an algorithm to dynamically adjust the data assigned for each worker at every epoch during the training in a heterogeneous cluster. We empirically evaluate the performance of the dynamic partitioning by training deep neural networks on the CIFAR10 dataset.



Distributed Optimization using Heterogeneous Compute Systems

Explore the repository»
View Paper

tags : distributed optimization, large-scale machine learning, heterogeneous systems, edge learning, federated learning, deep learning, pytorch

About The Project

Hardware compute power has been growing at an unprecedented rate in recent years. Utilizing such advancements plays a key role in producing better results in less time -- both in academia and industry. However, merging existing hardware with the latest hardware within the same ecosystem is a challenging task. One of the key challenges, in this case, is varying compute power. In this paper, we consider the training of deep neural networks on a distributed system of workers with varying compute power. A naive implementation of synchronous distributed training will result in the faster workers waiting for the slowest worker to complete processing. To mitigate this issue, we propose to dynamically adjust the data assigned to each worker at every epoch during training. We assign each worker a partition of the total data proportional to its computing power. By adjusting the data partition to the workers, we directly control the workload on the workers. We assign the partitions, and hence the workloads, such that the time taken to process the data partition is almost uniform across the workers. We empirically evaluate the performance of the dynamic partitioning by training deep neural networks on the CIFAR10 dataset. We examine the performance of training the ResNet50 (computation-heavy) and VGG16 (computation-light) models with and without the dynamic partitioning algorithms. Our experiments show that dynamically adjusting the data partition helps to improve the utilization of the system and significantly reduces the time taken for training.
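The core idea above -- sizing each worker's partition so that per-epoch processing time is roughly uniform -- can be sketched as follows. This is only an illustrative sketch, not the repository's implementation; the function name `compute_partition_sizes` and its inputs are hypothetical.

```python
def compute_partition_sizes(epoch_times, total_samples):
    """Assign each worker a share of the data inversely proportional
    to its measured time for the previous epoch, so that the next
    epoch's processing time is roughly uniform across workers.

    epoch_times: list of per-worker wall-clock times (seconds)
    total_samples: total number of training samples to split
    """
    # A worker's effective throughput is ~ 1 / (time taken last epoch).
    throughputs = [1.0 / t for t in epoch_times]
    total_throughput = sum(throughputs)

    # Partition sizes proportional to throughput.
    sizes = [int(total_samples * tp / total_throughput) for tp in throughputs]

    # Hand any rounding remainder to the fastest worker.
    sizes[throughputs.index(max(throughputs))] += total_samples - sum(sizes)
    return sizes
```

For example, a worker that took 2 s per epoch would receive twice as many samples as one that took 4 s, so both finish the next epoch at about the same time.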

Built With

This project was built with

  • Python v3.7.6
  • PyTorch v1.7.1
  • The environment used for developing this project is available at environment.yml.

Getting Started

Clone the repository into a local machine using,

git clone https://github.com/vineeths96/Heterogeneous-Systems
cd Heterogeneous-Systems/

Prerequisites

Create a new conda environment and install all the libraries by running the following command

conda env create -f environment.yml

The dataset used in this project (CIFAR10) will be automatically downloaded and set up in the data directory during execution.

Instructions to run

The training of the models can be performed on a distributed cluster with multiple machines and multiple worker GPUs. We make use of torch.distributed.launch to launch the distributed training. More information is available here.

To launch distributed training on a single machine with multiple workers (GPUs),

python -m torch.distributed.launch --nproc_per_node=<num_gpus> trainer.py --local_world_size=<num_gpus> 

To launch distributed training on multiple machines with multiple workers (GPUs),

export NCCL_SOCKET_IFNAME=ens3

python -m torch.distributed.launch --nproc_per_node=<num_gpus> --nnodes=<num_machines> --node_rank=<node_rank> --master_addr=<master_address> --master_port=<master_port> trainer.py --local_world_size=<num_gpus>

Model overview

We conducted experiments on the ResNet50 and VGG16 architectures. Refer to the original papers for more information about the models. We use publicly available implementations from GitHub for reproducing the models.

Results

We highly recommend reading through the paper before proceeding to this section. The paper explains the dynamic partitioning schemes we propose and contains many more analyses and results than are presented here.

We begin with an explanation of the notations used for the plot legends in this section. Sync-SGD corresponds to the default gradient aggregation provided by PyTorch. DP-SGD and EDP-SGD correspond to Dynamic Partitioning and Enhanced Dynamic Partitioning respectively. We artificially simulate heterogeneity by adding time delays to a subset of workers. We evaluate the algorithms for a low level of heterogeneity and a high level of heterogeneity.
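The delay-based simulation of heterogeneity mentioned above can be sketched in a few lines. This is a hypothetical illustration, not the repository's code; the wrapper name `train_step_with_delay` is invented for this example.

```python
import time


def train_step_with_delay(step_fn, delay_seconds):
    """Wrap a training-step function so the worker appears slower.

    A subset of workers uses a nonzero delay_seconds, creating an
    artificial compute-power gap between workers in the cluster.
    """
    def delayed_step(*args, **kwargs):
        out = step_fn(*args, **kwargs)
        time.sleep(delay_seconds)  # artificial slowdown per step
        return out
    return delayed_step
```

Varying `delay_seconds` across workers then gives the low- and high-heterogeneity regimes evaluated in the plots.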

[Plots: loss curves for ResNet50 and VGG16, without dynamic partitioning (AR) and with dynamic partitioning (DP)]

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Vineeth S - vs96codes@gmail.com

Project Link: https://github.com/vineeths96/Heterogeneous-Systems
