
Training Memory-Intensive Deep Learning Models with PyTorch’s Distributed Data Parallel

This is a mini-repository for training a ResNet101 model on the CIFAR10 dataset using distributed training. A link to the main article can be found here.

Getting Started

Prerequisites

Linux (only tested on Linux)
PyTorch
NVIDIA GPU and cuDNN

Installation

Clone this repository:

```
git clone https://github.com/naga-karthik/ddp-resnet-cifar
cd ddp-resnet-cifar
```

Install the necessary packages:
```
pip install -r requirements.txt
```
If you will be running the model on a remote server, it is probably better to download the dataset beforehand rather than on the fly:
- CIFAR10 Dataset
- Create a folder named "data" and move the downloaded dataset into the folder.
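
The pre-download step could also be scripted. The following is a minimal sketch, not part of this repository: it assumes torchvision is installed and that `mainCIFAR10.py` loads the data through `torchvision.datasets.CIFAR10` with `root="data"`, neither of which is confirmed here. The function names `cifar10_is_present` and `predownload_cifar10` are hypothetical.

```python
from pathlib import Path


def cifar10_is_present(root="data"):
    """Return True if the extracted CIFAR10 folder already exists under root.

    "cifar-10-batches-py" is the folder torchvision creates when it
    extracts the CIFAR10 archive.
    """
    return (Path(root) / "cifar-10-batches-py").is_dir()


def predownload_cifar10(root="data"):
    """Download the CIFAR10 train and test splits into root (hypothetical helper)."""
    # Imported lazily so the presence check works without torchvision installed.
    from torchvision.datasets import CIFAR10

    Path(root).mkdir(parents=True, exist_ok=True)
    for train in (True, False):
        # download=True fetches and extracts the archive if it is missing.
        CIFAR10(root=root, train=train, download=True)
```

Calling `predownload_cifar10()` once on a machine with internet access, and guarding it with `cifar10_is_present()`, should leave the files in "data" so that later runs do not need to fetch them.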

Running the model

From the terminal, use the following commands to run the model.

With default settings:
```
python mainCIFAR10.py
```

With other options:

```
python mainCIFAR10.py --n_epochs=100 --lr=0.001 --batch_size=32
```
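
For reference, flags like the ones above are commonly parsed with `argparse`. The sketch below is illustrative only and is not taken from `mainCIFAR10.py`: the flag names come from the command above, while `build_parser`, the default values, and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the three flags shown above (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Train ResNet101 on CIFAR10 (illustrative sketch)"
    )
    # Default values are placeholders, not the script's actual defaults.
    parser.add_argument("--n_epochs", type=int, default=10,
                        help="number of training epochs")
    parser.add_argument("--lr", type=float, default=0.01,
                        help="learning rate")
    parser.add_argument("--batch_size", type=int, default=64,
                        help="batch size")
    return parser


# Parsing the same options as in the command above.
args = build_parser().parse_args(
    ["--n_epochs=100", "--lr=0.001", "--batch_size=32"]
)
print(args.n_epochs, args.lr, args.batch_size)
```

Any flag that is omitted falls back to its default, which is how the "default settings" command would work under this assumption.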
