liaopeiyuan / heterogeneous-ml-benchmarks

Training Memory-Intensive Deep Learning Models with PyTorch’s Distributed Data Parallel

This is a mini-repository for training a ResNet101 model on the CIFAR10 dataset using distributed training. A link to the main article can be found here.

Getting Started

Prerequisites

  1. Linux (the only platform tested)
  2. PyTorch
  3. NVIDIA GPU and cuDNN

Installation

  1. Clone this repository:

    git clone https://github.com/naga-karthik/ddp-resnet-cifar
    cd ddp-resnet-cifar
  2. Install the necessary packages:

    pip install -r requirements.txt
  3. If you will be running it on a remote server, it is probably better to download the dataset in advance than to download it on the fly.

    • CIFAR10 Dataset

    • Create a folder named "data" and move the downloaded dataset into the folder.
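The pre-download step above can be sketched in Python. This is a minimal sketch, not the repository's own code: it assumes the training script loads CIFAR10 through torchvision with `root="data"`, and the helper names `cifar10_is_present` and `download_cifar10` are hypothetical. The file names are those torchvision extracts from the CIFAR10 archive.

```python
import os

# Folder and file names that torchvision's CIFAR10 loader extracts from the archive.
CIFAR10_DIR = "cifar-10-batches-py"
CIFAR10_FILES = [f"data_batch_{i}" for i in range(1, 6)] + ["test_batch", "batches.meta"]


def cifar10_is_present(root="data"):
    """Return True if every extracted CIFAR10 file exists under root.

    This only checks that the files exist; it does not verify checksums.
    """
    base = os.path.join(root, CIFAR10_DIR)
    return all(os.path.isfile(os.path.join(base, name)) for name in CIFAR10_FILES)


def download_cifar10(root="data"):
    """Download both splits once, e.g. on a machine with internet access."""
    # Imported lazily so the presence check works without torchvision installed.
    from torchvision import datasets

    os.makedirs(root, exist_ok=True)
    datasets.CIFAR10(root=root, train=True, download=True)
    datasets.CIFAR10(root=root, train=False, download=True)
```

Calling `download_cifar10()` once before submitting a job, and then checking `cifar10_is_present()` on the server, is one way to avoid several processes trying to download the same archive at the same time.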

Running the model

From the terminal, use the following commands to run the model.

  1. With default settings:

    python mainCIFAR10.py
  2. With other options:

    python mainCIFAR10.py --n_epochs=100 --lr=0.001 --batch_size=32
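For readers wondering how the options above might be consumed, here is a minimal `argparse` sketch. Only the flag names `--n_epochs`, `--lr`, and `--batch_size` come from the commands above; the default values and the `build_parser` helper are placeholders chosen for illustration and may differ from what `mainCIFAR10.py` actually uses.

```python
import argparse


def build_parser():
    # Flag names are taken from the README; the defaults are illustrative placeholders.
    parser = argparse.ArgumentParser(description="ResNet101 on CIFAR10 (sketch)")
    parser.add_argument("--n_epochs", type=int, default=10, help="number of training epochs")
    parser.add_argument("--lr", type=float, default=0.01, help="learning rate")
    parser.add_argument("--batch_size", type=int, default=64, help="mini-batch size")
    return parser


# Parse the same options shown in the second command above.
args = build_parser().parse_args(["--n_epochs=100", "--lr=0.001", "--batch_size=32"])
print(args.n_epochs, args.lr, args.batch_size)
```

Passing an explicit list to `parse_args` keeps the sketch runnable outside the command line; a real script would call `parse_args()` with no arguments so that it reads the terminal options.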
