sadamov / ddp_starter

A minimal working setup for training ML models with PyTorch using DDP across multiple nodes.

DDP Starter

This is a starter project for distributed deep learning with PyTorch and Slurm.

Prerequisites

Installation

Clone the latest code from GitHub:

git clone git@github.com:sadamov/ddp_starter.git
cd ddp_starter

Create a new conda environment and install the dependencies (using mamba):

mamba env create -f environment.yml
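
The repository's environment.yml is authoritative; purely as an illustration, a minimal file for a PyTorch Lightning DDP project could look like the sketch below (the package list here is an assumption, not the repo's actual contents):

```yaml
# Hypothetical sketch; see the environment.yml in the repo for the real spec.
name: ddp_starter
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pytorch
  - lightning
```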

Usage

sbatch test_slurm.sh
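
The actual launch configuration lives in test_slurm.sh in the repo; as a rough orientation, a typical multi-node Lightning submission script has this shape (node and GPU counts, and the `train.py` entry point, are placeholders, not the repo's real values):

```bash
#!/bin/bash
#SBATCH --job-name=ddp_starter
#SBATCH --nodes=2                # placeholder: two nodes
#SBATCH --ntasks-per-node=4      # one task per GPU
#SBATCH --gpus-per-node=4
#SBATCH --output=slurm-%j.out

# Lightning's SLURMEnvironment reads rank and world size from the
# variables Slurm sets, so a plain srun of the training script suffices.
srun python train.py
```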

Then check out the logs in ./lightning_logs to see if the run was successful. The metrics.csv contains the training and validation losses across all epochs.

A final log line of

`Trainer.fit` stopped: `max_epochs=10` reached.

indicates that the run completed successfully.
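
To inspect the losses in metrics.csv without opening it by hand, a short script can collect one metric column. The column names used below (`train_loss`, `val_loss`) are assumptions; Lightning's CSVLogger writes whatever metric names your module logs, so adjust them to match your file's header.

```python
import csv


def read_metric(path, column):
    """Collect all non-empty values of one metric column from a
    Lightning CSVLogger metrics.csv file."""
    values = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            cell = row.get(column, "")
            if cell:  # Lightning leaves cells blank on rows logged for other metrics
                values.append(float(cell))
    return values


# Hypothetical usage; the path and column names depend on your setup:
# val_losses = read_metric("lightning_logs/version_0/metrics.csv", "val_loss")
# print(f"best val_loss: {min(val_losses):.4f}")
```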

For real training runs, you will need to adjust batch_size and num_workers in the dataloaders of the IrisDataModule class to make the best use of the available GPU and CPU resources.
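
When picking those values it helps to reason about the global batch explicitly: under DDP each rank sees 1/world_size of the data, and each dataloader worker occupies a CPU. A small helper like the one below (the function and its environment-variable defaults are illustrative, not part of this repo) derives per-rank settings from a target global batch size:

```python
import os


def per_rank_loader_config(global_batch_size):
    """Derive DataLoader settings for one DDP rank.

    Under DDP the effective batch is batch_size * world_size, so the
    per-rank batch_size is the global batch divided by WORLD_SIZE.
    num_workers is capped by the CPUs Slurm allocated to this task.
    """
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    cpus_per_task = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
    batch_size = max(1, global_batch_size // world_size)
    # Leave one CPU for the main process; never go below zero workers.
    num_workers = max(0, cpus_per_task - 1)
    return {"batch_size": batch_size, "num_workers": num_workers}
```

These values would then be passed through to the `DataLoader` calls inside the data module.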

About

A minimal working setup for training ML models with PyTorch using DDP across multiple nodes.

License: BSD 3-Clause "New" or "Revised" License


Languages

Python 87.7%, Shell 12.3%