
Multi-GPU Training with PyTorch: Data and Model Parallelism

About

The material in this repo demonstrates multi-GPU training using PyTorch. Part 1 covers how to optimize single-GPU training; the later parts show the code changes needed to enable multi-GPU training with the data-parallel and model-parallel approaches. This workshop aims to prepare researchers to use the new H100 GPU nodes that are part of Princeton Language and Intelligence.
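To give a flavor of the data-parallel approach described above, here is a minimal sketch of one training step using PyTorch's DistributedDataParallel (DDP). The toy model, tensor shapes, and hyperparameters are illustrative, not taken from this repo. For simplicity it runs as a single CPU process with the "gloo" backend; on a GPU cluster you would launch one process per GPU (e.g. with torchrun) and use the "nccl" backend instead.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step():
    # Single-process setup so the sketch runs without a GPU or a launcher.
    # MASTER_ADDR/MASTER_PORT are normally set by torchrun or Slurm.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = torch.nn.Linear(10, 1)               # toy model (illustrative)
    ddp_model = DDP(model)                       # wraps the model; gradients are
                                                 # all-reduced across ranks
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    x = torch.randn(8, 10)                       # each rank sees its own batch
    loss = ddp_model(x).pow(2).mean()
    loss.backward()                              # gradient synchronization
                                                 # happens during backward
    optimizer.step()

    dist.destroy_process_group()
    return float(loss)

loss_value = train_step()
print(f"loss after one step: {loss_value:.4f}")
```

The key point the workshop expands on: with DDP, each process keeps a full copy of the model and trains on a different shard of the data, and only the gradients are communicated.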

Setup

Make sure you can run Python on Adroit:

$ ssh <YourNetID>@adroit.princeton.edu  # VPN required if off-campus
$ git clone https://github.com/PrincetonUniversity/multi_gpu_training.git
$ cd multi_gpu_training
$ module load anaconda3/2023.9
(base) $ python --version
Python 3.11.5
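Once the environment works interactively, multi-GPU jobs on the cluster are submitted through Slurm. The script below is a hypothetical sketch of such a job file, assuming a conda environment named `torch-env` and a GPU partition; the actual partition name, account, and module versions for your cluster may differ (the `anaconda3/2023.9` module is the one loaded above).

```shell
#!/bin/bash
#SBATCH --job-name=ddp-demo        # job name shown in the queue
#SBATCH --nodes=1                  # one node
#SBATCH --ntasks-per-node=1        # torchrun spawns one process per GPU
#SBATCH --gres=gpu:2               # request 2 GPUs (adjust to your node)
#SBATCH --time=00:10:00            # walltime

module purge
module load anaconda3/2023.9
conda activate torch-env           # assumed environment name

# torchrun launches one worker per GPU and sets the rendezvous
# variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) for DDP.
torchrun --standalone --nproc_per_node=2 train.py
```

`train.py` here stands for whichever training script you are running; the repo's own examples walk through concrete versions of it.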

Getting Help

If you encounter any difficulties with the material in this guide, please send an email to cses@princeton.edu or attend a help session.

Authorship

This guide was created by Mengzhou Xia, Alexander Wettig, and Jonathan Halverson. Members of Princeton Research Computing also contributed to this material.
