Distributed Training in PyTorch

There are several distributed training setups you can try, as described in the PyTorch documentation.

PyTorch provides several options for data-parallel training. For applications that gradually grow from simple to complex and from prototype to production, the common development trajectory would be:

  1. Use single-device training, if the data and model can fit in one GPU, and the training speed is not a concern.
  2. Use single-machine multi-GPU DataParallel, if there are multiple GPUs on the server, and you would like to speed up training with the minimum code change. Use single-machine multi-GPU DistributedDataParallel, if you would like to further speed up training and are willing to write a little more code to set it up.
  3. Use multi-machine DistributedDataParallel and the launching script, if the application needs to scale across machine boundaries.
  4. Use torchelastic to launch distributed training, if errors (e.g., OOM) are expected or if the resources can join and leave dynamically during the training.

In this repo, I compare single-device training (option 1) with single-machine multi-GPU DataParallel and single-machine multi-GPU DistributedDataParallel (option 2).

Environment

  • NVIDIA RTX 2080 Ti × 2
  • torch==1.7.1
  • torchvision==0.8.2

All dependencies are listed in requirements.txt, and a Dockerfile is also provided.
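For reference, building and running the image usually looks something like this (the image tag dist-comparison is arbitrary, not something defined by this repo):

$ docker build -t dist-comparison .
$ docker run --gpus all --ipc=host -it dist-comparison

--gpus all exposes the host GPUs to the container (it requires the NVIDIA container toolkit), and --ipc=host avoids shared-memory limits when DataLoader workers are used.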

How to Run

The three folders (src/single/, src/dp/, and src/ddp/) are self-contained and independent of one another.

Single

$ sh src/single/run_single.sh
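Single-device training is the ordinary PyTorch loop on one GPU. The sketch below is illustrative only; the hyperparameters and data pipeline are assumptions, not the settings used in src/single/.

```python
import torch
import torchvision
from torchvision import transforms

device = torch.device("cuda:0")
model = torchvision.models.resnet18(num_classes=100).to(device)  # CIFAR-100 has 100 classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

train_set = torchvision.datasets.CIFAR100(root="data", train=True, download=True,
                                          transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True,
                                           num_workers=4)

# One epoch of the standard training loop.
for images, targets in train_loader:
    images, targets = images.to(device), targets.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
```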

DataParallel

$ sh src/dp/run_dp.sh
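DataParallel needs only a one-line change on top of the single-device script: wrap the model, and each batch is split across the visible GPUs inside a single process. A minimal sketch (not the exact code in src/dp/):

```python
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(num_classes=100).cuda()
model = nn.DataParallel(model)  # scatters each input batch across all visible GPUs

# The training loop itself is unchanged; outputs are gathered back on GPU 0,
# which is part of why DP is usually slower than DDP on the same hardware.
```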

DistributedDataParallel

$ sh src/ddp/run_ddp.sh
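DistributedDataParallel runs one process per GPU, so the script needs a process group, a per-rank device, and a DistributedSampler. The sketch below assumes the script is launched with torch.distributed.launch (the standard launcher for torch 1.7, which passes --local_rank); see src/ddp/run_ddp.sh for the actual command.

```python
import argparse
import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import transforms

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by torch.distributed.launch
args = parser.parse_args()

dist.init_process_group(backend="nccl")  # one process per GPU, NCCL for GPU collectives
torch.cuda.set_device(args.local_rank)

model = torchvision.models.resnet18(num_classes=100).cuda(args.local_rank)
model = DDP(model, device_ids=[args.local_rank])  # gradients are all-reduced across ranks

train_set = torchvision.datasets.CIFAR100(root="data", train=False if False else True,
                                          download=False,  # assume the dataset is already downloaded
                                          transform=transforms.ToTensor())
sampler = DistributedSampler(train_set)  # gives each rank a disjoint shard of the data
train_loader = DataLoader(train_set, batch_size=128,  # per-process batch size (assumption)
                          sampler=sampler, num_workers=4)

for epoch in range(10):  # epoch count is illustrative
    sampler.set_epoch(epoch)  # reshuffle the shards every epoch
    for images, targets in train_loader:
        ...  # same forward/backward/step as the single-device loop
```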

Result

Batch size is set to 128 or 256. SyncBatchNorm is recommended for DDP training, but I used vanilla BatchNorm, so DDP was trained only with a batch size of 256. The best model is selected according to validation top-1 accuracy.
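For reference, converting a model to SyncBatchNorm before wrapping it in DDP is a one-liner; this is only a sketch of the API and was not done in these experiments:

```python
import torch
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the process group has already been initialized, as in the DDP sketch above.
model = torchvision.models.resnet18(num_classes=100).cuda()
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)  # BatchNorm2d -> SyncBatchNorm
model = DDP(model, device_ids=[0])  # use the process's local rank here in practice
```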

I did not tune the hyperparameters carefully, so you may be able to improve performance by changing some settings (e.g., using the Adam optimizer).

| Dataset | Model | Test Loss | Top-1 Acc | Top-5 Acc | Batch Size | Method |
| --- | --- | --- | --- | --- | --- | --- |
| CIFAR-100 | ResNet-18 | 1.3728 | 70.99% | 91.57% | 128 | Single |
| CIFAR-100 | ResNet-18 | 1.3394 | 70.64% | 91.60% | 256 | Single |
| CIFAR-100 | ResNet-18 | 1.2974 | 71.48% | 91.65% | 128 | DataParallel (DP) |
| CIFAR-100 | ResNet-18 | 1.3373 | 71.20% | 91.53% | 256 | DataParallel (DP) |
| CIFAR-100 | ResNet-18 | 1.2268 | 71.17% | 91.84% | 256 | DistributedDataParallel (DDP) |
  • Results are averaged over random seeds 2, 4, and 42.
  • Automatic Mixed Precision (AMP) is applied in every experiment (see the sketch below).
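AMP here refers to PyTorch's native torch.cuda.amp, available since torch 1.6. Below is a minimal sketch of how it typically slots into a training step, assuming the standard autocast + GradScaler pattern; it is not copied from this repo's code.

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, images, targets, optimizer, criterion):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in mixed precision
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)            # unscales gradients, skips the step if they overflowed
    scaler.update()                   # adjust the loss scale for the next iteration
    return loss.item()
```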
