junyanz / pytorch-CycleGAN-and-pix2pix

Image-to-Image Translation in PyTorch

Multi-GPU speed?

gabgren opened this issue

Hi!

I was under the assumption that using multiple GPUs for training pix2pix would result in faster training, but this is not what I am experiencing. In fact, I get slower speeds; the best I can do is keep the s/it roughly the same as with 1 GPU.

For testing, I used batch_size 8 for a single GPU and batch_size 64 for 8 GPUs. Tests were done on 8x A6000 and 8x 3090. I have also tried setting norm to both instance and batch, with no effect.

What am I doing or getting wrong? Am I right to expect faster training with more GPUs, or is it that by using multiple gpu_ids I get to train at a higher resolution?

Thanks!

Could you check whether the GPU utilization is at 100%? It could be that the data loader does not feed training images fast enough. Another possibility is that progress in terms of the total number of images used for training is actually faster with more GPUs, but if you are monitoring the number of iterations, it won't look different.
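
As a quick check, you can time how long each iteration waits on the data loader versus how long the GPU compute itself takes. Below is a rough diagnostic sketch, assuming the `dataset` and `model` objects created by the repo's `create_dataset(opt)` and `create_model(opt)` helpers in `train.py` (the training script already prints a similar `data:` timing in its console log, if I recall correctly):

```python
import time
import torch

# Rough diagnostic only (not part of the repo): measure how long each iteration
# waits on the data loader vs. how long the GPU work takes. Assumes `dataset`
# and `model` come from the repo's create_dataset(opt) / create_model(opt)
# helpers, as in train.py.
iter_end = time.time()
for i, data in enumerate(dataset):
    t_data = time.time() - iter_end        # time spent waiting for the next batch

    model.set_input(data)                  # unpack the batch onto the GPU(s)
    model.optimize_parameters()            # forward + backward + optimizer step
    torch.cuda.synchronize()               # wait for GPU kernels before timing
    t_compute = time.time() - iter_end - t_data

    if i % 100 == 0:
        print(f"iter {i}: data wait {t_data:.3f}s, compute {t_compute:.3f}s")
    iter_end = time.time()
```

If the data-wait time dominates, adding more GPUs won't help until the loader itself is faster.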

Looks like it's your first theory: it takes a long time to feed the 8 GPUs. The actual processing seems to be faster, but everything stalls between iterations. See this comparison of GPU utilization for 1x A6000 vs. 8x A6000:
[Screenshots: GPU utilization, 1x A6000 vs. 8x A6000]

How can I speed this up?

It might be a data loading issue. You may want to use an SSD or another fast file system.
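
If loading is the bottleneck, raising `--num_threads` (which the repo passes to the DataLoader's `num_workers`) is usually the first thing to try, along with pinned memory. Here is a plain-PyTorch sketch of the relevant knobs, for illustration only; the repo builds its own loader in `data/__init__.py`, and `dataset` below stands in for any torch Dataset:

```python
from torch.utils.data import DataLoader

# Illustrative sketch only; the repo constructs its own loader inside
# data/__init__.py, but the same knobs apply.
loader = DataLoader(
    dataset,                  # placeholder: any torch Dataset, e.g. the repo's aligned dataset
    batch_size=64,
    shuffle=True,
    num_workers=16,           # more CPU workers to keep 8 GPUs fed (--num_threads in this repo)
    pin_memory=True,          # page-locked host memory for faster host-to-device copies
    persistent_workers=True,  # keep workers alive between epochs (PyTorch >= 1.7)
)
```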

@gabgren I have 4 GPUs and want to use all 4 of them for accelerated training at the same time. How can I modify the code? At present it only trains on one GPU and the training speed is very slow; --gpu_ids 0,1,2,3 does not work. Thank you!

What is your batch_size? By "does not work", do you mean (1) the model is only trained on one GPU, or (2) the model is trained on multiple GPUs, but training is not as fast as you expect?

@junyanz batch_size is 4. After setting --gpu_ids 0,1,2,3, the model is still only trained on one GPU.

This could be due to a limitation of nn.DataParallel, which we use here and which was a common approach when we published the repo. It does suffer from suboptimal GPU utilization because the data loading is inefficient. A better way would be to use DistributedDataParallel (link). We don't plan to support this for now, but if someone could create a PR, I'd appreciate it.
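
For anyone who wants to experiment before official support exists, here is a minimal sketch of the DistributedDataParallel pattern. It is not integrated with this repo; `build_model()`, `dataset`, and `n_epochs` are placeholders for the repo's model and dataset construction:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Minimal DDP sketch. Launch with: torchrun --nproc_per_node=4 train_ddp.py
# build_model(), dataset, and n_epochs are placeholders, not repo code.
def main():
    dist.init_process_group(backend="nccl")           # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model().cuda(local_rank)            # placeholder for netG/netD construction
    model = DDP(model, device_ids=[local_rank])

    sampler = DistributedSampler(dataset)             # each process sees its own shard of the data
    loader = DataLoader(dataset, batch_size=8, sampler=sampler,
                        num_workers=4, pin_memory=True)

    for epoch in range(n_epochs):
        sampler.set_epoch(epoch)                      # reshuffle shards every epoch
        for batch in loader:
            ...                                       # forward / backward / optimizer step

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With DDP each GPU runs in its own process and only gradients are synchronized, which generally scales much better than DataParallel's single-process scatter/gather.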