khanrc / swad

Official Implementation of SWAD (NeurIPS 2021)

How to train on multi-GPU?

shuangliumax opened this issue · comments

Hello, my hardware is limited: a single GPU does not have enough memory to train on the DomainNet dataset, so I need parallel multi-GPU training. However, it does not work when I wrap the algorithm as:
algorithm = torch.nn.DataParallel(algorithm, device_ids=range(torch.cuda.device_count()))
Do you have any good suggestions?
Thanks!

Most simply, you can reduce the batch size for DomainNet (e.g., use B=16 or B=24). This will affect the performance, but I think the impact is not that significant. Alternatively, if you want to use DataParallel, you will need to pull the model update code out of the algorithm.

Thank you for your reply. In fact, I implemented my own method in algorithm.py. If I use batch_size=16 or 24, wouldn't it be unfair compared to the other methods? Otherwise, I would need to re-run all methods at batch_size=16 or 24, which would be time-consuming and costly. So, to work around the insufficient memory of a single GPU, I tried to solve the problem with multiple GPUs, but every approach I tried failed.

I think you need to separate the model update code (including the loss backward and optimizer step) from the algorithm in order to use DataParallel. Since the codebase was not originally designed for multi-GPU training, several additional modifications may be needed.
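The separation described above can be sketched as follows. This is a minimal, hypothetical illustration, not the actual SWAD code: the class, attribute, and method names (`ERMAlgorithm`, `network`, `update`) are placeholders for a DomainBed-style algorithm. The key point is that only the forward pass (`self.network`) is wrapped in `nn.DataParallel`, while the loss computation, backward pass, and optimizer step stay outside the wrapped module, so they run once on the main device rather than being replicated per GPU.

```python
# Hypothetical sketch: wrap only the network in DataParallel and keep the
# model update (loss, backward, optimizer step) in the algorithm object,
# outside the module that DataParallel replicates.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ERMAlgorithm:
    """Minimal ERM-style algorithm; names are illustrative, not SWAD's API."""
    def __init__(self, in_dim, num_classes):
        self.network = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, num_classes)
        )
        # DataParallel splits each batch across the visible GPUs; on a
        # CPU-only or single-GPU machine we skip the wrapper entirely.
        if torch.cuda.device_count() > 1:
            self.network = nn.DataParallel(self.network)
        self.optimizer = torch.optim.Adam(self.network.parameters(), lr=1e-3)

    def update(self, x, y):
        # Only this forward call is parallelized; the backward and step
        # happen once here, so optimizer state is never replicated.
        logits = self.network(x)
        loss = F.cross_entropy(logits, y)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()

algo = ERMAlgorithm(in_dim=8, num_classes=3)
x = torch.randn(32, 8)
y = torch.randint(0, 3, (32,))
print(f"loss: {algo.update(x, y):.4f}")
```

With this structure, wrapping the whole algorithm (optimizer and all) in DataParallel is no longer necessary; only the data-parallel forward pass is distributed.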

Ok, thank you. I'll try.