khanrc / swad

Official Implementation of SWAD (NeurIPS 2021)

How to train on multi-GPU?

shuangliumax opened this issue · comments

Hello, my hardware is limited: a single GPU does not have enough memory to train on the DomainNet dataset, so I need parallel multi-GPU training. However, it does not work when I wrap the algorithm as:
algorithm = torch.nn.DataParallel(algorithm, device_ids=range(torch.cuda.device_count()))
Do you have any good suggestions?
Thanks!

Most simply, you can reduce the batch size for DomainNet (e.g., use B=16 or B=24). This will affect the performance, but I think the impact is not that significant. Alternatively, if you want to use DataParallel, you will need to pull the model update code out of the algorithm.

Thank you for your reply. In fact, I implemented my own method in algorithm.py. If I use batch_size=16 or 24, wouldn't it be unfair compared to the other methods? Otherwise, I would need to re-run all methods at batch_size=16 or 24, which would be time-consuming and costly. So, to work around the insufficient memory of a single GPU, I tried to solve the problem with multiple GPUs, but every approach I tried failed.

I think you need to separate the model update code (including the loss backward and optimizer step) from the algorithm in order to use DataParallel. Since the codebase was not originally designed for multi-GPU training, several additional modifications may be needed.
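The separation described above can be sketched as follows. This is a minimal, hypothetical illustration, not the actual SWAD code: the class, attribute, and method names (`ERMAlgorithm`, `network`, `update`) are placeholders for a DomainBed-style algorithm. The key point is that only the forward pass (`self.network`) is wrapped in `nn.DataParallel`, while the loss computation, backward pass, and optimizer step stay outside the wrapped module, so they run once on the main device rather than being replicated per GPU.

```python
# Hypothetical sketch: wrap only the network in DataParallel and keep the
# model update (loss, backward, optimizer step) in the algorithm object,
# outside the module that DataParallel replicates.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ERMAlgorithm:
    """Minimal ERM-style algorithm; names are illustrative, not SWAD's API."""
    def __init__(self, in_dim, num_classes):
        self.network = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, num_classes)
        )
        # DataParallel splits each batch across the visible GPUs; on a
        # CPU-only or single-GPU machine we skip the wrapper entirely.
        if torch.cuda.device_count() > 1:
            self.network = nn.DataParallel(self.network)
        self.optimizer = torch.optim.Adam(self.network.parameters(), lr=1e-3)

    def update(self, x, y):
        # Only this forward call is parallelized; the backward and step
        # happen once here, so optimizer state is never replicated.
        logits = self.network(x)
        loss = F.cross_entropy(logits, y)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()

algo = ERMAlgorithm(in_dim=8, num_classes=3)
x = torch.randn(32, 8)
y = torch.randint(0, 3, (32,))
print(f"loss: {algo.update(x, y):.4f}")
```

With this structure, wrapping the whole algorithm (optimizer and all) in DataParallel is no longer necessary; only the data-parallel forward pass is distributed.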

Ok, thank you. I'll try.