RenYurui / Global-Flow-Local-Attention

The source code for the paper "Deep Image Spatial Transformation for Person Image Generation"

Home Page: https://renyurui.github.io/GFLA-web

multi-GPU training?

PangzeCheung opened this issue

I set gpu_ids to 2,3, but the program only runs on GPU 2. Could you please tell me whether the code supports multi-GPU training? Thank you!

You can use torch.nn.DataParallel to train the model on multiple GPUs. See here

Specifically, if you want to train the pose-guided person image generation task, you can modify the __init__ function in pose_model.py by adding:

self.net_G = torch.nn.DataParallel(self.net_G, device_ids=self.gpu_ids)
self.net_D = torch.nn.DataParallel(self.net_D, device_ids=self.gpu_ids)
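For context, torch.nn.DataParallel replicates the wrapped module on every device in device_ids and splits the input batch along dimension 0, so each GPU processes batch_size / len(device_ids) samples. A minimal sketch of the pattern (ToyGenerator and the device ids below are placeholders for illustration, not classes from this repo):

    import torch
    import torch.nn as nn

    class ToyGenerator(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)

        def forward(self, x):
            # Under DataParallel, x is a per-GPU slice of the full batch.
            return self.conv(x)

    gpu_ids = [2, 3]                                   # placeholder device ids
    net_G = ToyGenerator().cuda(gpu_ids[0])            # parameters live on the first listed GPU
    net_G = nn.DataParallel(net_G, device_ids=gpu_ids)

    batch = torch.randn(8, 3, 64, 64).cuda(gpu_ids[0])
    out = net_G(batch)  # 4 samples run on GPU 2, 4 on GPU 3; outputs are gathered back on GPU 2

Note that DataParallel only scatters tensors passed as arguments to forward(); anything merely stored on the model object stays on whatever device it was moved to.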

Currently, only the face animation model supports multi-GPU training.
We will update the code soon.
Thanks for asking.

@RenYurui Thank you very much!

Hi @RenYurui,

Nice work!
It seems that even after I use DataParallel in pose_flownet with the line below, the model still uses a single GPU:
self.net_G = torch.nn.DataParallel(self.net_G, device_ids=self.gpu_ids)

It seems that all data is loaded only onto the first GPU in your code, as shown below:

            self.input_P1 = input_P1.cuda(self.gpu_ids[0], non_blocking=True)
            self.input_BP1 = input_BP1.cuda(self.gpu_ids[0], non_blocking=True)
            self.input_P2 = input_P2.cuda(self.gpu_ids[0], non_blocking=True)
            self.input_BP2 = input_BP2.cuda(self.gpu_ids[0], non_blocking=True)

I tried replacing the above with just .cuda(), but I am still not able to spread the batch data across multiple GPUs, and the first GPU runs out of memory when I use a larger batch size. Is it the case that your custom-built CUDA operations don't support multiple GPUs?

Thanks,
Bhavan
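For reference, the behavior described above is consistent with how DataParallel works: it scatters only the tensors passed as arguments to forward(), so inputs stored as attributes (self.input_P1 and friends) stay on gpu_ids[0] until they are handed to the wrapped network, and the first GPU additionally holds the gathered outputs, so some memory imbalance on it is normal even when scattering works. Custom CUDA extensions generally do run under DataParallel as long as their kernels launch on the device of the input tensors; whether this repo's ops do cannot be verified from the snippet above. A small probe can confirm whether the batch is actually being scattered (a debugging sketch; ProbeNet and the device ids are placeholders, not code from this repo):

    import torch
    import torch.nn as nn

    class ProbeNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(4, 4)

        def forward(self, x):
            # If DataParallel is scattering correctly, each replica prints a
            # different device and a fraction of the batch.
            print('replica device:', x.device, 'slice size:', x.shape[0])
            return self.fc(x)

    gpu_ids = [0, 1]  # placeholder device ids
    net = nn.DataParallel(ProbeNet().cuda(gpu_ids[0]), device_ids=gpu_ids)
    _ = net(torch.randn(8, 4).cuda(gpu_ids[0]))
    # expected output (order may vary):
    #   replica device: cuda:0 slice size: 4
    #   replica device: cuda:1 slice size: 4

If both lines report cuda:0 with the full batch size, the inputs are not reaching the DataParallel wrapper's forward() as arguments, which would explain the single-GPU behavior.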