google / compare_gan

Compare GAN code.

Distributed training

jppgks opened this issue · comments

Thanks for open sourcing the code for this awesome paper!

I’m wondering if you used distributed training for the different GAN models during experimentation. If so, could you share an example of how to launch a distributed training job with the compare_gan code?

commented

Hi Joppe,

The training of a single GAN is done on a single GPU (it's relatively fast for the architecture and datasets that we used).

We launched multiple experiments in parallel: first by running compare_gan_generate_tasks to create a set of experiments to run, then by running compare_gan_run_one_task on many machines (machine 0 with task_num=0, machine 1 with task_num=1, etc.).
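A minimal sketch of that workflow, assuming the compare_gan_generate_tasks / compare_gan_run_one_task console scripts installed by the package; the flag names (--workdir, --experiment, --task_num, --dataset_root) and paths here are from memory and should be double-checked against the repository README:

```sh
# Generate the task definitions for an experiment (directory names are illustrative).
compare_gan_generate_tasks --workdir=/tmp/results --experiment=test

# Machine 0: train and evaluate task 0.
compare_gan_run_one_task --workdir=/tmp/results --task_num=0 --dataset_root=/tmp/datasets

# Machine 1: train and evaluate task 1 (and so on, one task per machine).
compare_gan_run_one_task --workdir=/tmp/results --task_num=1 --dataset_root=/tmp/datasets
```

Each task is an independent training run with its own hyperparameters, so the machines never need to communicate with each other.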

@jppgks running multiple experiments in parallel is not the same as distributed training, unless hyperparameter optimization is the end goal. Is this what you mean by multiple tasks?

Note: we have since updated the framework, and it now supports distributed training (a single run spread across multiple machines) on TPUs.
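As a rough sketch of what launching the updated, gin-configured framework looks like: the compare_gan/main.py entry point and --model_dir/--gin_config flags exist in the newer code, but the TPU-related flags (--use_tpu, --tfds_data_dir) and the example config path below are assumptions to verify against the current README:

```sh
# Single-machine run with the updated framework (gin-configured).
python compare_gan/main.py \
  --model_dir=/tmp/gan_run \
  --gin_config=example_configs/resnet_cifar10.gin

# TPU run: --use_tpu, --tfds_data_dir and the GCS paths are assumptions;
# check compare_gan/main.py and the README for the exact TPU flags.
python compare_gan/main.py \
  --use_tpu \
  --tfds_data_dir=gs://my-bucket/tfds \
  --model_dir=gs://my-bucket/gan_run \
  --gin_config=example_configs/resnet_cifar10.gin
```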

@Marvin182 where can I find this in the code?