afiaka87 / clip-guided-diffusion

A CLI tool / Python module for generating images from text using guided diffusion and CLIP from OpenAI.


Multi GPU Support

rlallen-nps opened this issue

Any thoughts on building multi-GPU support via DataParallel?
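
Roughly what I have in mind, as a sketch: wrap the model in torch.nn.DataParallel so each batch is split across the visible GPUs. The model below is just a placeholder, not this repo's actual code.

```python
import torch
import torch.nn as nn

# Placeholder standing in for the guided-diffusion UNet; the real model
# comes from the guided_diffusion package and takes (x, t) the same way.
class DummyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 3, padding=1)

    def forward(self, x, t):
        return self.conv(x)

model = DummyUNet().cuda()

# DataParallel splits each input batch across GPUs and gathers the outputs
# on device 0; it raises batch throughput, not the resolution of one image.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

x = torch.randn(8, 3, 256, 256, device="cuda")
t = torch.zeros(8, dtype=torch.long, device="cuda")
out = model(x, t)
print(out.shape)
```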

@rlallen-nps Sounds interesting; unfortunately I don't have access to the compute, so I'm a bit lacking in motivation to get it working.

I'm not entirely certain how multi-GPU inference would work with this code base. It seems like a lot of work when there's presently a device argument which already lets you run inference on multiple GPUs, just toward different generations.
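
For example, you can already run one independent generation per GPU by pinning each process to a device. A rough sketch; the `cgd prompt` invocation here is a guess at the CLI, so substitute whatever command and flags the README actually documents:

```python
import os
import subprocess

prompts = [
    "an oil painting of a lighthouse",
    "a watercolor landscape of a forest",
]

procs = []
for gpu, prompt in enumerate(prompts):
    # Each child process only sees one GPU, so the device argument can stay
    # at its default; swap in the real CLI invocation for your setup.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    procs.append(subprocess.Popen(["cgd", prompt], env=env))

for p in procs:
    p.wait()
```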

Check out datacrunch.io for cheap GPUs. The point of distributing one run over multiple GPUs is not to process more images; it's to process one generation at a much higher resolution.

> Check out datacrunch.io for cheap GPUs.

I'm aware, but like I said, I'm not super motivated by this. If I were to rent one, it would be to max out settings on a single GPU, which is easily achievable on an RTX 3090 or an A100; but I have no intention at this time of spending money on this project. Fortunately, I do all the testing on my RTX 2070, which I don't have to pay rent for.

> The point of distributing one run over multiple GPUs is not to process more images; it's to process one generation at a much higher resolution.

The guided-diffusion checkpoints have harsh size constraints and must be trained from scratch for different sizes. The largest is the 512-pixel checkpoint (which Katherine Crowson fine-tuned to be unconditional).
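
To make that concrete: the model has to be constructed at the checkpoint's training resolution before the weights can even be loaded. A sketch using OpenAI's guided_diffusion helpers; the few config values shown are illustrative, the checkpoint filename is the one commonly distributed, and the remaining architecture hyperparameters also have to match the checkpoint exactly:

```python
import torch
from guided_diffusion.script_util import (
    create_model_and_diffusion,
    model_and_diffusion_defaults,
)

config = model_and_diffusion_defaults()
config.update({
    "image_size": 512,    # must equal the resolution the checkpoint was trained at
    "class_cond": False,  # Crowson's 512px fine-tune is unconditional
    "timestep_respacing": "250",
})
# NOTE: num_channels, attention_resolutions, learn_sigma, etc. must also match
# the checkpoint's training configuration; the defaults alone will not load it.

model, diffusion = create_model_and_diffusion(**config)
state = torch.load("512x512_diffusion_uncond_finetune_008100.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()
```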

The code for guided-diffusion (from OpenAI's fork) uses MPI for its distributed training, I think. If you wanted to increase the resolution of the generations, though, that's where I would go.
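
Roughly, guided-diffusion pins one GPU per MPI rank and then initializes torch.distributed on top of that. A sketch of that pattern using mpi4py, not a copy of their dist_util:

```python
import os
import torch
import torch.distributed as dist
from mpi4py import MPI

def setup_dist():
    """Pin each MPI rank to one GPU and init torch.distributed across ranks."""
    comm = MPI.COMM_WORLD
    rank, world_size = comm.Get_rank(), comm.Get_size()

    # One GPU per rank; launch with e.g. `mpiexec -n 2 python this_script.py`.
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Rank 0 chooses the rendezvous address and shares it with the other ranks.
    addr = comm.bcast("localhost" if rank == 0 else None, root=0)
    os.environ.setdefault("MASTER_ADDR", addr)
    os.environ.setdefault("MASTER_PORT", "29500")

    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

if __name__ == "__main__":
    setup_dist()
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} ready")
```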

I believe https://github.com/AranKomat/Diff-DALLE is also looking into training guided-diffusion with a transformer in the style of DALL-E. There will definitely be interesting developments from that repository in the coming months; I fully expect it to surpass this method, and for a few good checkpoints to be released as well.

> The guided-diffusion checkpoints have harsh size constraints and must be trained from scratch for different sizes. The largest is the 512-pixel checkpoint (which Katherine Crowson fine-tuned to be unconditional).

Ah, I see; I was still thinking in VQGAN mode. I'll let you know if I find anything interesting. Thanks for the Diff-DALLE recommendation!