PDillis / stylegan3-fun

Modifications of the official PyTorch implementation of StyleGAN3. Let's easily generate images and videos with StyleGAN2/2-ADA/3!

Training stalls when using multiple GPUs

nuclearsugar opened this issue

I have been struggling to use 2 GPUs when training. After executing the command below, everything loads as usual, but then it stalls when reaching the training step. When I run the same command with --gpus=1, it runs perfectly.
python train.py --outdir=results --cfg=stylegan2 --metrics=None --data=escher-512.zip --kimg=5000 --gamma=10 --gpus=2 --batch=32 --batch-gpu=8 --resume=stylegan2-ffhq-512x512.pkl
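For reference, the batch flags in the command above are consistent with each other, so the stall is unlikely to be a batch configuration problem. The sketch below is a standalone illustration of that arithmetic, assuming train.py requires --batch to be a multiple of --gpus times --batch-gpu and accumulates gradients over the remaining factor. The function name accumulation_rounds is hypothetical and not part of the repo.

```python
def accumulation_rounds(batch: int, gpus: int, batch_gpu: int) -> int:
    """Return how many gradient accumulation rounds one training step needs.

    Hypothetical helper: mirrors the divisibility rules assumed for the
    --batch, --gpus and --batch-gpu flags, not the repo's actual code.
    """
    if batch % gpus != 0:
        raise ValueError("--batch must be a multiple of --gpus")
    if batch % (gpus * batch_gpu) != 0:
        raise ValueError("--batch must be a multiple of --gpus times --batch-gpu")
    # Each round processes batch_gpu samples on every GPU at once.
    return batch // (gpus * batch_gpu)


print(accumulation_rounds(batch=32, gpus=2, batch_gpu=8))  # 2
print(accumulation_rounds(batch=32, gpus=1, batch_gpu=8))  # 4
```

Under these assumptions both the 1-GPU and 2-GPU runs are valid configurations; only the number of accumulation rounds differs.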

I'm not running out of VRAM (2x Quadro RTX 5000 16GB) or RAM (32GB). Here is a screenshot where you can see both GPUs at 0% load for an extended time:
[Screenshot: 2023-04-04 16_04_10-Greenshot]

I believe that both GPUs are set up correctly and StyleGAN2 should be able to use them both. Here is a screenshot after having run:
nvidia-smi
[Screenshot: 2023-04-04 16_07_56-Window]

I did some googling to see if anyone else has had a similar issue, and interestingly this recent issue on the original repository seems to describe my problem precisely. Yet when I tried the suggested fix, I still experienced the same problem: training stalls upon reaching the training step.

Am I missing some detail or is this a bug? Thanks!

I looked through the history of issues and here are 3 others with the same bug:

In prior tests I was relying on CUDA 11.1.

Since environment.yml lists CUDA 11.3, I thought it would be worth testing with the required CUDA version. It took some tinkering, but I was able to get CUDA 11.3 working with the latest version of this repo. I'm still seeing the same stalling behavior: it stalls with --gpus=2, but --gpus=1 runs smoothly.

I ran a few more tests where I set the CUDA_VISIBLE_DEVICES environment variable so that the StyleGAN training would only run on a specific GPU. From these I can confirm that both of my GPUs are set up correctly for use in Python.

Training runs smoothly on GPU0.
--- set CUDA_VISIBLE_DEVICES=0
--- python train.py --outdir=results --cfg=stylegan2 --metrics=None --data=escher-512.zip --kimg=5000 --gamma=10 --gpus=1 --batch=32 --batch-gpu=8 --resume=stylegan2-ffhq-512x512.pkl

Training runs smoothly on GPU1.
--- set CUDA_VISIBLE_DEVICES=1
--- python train.py --outdir=results --cfg=stylegan2 --metrics=None --data=escher-512.zip --kimg=5000 --gamma=10 --gpus=1 --batch=32 --batch-gpu=8 --resume=stylegan2-ffhq-512x512.pkl

Training stalls as described previously.
--- set CUDA_VISIBLE_DEVICES=0,1
--- python train.py --outdir=results --cfg=stylegan2 --metrics=None --data=escher-512.zip --kimg=5000 --gamma=10 --gpus=2 --batch=32 --batch-gpu=8 --resume=stylegan2-ffhq-512x512.pkl
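The three runs above can also be scripted so that each one gets its own CUDA_VISIBLE_DEVICES value without changing the shell state. This is a minimal sketch, assuming the same train.py flags as in the commands above. The helper build_run is hypothetical, and it only builds the command and environment rather than launching anything.

```python
import os

# Flags shared by all three runs, copied from the commands above.
BASE_ARGS = [
    "--outdir=results", "--cfg=stylegan2", "--metrics=None",
    "--data=escher-512.zip", "--kimg=5000", "--gamma=10",
    "--batch=32", "--batch-gpu=8",
    "--resume=stylegan2-ffhq-512x512.pkl",
]


def build_run(visible_devices, env=None):
    """Return (command, environment) for one test run.

    visible_devices is a string such as "0", "1" or "0,1"; the --gpus
    value is derived from how many device indices it lists.
    """
    gpus = len(visible_devices.split(","))
    cmd = ["python", "train.py", *BASE_ARGS, f"--gpus={gpus}"]
    run_env = dict(os.environ if env is None else env)
    run_env["CUDA_VISIBLE_DEVICES"] = visible_devices
    return cmd, run_env


for devices in ("0", "1", "0,1"):
    cmd, run_env = build_run(devices)
    print(f"CUDA_VISIBLE_DEVICES={devices}:", " ".join(cmd))
    # To actually launch a run: subprocess.run(cmd, env=run_env, check=True)
```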

I was finally able to get training to run successfully on 2 GPUs after following the directions in issue 218. It's a bit of a hack, but it works. FYI, I'm running Windows 10.

Would it be possible to implement a more permanent fix for this bug?

That is indeed a bit of a hack. I haven't encountered errors when training with multiple GPUs (RTX 6000 and A40s), so perhaps there's something else I'm missing. I'll try to figure it out, but if you can share more on your environment and such, that'd be helpful to narrow it down.

I saw a comment from a contributor on the StyleGAN3 codebase mentioning that they don't typically run multi-GPU setups on Windows, presumably using Linux instead. So I'm not sure how heavily it's been tested on Windows. The other issues linked above also mention using Windows, so that seems telling.
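One way to check whether the stall comes from multi-process setup on Windows rather than from the training loop itself is a standalone torch.distributed smoke test. The sketch below is not taken from the repo. It assumes a file-based rendezvous and the gloo backend (to my knowledge NCCL is not shipped in Windows builds of PyTorch), and the names file_init_method, _worker and run_smoke_test are hypothetical. It uses CPU tensors, so it only exercises process spawning, rendezvous and one collective call.

```python
import os
import tempfile


def file_init_method(init_file: str, windows: bool) -> str:
    """Build a file:// rendezvous URL for torch.distributed."""
    if windows:
        # Windows paths need forward slashes and a third slash before the drive.
        return "file:///" + init_file.replace("\\", "/")
    return "file://" + init_file


def _worker(rank: int, world_size: int, init_method: str, backend: str) -> None:
    # Imported lazily so the URL helper above works without PyTorch installed.
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend=backend, init_method=init_method,
                            rank=rank, world_size=world_size)
    value = torch.ones(1)   # CPU tensor; all_reduce sums it across ranks
    dist.all_reduce(value)
    print(f"rank {rank}: all_reduce gave {value.item()}, "
          f"expected {float(world_size)}")
    dist.destroy_process_group()


def run_smoke_test(world_size: int = 2) -> None:
    """Spawn world_size processes and run one all_reduce between them."""
    import torch.multiprocessing as mp

    with tempfile.TemporaryDirectory() as temp_dir:
        init_file = os.path.abspath(
            os.path.join(temp_dir, ".torch_distributed_init"))
        init_method = file_init_method(init_file, windows=(os.name == "nt"))
        mp.spawn(_worker, args=(world_size, init_method, "gloo"),
                 nprocs=world_size)
```

To try it, save the block as a script and call run_smoke_test() under an `if __name__ == "__main__":` guard, which Windows needs for spawned processes. If this also hangs, the problem is likely in the PyTorch or Windows setup rather than in the training code; if it completes, the stall is more likely specific to the training run.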

Below is some info about my environment setup and hardware. Let me know if you need any other details.

Software Environment

  • Windows 10 (21H2)
  • Visual Studio 2019
  • CUDA Toolkit 11.3
  • Instance running within Miniconda3-py39
  • Using the exact same dependencies as listed within environment.yml

Hardware

  • CPU: AMD Ryzen 5950X
  • GPUs: 2x Nvidia Quadro RTX 5000 16GB
  • RAM: 32GB
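For anyone else hitting this, the details above can be gathered with a short script so that reports are comparable. This is a minimal sketch under the assumption that PyTorch may or may not be importable in the active environment. The function name collect_environment is hypothetical, and the output is only a starting point for a bug report.

```python
import os
import platform
import sys


def collect_environment() -> dict:
    """Collect OS, Python and (if installed) PyTorch/CUDA details."""
    info = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"),
    }
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        info["torch"] = "<not installed>"
        return info

    info["torch"] = torch.__version__
    info["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    info["gpus"] = [torch.cuda.get_device_name(i)
                    for i in range(torch.cuda.device_count())]
    info["distributed_available"] = dist.is_available()
    if dist.is_available():
        # Which communication backends this PyTorch build supports.
        info["nccl_available"] = dist.is_nccl_available()
        info["gloo_available"] = dist.is_gloo_available()
    return info


for key, value in collect_environment().items():
    print(f"{key}: {value}")
```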