openai / improved-diffusion

Release for Improved Denoising Diffusion Probabilistic Models

Training stuck indefinitely

saswat0 opened this issue

I'm trying to train the diffusion model from scratch on a custom dataset (FFHQ), but the process hangs indefinitely. Here's the script I'm using to run the job:

export PYTHONPATH=.:$PYTHONPATH

MODEL_FLAGS="--image_size 256 --num_channels 128 --num_res_blocks 3"
DIFFUSION_FLAGS="--diffusion_steps 1000000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 64"

mpiexec -n 2 -verbose python scripts/image_train.py --data_dir ./data/padding_025 $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

There are no logs displayed either.

This fixed it:

export NCCL_P2P_DISABLE=1
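
For anyone hitting the same hang, here is a minimal sketch of how the fix might be combined with the launch script above. It assumes the stall happens while NCCL sets up direct GPU-to-GPU (peer-to-peer) transport, which the thread does not confirm. `NCCL_DEBUG=INFO` is an addition that is not in the original post; it is a standard NCCL setting that prints initialization logs, which may help since no output was shown otherwise. The launch command is only printed here, not executed.

```shell
# Assumption: the hang occurs during NCCL peer-to-peer transport setup.
# Disable direct GPU-to-GPU transfers (the fix reported in this thread).
export NCCL_P2P_DISABLE=1

# Not from the original post: ask NCCL to log its initialization and
# transport selection, so a stall is visible instead of silent.
export NCCL_DEBUG=INFO

# Launch command from the original post, with the flag variables omitted
# for brevity. It is printed rather than run in this sketch.
LAUNCH_CMD="mpiexec -n 2 python scripts/image_train.py --data_dir ./data/padding_025"

echo "NCCL_P2P_DISABLE=$NCCL_P2P_DISABLE NCCL_DEBUG=$NCCL_DEBUG"
echo "$LAUNCH_CMD"
```

Both variables need to be exported in the same shell that runs `mpiexec`, before the launch line. Disabling peer-to-peer transport can reduce inter-GPU bandwidth, so it may be worth re-enabling it once the underlying cause is known.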