Training stuck indefinitely
saswat0 opened this issue
Saswat Subhajyoti commented
I'm trying to train the diffusion model from scratch on a custom dataset (FFHQ), but the process hangs indefinitely. Here's the script I'm using to launch the job:
export PYTHONPATH=.:$PYTHONPATH
MODEL_FLAGS="--image_size 256 --num_channels 128 --num_res_blocks 3"
DIFFUSION_FLAGS="--diffusion_steps 1000000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 64"
mpiexec -n 2 -verbose python scripts/image_train.py --data_dir ./data/padding_025 $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS
No logs are printed either — the processes start and then sit there with no output.
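When a training launch hangs silently like this, a stack dump of the stuck worker usually shows where it is blocked. One minimal way to get one (this diagnostic is an assumption, not something from the original report) is Python's built-in faulthandler, which dumps every thread's stack on a signal:

```shell
# Sketch: register faulthandler in the training process so that
#   kill -USR1 <pid>
# prints all thread stacks to stderr, revealing where the hang is.
# (In practice you would add the two faulthandler lines near the top
# of scripts/image_train.py; this standalone snippet just demonstrates it.)
python - <<'EOF'
import faulthandler, signal, os

# Dump all thread stacks to stderr whenever SIGUSR1 arrives.
faulthandler.register(signal.SIGUSR1)
print(f"registered: send SIGUSR1 to pid {os.getpid()} for a stack dump")
EOF
```

Tools like `py-spy dump --pid <pid>` can do the same without modifying the script.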
Saswat Subhajyoti commented
This fixed it:
export NCCL_P2P_DISABLE=1
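For context: `NCCL_P2P_DISABLE=1` is a documented NCCL environment variable that forces NCCL to avoid GPU peer-to-peer transport, which can deadlock on hosts where P2P is misconfigured (e.g. IOMMU/ACS issues between GPUs). A sketch of the fixed launch, reusing the flags from the original script (the explanation of *why* it hangs is my reading, not confirmed in this thread):

```shell
# Disable NCCL peer-to-peer transport, then launch as before.
export NCCL_P2P_DISABLE=1
echo "NCCL_P2P_DISABLE=$NCCL_P2P_DISABLE"
# mpiexec -n 2 python scripts/image_train.py --data_dir ./data/padding_025 \
#   $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS
```

Setting it in the shell that runs `mpiexec` is enough, since the environment is inherited by the spawned ranks.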