Training stuck indefinitely
saswat0 opened this issue
Saswat Subhajyoti commented
I'm trying to train the diffusion model from scratch on a custom dataset (FFHQ), but the process hangs indefinitely. Here's the script I'm using to launch the job:
export PYTHONPATH=.:$PYTHONPATH
MODEL_FLAGS="--image_size 256 --num_channels 128 --num_res_blocks 3"
DIFFUSION_FLAGS="--diffusion_steps 1000000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 64"
mpiexec -n 2 -verbose python scripts/image_train.py --data_dir ./data/padding_025 $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS
No logs are printed either — the processes start and then sit there with no output.
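When a training launch hangs silently like this, a stack dump of the stuck worker usually shows where it is blocked. One minimal way to get one (this diagnostic is an assumption, not something from the original report) is Python's built-in faulthandler, which dumps every thread's stack on a signal:

```shell
# Sketch: register faulthandler in the training process so that
#   kill -USR1 <pid>
# prints all thread stacks to stderr, revealing where the hang is.
# (In practice you would add the two faulthandler lines near the top
# of scripts/image_train.py; this standalone snippet just demonstrates it.)
python - <<'EOF'
import faulthandler, signal, os

# Dump all thread stacks to stderr whenever SIGUSR1 arrives.
faulthandler.register(signal.SIGUSR1)
print(f"registered: send SIGUSR1 to pid {os.getpid()} for a stack dump")
EOF
```

Tools like `py-spy dump --pid <pid>` can do the same without modifying the script.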
Saswat Subhajyoti commented
This fixed it:
export NCCL_P2P_DISABLE=1
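For context: `NCCL_P2P_DISABLE=1` is a documented NCCL environment variable that forces NCCL to avoid GPU peer-to-peer transport, which can deadlock on hosts where P2P is misconfigured (e.g. IOMMU/ACS issues between GPUs). A sketch of the fixed launch, reusing the flags from the original script (the explanation of *why* it hangs is my reading, not confirmed in this thread):

```shell
# Disable NCCL peer-to-peer transport, then launch as before.
export NCCL_P2P_DISABLE=1
echo "NCCL_P2P_DISABLE=$NCCL_P2P_DISABLE"
# mpiexec -n 2 python scripts/image_train.py --data_dir ./data/padding_025 \
#   $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS
```

Setting it in the shell that runs `mpiexec` is enough, since the environment is inherited by the spawned ranks.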