openai / consistency_models

Official repo for consistency models.

“multi-gpu error” dist.all_gather(gathered_samples, sample) # gather not supported with NCCL

fikry102 opened this issue

mpiexec -n 8 python scripts/image_sample.py --batch_size 32 --training_mode consistency_distillation --sampler multistep --ts 0,62,150 --steps 151 --model_path ./ct_cat256.pt --attention_resolutions 32,16,8 --class_cond False --use_scale_shift_norm False --dropout 0.0 --image_size 256 --num_channels 256 --num_head_channels 64 --num_res_blocks 2 --num_samples 500 --resblock_updown True --use_fp16 True --weight_schedule uniform

"home/anaconda3/envs/consistency/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2433, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Cuda failure 'peer access is not supported between these two devices'

Traceback (most recent call last):
  File "scripts/image_sample.py", line 143, in <module>
    main()
  File "scripts/image_sample.py", line 91, in main
    dist.all_gather(gathered_samples, sample) # gather not supported with NCCL
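
The last NCCL message above ('peer access is not supported between these two devices') suggests that some GPU pairs on this machine cannot reach each other directly. A minimal diagnostic sketch follows, assuming PyTorch with CUDA is installed. The helper names `pairs_without_peer_access` and `env_with_p2p_disabled` are hypothetical and not part of the repo. `NCCL_P2P_DISABLE` is a documented NCCL environment variable, but whether setting it resolves this particular failure is an assumption the issue does not confirm.

```python
import os
from itertools import permutations


def pairs_without_peer_access(num_devices, can_access):
    """Return ordered (src, dst) device pairs for which can_access(src, dst) is False."""
    return [
        (src, dst)
        for src, dst in permutations(range(num_devices), 2)
        if not can_access(src, dst)
    ]


def env_with_p2p_disabled(base_env=None):
    """Return a copy of the environment with NCCL peer-to-peer transport disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_DISABLE"] = "1"  # assumption: falling back from P2P avoids the failing path
    return env


if __name__ == "__main__":
    try:
        import torch  # optional: only needed for the live check below
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        n = torch.cuda.device_count()
        # torch.cuda.can_device_access_peer(device, peer_device) reports P2P capability
        print(pairs_without_peer_access(n, torch.cuda.can_device_access_peer))
```

If the list is non-empty, one thing to try is rerunning the same `mpiexec` command with `NCCL_P2P_DISABLE=1` exported in the shell, or launching it through `subprocess.run(..., env=env_with_p2p_disabled())`.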