bytedance / ibot

iBOT :robot:: Image BERT Pre-Training with Online Tokenizer (ICLR 2022)

Home Page: https://arxiv.org/abs/2111.07832

checkpoint not saved by master

luuuyi opened this issue · comments

commented

As your code shows,

torch.save(save_dict, os.path.join(args.output_dir, 'checkpoint.pth'))

in DDP training, every process (GPU) saves a checkpoint to disk. This behavior can cause a duplicate-write problem, and the saved checkpoint may then fail to load with torch.load.
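
For context, a minimal sketch of the rank-0 guard that avoids the concurrent write (the `save_checkpoint` helper name here is illustrative, not from the repo):

```python
import os
import torch
import torch.distributed as dist

def save_checkpoint(save_dict, output_dir):
    # In DDP every process runs this code, so an unguarded torch.save
    # means several processes write to the same path at once and can
    # corrupt the checkpoint. Only rank 0 should write the file.
    if not dist.is_initialized() or dist.get_rank() == 0:
        torch.save(save_dict, os.path.join(output_dir, 'checkpoint.pth'))
```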

Hi,

Thanks for your pointer! Problem fixed.

Hi, could you please tell me how to fix this problem? Thanks so much!

Hi,

Check the following line:

utils.save_on_master(save_dict, os.path.join(args.output_dir, 'checkpoint.pth'))
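
For reference, helpers named `save_on_master` are typically thin rank-0 wrappers around `torch.save`; a minimal sketch of what such a helper usually looks like (the exact body in this repo may differ):

```python
import torch
import torch.distributed as dist

def is_main_process():
    # Non-distributed runs count as the main process.
    if not dist.is_available() or not dist.is_initialized():
        return True
    return dist.get_rank() == 0

def save_on_master(*args, **kwargs):
    # Forward to torch.save only on rank 0, so exactly one consistent
    # checkpoint file is written per save call.
    if is_main_process():
        torch.save(*args, **kwargs)
```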