bytedance / ibot

iBOT :robot:: Image BERT Pre-Training with Online Tokenizer (ICLR 2022)

Home Page: https://arxiv.org/abs/2111.07832

checkpoint not saved by master

luuuyi opened this issue · comments

commented

As your code shows,

torch.save(save_dict, os.path.join(args.output_dir, 'checkpoint.pth'))

in DDP training, every process (GPU) saves a checkpoint to disk. This behavior can cause a duplicate-write problem, and the saved checkpoint may then fail to load with torch.load.
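
For context, a minimal sketch of the rank-0 guard that avoids the concurrent write (the `save_checkpoint` helper name here is illustrative, not from the repo):

```python
import os
import torch
import torch.distributed as dist

def save_checkpoint(save_dict, output_dir):
    # In DDP every process runs this code, so an unguarded torch.save
    # means several processes write to the same path at once and can
    # corrupt the checkpoint. Only rank 0 should write the file.
    if not dist.is_initialized() or dist.get_rank() == 0:
        torch.save(save_dict, os.path.join(output_dir, 'checkpoint.pth'))
```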

Hi,

Thanks for your pointer! Problem fixed.

Hi, could you please tell me how to fix this problem? Thanks so much!

Hi,

Check the following line:

utils.save_on_master(save_dict, os.path.join(args.output_dir, 'checkpoint.pth'))
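
For reference, helpers named `save_on_master` are typically thin rank-0 wrappers around `torch.save`; a minimal sketch of what such a helper usually looks like (the exact body in this repo may differ):

```python
import torch
import torch.distributed as dist

def is_main_process():
    # Non-distributed runs count as the main process.
    if not dist.is_available() or not dist.is_initialized():
        return True
    return dist.get_rank() == 0

def save_on_master(*args, **kwargs):
    # Forward to torch.save only on rank 0, so exactly one consistent
    # checkpoint file is written per save call.
    if is_main_process():
        torch.save(*args, **kwargs)
```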