checkpoint not saved by master
luuuyi opened this issue · comments
As your code shows,
Line 358 in 3302b63
in DDP training every process (GPU) saves a checkpoint to disk. This behaviour may cause duplicate writes to the same file, and the saved checkpoint then cannot be loaded successfully by torch.load.
Hi,
Thanks for your pointer! Problem fixed.
Hi, could you please tell me how to fix this problem?
Thanks so much!
Hi,
Check the following line:
Line 358 in da316d8
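For readers who cannot follow the link, here is a minimal sketch of the usual remedy for the problem reported above: only the master process (rank 0) writes the checkpoint. This is not necessarily what the referenced line does. The helper names is_master and save_checkpoint are hypothetical, and the temporary-file rename and the barrier are extra precautions assumed here, not taken from the repository.

```python
import os

import torch
import torch.distributed as dist


def is_master() -> bool:
    """Return True on rank 0, or when torch.distributed is not in use."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank() == 0
    return True


def save_checkpoint(state: dict, path: str) -> None:
    """Write `state` to `path` from the master process only (hypothetical helper)."""
    if is_master():
        # Write to a temporary file first, then rename it into place,
        # so `path` never holds a partially written checkpoint.
        tmp_path = path + ".tmp"
        torch.save(state, tmp_path)
        os.replace(tmp_path, path)
    # In a distributed run, make the other ranks wait until rank 0 has finished writing.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
```

In a training loop, every rank would call save_checkpoint({"model": model.state_dict(), "epoch": epoch}, "checkpoint.pth"), where model and epoch stand for whatever the script already tracks. Only rank 0 touches the disk, so the file is written once.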