About validating
rxqy opened this issue · comments
Hi, thanks for the nice guide here. Really saves me lots of time.
I have another question about saving/loading the model and validating.
From the imagenet example > https://github.com/pytorch/examples/blob/master/imagenet/main.py
We only need to save the model once, on the rank 0 process only, right?
I also wrote a standalone script for validating on a single GPU (with batch size 1). Do we still need to wrap the model in distributed data parallel and convert it to use SyncBN?
Many thx!
Hi @rxqy,
Yes, for checkpointing you can save the weights on process rank 0 only; that works fine. What I do is keep a reference to the unwrapped model and use that one to save checkpoints, so the checkpoint loads without problems as a 'standalone' model.
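A minimal sketch of this first approach, assuming plain PyTorch. The helper name `save_checkpoint`, the tiny `nn.Linear` model, and the temporary path are made up for illustration, and the rank is passed in by hand here instead of being read from the process group:

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(model: nn.Module, path: str, rank: int) -> bool:
    """Save the weights of the unwrapped model, but only on rank 0."""
    if rank != 0:
        return False
    torch.save(model.state_dict(), path)
    return True


# Keep a reference to the plain model *before* wrapping it.
model = nn.Linear(4, 2)
# In the real training script the wrapper would be created here, e.g.
# wrapped = torch.nn.parallel.DistributedDataParallel(model)
# (omitted because it needs an initialized process group).

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
saved_by_rank1 = save_checkpoint(model, path, rank=1)  # skipped
saved_by_rank0 = save_checkpoint(model, path, rank=0)  # writes the file

# Standalone validation script: build the same architecture and load
# the weights directly, with no parallel wrapper involved.
standalone = nn.Linear(4, 2)
standalone.load_state_dict(torch.load(path, map_location="cpu"))
standalone.eval()
```

Because the weights come from the unwrapped model, the keys in the checkpoint match the standalone model's own parameter names.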
Likewise, you should be able to access the model inside the wrapper and use it to save a checkpoint:
mymodel = wrapped_parallel.module  # PyTorch's DistributedDataParallel exposes the inner model as .module
I prefer the first method.
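If a checkpoint was already saved through the wrapper instead, its keys normally carry a `module.` prefix, since PyTorch's DistributedDataParallel registers the inner model under that name. A rough sketch of stripping that prefix before loading into a standalone model follows; `strip_module_prefix` is a hypothetical helper, and the keys and values are placeholders rather than real tensors:

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


# Placeholder state dict as it might look when saved from the wrapped model.
wrapped_state = OrderedDict(
    [("module.conv1.weight", 1), ("module.bn1.running_mean", 2)]
)
clean_state = strip_module_prefix(wrapped_state)
```

The cleaned dictionary can then be passed to `load_state_dict` on the unwrapped model. Saving from the unwrapped reference avoids this step entirely.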
Cheers!
many thx!