About validating

Question

rxqy opened this issue 4 years ago · comments

Hi, thanks for the nice guide here. Really saves me lots of time.
I have another question about saving/loading the model and validating.

From the ImageNet example (https://github.com/pytorch/examples/blob/master/imagenet/main.py):
We only need to save the model once, on the rank 0 device only, right?
Also, I wrote a separate standalone script for validating on a single GPU (with batch size 1). Do we still need to wrap the model in DistributedDataParallel and convert it to use SyncBN?

Many thx!

rxqy commented 4 years ago

many thx!

Douglas Souza · Answer 1 · Tue Jun 23 2020 02:40:09 GMT+0800 (China Standard Time)

Hi @rxqy,

Yes, for checkpointing you can save the weights on the rank 0 process only; that works fine. What I do is keep a reference to the unwrapped model and use that one to save checkpoints. That way, when you load the checkpoint, it works fine as a 'standalone' model.

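Likewise, you should be able to access the model inside the wrapper and use it to save a checkpoint (with torch.nn.parallel.DistributedDataParallel, the wrapped model is exposed as the .module attribute):

mymodel = wrapped_parallel.module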

I prefer the first method.
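
A minimal sketch of the first method, in case it helps. The toy network and the names build_model, save_checkpoint, and load_standalone are hypothetical and only there for illustration. The DistributedDataParallel wrapping is left as a comment because it needs an initialized process group, so the rank is passed in by hand here:

```python
import os
import tempfile

import torch
import torch.nn as nn


def build_model() -> nn.Module:
    # Hypothetical toy network standing in for the real model.
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.BatchNorm2d(8),
        nn.ReLU(),
    )


def save_checkpoint(model: nn.Module, path: str, rank: int) -> bool:
    """Save the unwrapped model's weights, but only on rank 0."""
    if rank != 0:
        return False
    torch.save(model.state_dict(), path)
    return True


def load_standalone(path: str) -> nn.Module:
    """Rebuild the plain model and load the checkpoint, with no wrapper."""
    model = build_model()
    model.load_state_dict(torch.load(path, map_location="cpu"))
    return model.eval()


# Keep a reference to the unwrapped model before wrapping it.
model = build_model()
# In a real run (assumption: process group already initialized):
# ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
# Training would use ddp_model, which shares its parameters with `model`.

ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
saved_on_rank1 = save_checkpoint(model, ckpt_path, rank=1)  # skipped
saved_on_rank0 = save_checkpoint(model, ckpt_path, rank=0)  # written
restored = load_standalone(ckpt_path)
```

Because the state dict comes from the unwrapped model, its keys carry no "module." prefix, so the validation script can load it into a plain model directly.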

Cheers!