dougsouza / pytorch-sync-batchnorm-example

How to use Cross Replica / Synchronized Batchnorm in Pytorch


About validating

rxqy opened this issue

commented

Hi, thanks for the nice guide here. It really saves me a lot of time.
I have another question, about saving/loading the model and validating.

From the ImageNet example (https://github.com/pytorch/examples/blob/master/imagenet/main.py), we only need to save the model on the rank 0 device, right?
I also wrote a standalone script for validating on a single GPU (with batch size 1). Do we still need to wrap that model with distributed parallel and convert it to use SyncBN?

Many thx!

Hi @rxqy,

Yes, for checkpointing you can save the weights only on the rank 0 process; that works fine. What I do is keep a reference to the unwrapped model, and that is the model I use to save checkpoints. When you load it later, it works fine as a 'standalone' model.
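
In case it helps, here is a rough sketch of that setup (just an illustration using the native torch.nn.SyncBatchNorm conversion; MyModel, local_rank, and the checkpoint path are placeholders, not part of the guide):

import torch
import torch.distributed as dist

model = MyModel().cuda(local_rank)  # keep this unwrapped reference around
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
ddp_model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[local_rank], output_device=local_rank)

# ... train with ddp_model ...

# save the checkpoint only on rank 0, from the unwrapped reference
if dist.get_rank() == 0:
    torch.save(model.state_dict(), 'checkpoint.pth')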

Alternatively, you should be able to access the model inside the wrapper and use it to save a checkpoint:

mymodel = wrapped_parallel.module  # DistributedDataParallel exposes the wrapped model as .module

I prefer the first method.
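
Either way, for the standalone validation script you should not need DistributedDataParallel or the SyncBN conversion: as far as I can tell, SyncBatchNorm layers keep the same parameter and buffer names as regular BatchNorm, so the saved state_dict should load into a plain single-GPU model. Roughly (again with placeholder names):

import torch

model = MyModel()  # plain model, no SyncBN conversion, no DDP wrapper
state_dict = torch.load('checkpoint.pth', map_location='cpu')
model.load_state_dict(state_dict)
model.cuda().eval()  # validate on a single GPU; batch size 1 is fine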

Cheers!

commented

many thx!