bohaohuang / mrs

Models for Remote Sensing

Model loading failed when training paralleled

waynehuu opened this issue

An explanation & solution of batchnorm when training on multiple gpus:
https://github.com/dougsouza/pytorch-sync-batchnorm-example
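
For reference, the pattern that repo demonstrates boils down to converting the BatchNorm layers and training under DistributedDataParallel. A minimal sketch, assuming torch.distributed has already been initialized (e.g. via torchrun) and using a stand-in network rather than the one in this repo:

import torch.nn as nn

# Stand-in network; plain BatchNorm2d keeps per-GPU running statistics,
# while SyncBatchNorm shares batch statistics across processes.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)  # swap BatchNorm layers for SyncBatchNorm
local_rank = 0  # normally read from the launcher environment
model = model.cuda(local_rank)
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])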

I'll take a look at this, also here's the link for the model: https://drive.google.com/file/d/12Qr7SUhGTWugqJ9AvBEl4aDTDOvTEm-h/view?usp=sharing

#47 solves the optimizer issue when resuming training of a model
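
For context, resuming training generally means restoring the optimizer state along with the model weights. A generic sketch of that pattern; the file name and dictionary keys here are assumptions, not necessarily what network_utils actually uses:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)                             # stand-in network
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Save the optimizer state alongside the model weights ...
torch.save({'state_dict': model.state_dict(),
            'opt_dict': optimizer.state_dict()}, 'checkpoint.pth.tar')

# ... and restore both when resuming, so momentum buffers pick up where they
# left off instead of restarting from scratch.
ckpt = torch.load('checkpoint.pth.tar')
model.load_state_dict(ckpt['state_dict'])
optimizer.load_state_dict(ckpt['opt_dict'])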

This hasn't been fixed yet.

The problem is not the "module" prefix in the multi-GPU state_dict keys. Models trained on multiple GPUs can be loaded without errors, but they don't perform as they should, probably because batch normalization statistics are computed on each device separately and are not synchronized across devices. I tested this last week and the previous optimizer fix doesn't solve it.

A quick fix would be:
model.encoder = nn.DataParallel(model.encoder)
model.decoder = nn.DataParallel(model.decoder)
network_utils.load(model, ckpt_dir, disable_parallel=True)
But I think this is due to the DataParallel wrapping in the training process; let me investigate this a little bit.

I believe #53 has fixed the issue

When doing the evaluation, please load the model via:
network_utils.load(model, ckpt_dir)
instead of:
network_utils.load(model, ckpt_dir, disable_parallel=True)

This way the framework will try to wrap the model with nn.DataParallel instead of creating a matching key pattern to load the weights.
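
For illustration, the difference between the two loading strategies, sketched with a stand-in network (network_utils.load's internals aren't shown here, so this is only an approximation of what they do):

import torch.nn as nn

net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8))  # stand-in network
saved = nn.DataParallel(net).state_dict()  # multi-GPU checkpoints have keys like "module.0.weight"

# Strategy 1 (loading without disable_parallel): wrap the model the same way it was
# trained, so the prefixed keys match directly.
wrapped = nn.DataParallel(net)
wrapped.load_state_dict(saved)

# Strategy 2 (the key-matching route): strip the "module." prefix so a bare model
# accepts the weights.
stripped = {k.replace('module.', '', 1): v for k, v in saved.items()}
net.load_state_dict(stripped)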

I have tried it and it seems to have fixed the issue, but feel free to reopen if it does not solve your problem.

Also, one downside of the current fix is that it might not be backward compatible with previously trained multi-GPU models.

The new method could not distribute memory across multiple GPUs.

8b23932 should've fixed this issue:

  1. The encoder and decoder still need to be wrapped with DataParallel separately to enable memory distribution across GPUs.
  2. Model attributes need to be forwarded by a custom DataParallel class to avoid an OOM error at inference after loading the model (see the sketch after this list).
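
A minimal sketch of the attribute-forwarding idea in point 2; the class name AttrDataParallel is illustrative, not necessarily what the commit uses:

import torch.nn as nn

class AttrDataParallel(nn.DataParallel):
    """DataParallel that falls back to the wrapped module for unknown attributes."""
    def __getattr__(self, name):
        try:
            # Resolve parameters, buffers and submodules registered on the wrapper itself.
            return super().__getattr__(name)
        except AttributeError:
            # Otherwise forward the lookup to the wrapped model, so code that reads
            # attributes such as model.encoder keeps working after wrapping.
            return getattr(self.module, name)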

When training on multiple GPUs, the model can only be loaded onto gpu:0, not gpu:1, and most of the time I still get OOM errors.

288b2ef fixes this issue:
The GPU loading error is solved by setting the primary device properly for DataParallel; the OOM error seems like a CUDA bug that occurs rarely.
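
For anyone hitting the same thing, setting the primary device usually looks roughly like this; the device indices, checkpoint path, and key names below are assumptions for illustration, not the repo's actual code:

import torch
import torch.nn as nn

primary = 1                                   # hypothetical: make gpu:1 the primary device
device = torch.device(f'cuda:{primary}')

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8))  # stand-in network
ckpt = torch.load('model.pth.tar', map_location=device)  # keep checkpoint tensors off cuda:0
model.load_state_dict(ckpt['state_dict'])

# DataParallel expects the module to live on device_ids[0]; gathering outputs on the
# same device avoids everything silently piling up on cuda:0.
model = model.to(device)
model = nn.DataParallel(model, device_ids=[primary, 0], output_device=primary)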