torch / nngraph

Graph Computation for nn

Unexpected behavior in saving network as :float()

mbchang opened this issue · comments

Please see the issue posted in torch/torch7#711. I had accidentally submitted an issue for torch/nngraph there, and I'm not sure how to remove it. Thank you!

before loading the model, just execute the line:
require 'cunn'
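In context, the suggestion is that on a CUDA-capable machine, requiring cunn before loading registers the CUDA tensor types so the checkpoint can be deserialized. A minimal sketch (the filename `checkpoint.t7` is a hypothetical placeholder):

```lua
require 'nn'
require 'nngraph'
require 'cunn'  -- registers torch.CudaTensor so deserialization succeeds

-- load a checkpoint that was saved from a GPU session
local model = torch.load('checkpoint.t7')
```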

@soumith I do realize that if my computer has GPU capabilities, executing require 'cunn' lets me load the checkpoint, as mentioned in my post. However, I intend to load the checkpoint on a computer that does not have GPU capabilities, on which I can't install cunn in the first place. That point aside, I had expected that casting the network to Float type would remove any need for CUDA dependencies.

@soumith There is indeed a problem here.
I think the guilty guy is in this line. When we call :type(), the buffers are all cleared, but references to the old forward inputs are still there. The same might apply to the backward nodes.
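One way to check this claim is to scan the graph nodes after the cast and look for lingering CudaTensors. A sketch, assuming `model` is a gModule and using nngraph's `forwardnodes`/`data.input` fields (requires a CUDA session to actually observe stale tensors):

```lua
-- After model:float(), look for CUDA tensors still referenced by graph nodes.
for _, node in ipairs(model.forwardnodes) do
  local input = node.data.input
  if input then
    for i, t in ipairs(input) do
      if torch.type(t) == 'torch.CudaTensor' then
        print('stale CUDA input at node', node.id, 'slot', i)
      end
    end
  end
end
```

If any line is printed, torch.save would serialize those CudaTensors along with the model, which explains both the load failure on CPU-only machines and the inflated file size.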

Two workarounds for the moment: after converting to float and before saving, either do

  • :clearState(), or
  • run a forward/backward pass using float data.
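Both workarounds can be sketched as follows, assuming `model` is the converted gModule and `input` is a FloatTensor batch of the right shape (the filename is a hypothetical placeholder):

```lua
model:float()  -- cast parameters and buffers to FloatTensor

-- Workaround 1: drop all cached intermediate state, including stale
-- references to the old CUDA inputs
model:clearState()

-- Workaround 2 (alternative): overwrite the stale references by running
-- a pass with float data
-- local out = model:forward(input)
-- model:backward(input, out:clone():zero())

torch.save('model_float.t7', model)
```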

Another consequence is that the model is almost double the original size.

I think this should be addressed in nngraph though.

@fmassa I don't think that holds for type. What you are saying might actually be affecting clearState, though.
To check what you said, I added this assertion to tests, and it passes.
08d0b5d

@mbchang reproduced your issue. I am looking into it.

Thanks @soumith, that would be a great help!

@soumith the test passes because forward is run after the model conversion. Without the forward it would fail, I think.
But my workaround didn't work for networks containing cudnn modules that were converted to nn; I had to recreate the modules and copy the parameters by hand.
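The manual recreation described above might look like the following sketch for a convolution layer. The variable names are illustrative, and it assumes the cudnn layer exposes the same constructor fields as its nn counterpart (which cudnn.torch mirrors):

```lua
-- `old` is the cudnn.SpatialConvolution to replace (hypothetical handle)
local old = cudnnConv
local new = nn.SpatialConvolution(old.nInputPlane, old.nOutputPlane,
                                  old.kW, old.kH, old.dW, old.dH,
                                  old.padW, old.padH)

-- copy learned parameters over, casting to float
new.weight:copy(old.weight:float())
new.bias:copy(old.bias:float())
```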

Ohh I see, OK, that makes sense. Looking into it.

I just realized that the line my comment was pointing to is off. I was referring to this line, which I'll copy here to avoid further ambiguity:

child.data.input[mapindex] = x
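A minimal illustration (not nngraph's actual code) of why this assignment is the culprit: storing the forward input in the node's table keeps the original CUDA tensor alive, and a later type cast of the module's own buffers never touches it.

```lua
-- requires a CUDA-capable machine to run
local node = { data = { input = {} } }
local x = torch.CudaTensor(10)
node.data.input[1] = x  -- node now holds a reference to the CUDA tensor

-- Casting the model's parameters/buffers to float does not rewrite
-- node.data.input[1], so torch.save on the graph would still serialize
-- a CudaTensor.
```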

@mbchang @fmassa fixed it via #126 . Reinstall nngraph and it should be fixed.