QUVA-Lab / e2cnn

E(2)-Equivariant CNNs Library for Pytorch

Home Page: https://quva-lab.github.io/e2cnn/

Very large model files when using equivariance

drewm1980 opened this issue · comments

I'm seeing some surprisingly large model files when using equivariance. I am taking advantage of the (very convenient!) TrivialOnR2 fallback for comparisons between my non-equivariant and my equivariant model. This means all of the code is the same aside from the layer (and tensor) types.
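For context, the pattern looks roughly like this (a minimal sketch with made-up block contents and field counts, not my actual U-Net):

from e2cnn import gspaces, nn

def make_block(gs, fields_in, fields_out, k=3):
    # The same code builds both models; only the gspace changes.
    # With regular representations the channel count is fields * |G|.
    ft_in = nn.FieldType(gs, [gs.regular_repr] * fields_in)
    ft_out = nn.FieldType(gs, [gs.regular_repr] * fields_out)
    return nn.SequentialModule(
        nn.R2Conv(ft_in, ft_out, kernel_size=k, padding=k // 2),
        nn.InnerBatchNorm(ft_out),
        nn.ReLU(ft_out, inplace=True),
    )

# Non-equivariant baseline vs. equivariant model: swap the gspace, nothing else.
baseline_block = make_block(gspaces.TrivialOnR2(), 8, 16)
equivariant_block = make_block(gspaces.FlipRot2dOnR2(N=4), 8, 16)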

Sizes of saved model files:

TrivialOnR2(): 18M
FlipRot(4): 733M

That's a ~40X difference in size, and I get an out-of-memory error when I try to load the model onto my local machine's 2 GB GPU.

I was under the impression that there was no extra cost for equivariance at inference time. Even with a naive guess that you pay linearly for the equivariance, I would expect FlipRot(4) to be about 8X bigger.

I unzipped the model file (the size was unchanged, so I guess it wasn't actually compressed).

These are the largest files in the "data" subdirectory of the archive:

144.0 MiB [##########] 93996286070688
135.2 MiB [######### ] 93996315383952
103.6 MiB [####### ] 93996297647440
36.0 MiB [## ] 93996309406176
36.0 MiB [## ] 93996301356576
31.6 MiB [## ] 93996287309120
24.8 MiB [# ] 93996405222784
23.0 MiB [# ] 93996315305440
23.0 MiB [# ] 93996283837952
15.5 MiB [# ] 93996319486144
15.5 MiB [# ] 93996405222048
15.5 MiB [# ] 93996309417568
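
Those names are presumably just the internal storage keys torch.save writes inside the zip archive, so they don't say which tensors they belong to. A more direct way to see where the bytes go is to load the checkpoint and print per-tensor sizes; a minimal sketch, assuming the file holds a plain state_dict (the "model.pt" path is made up):

import torch

# Load on the CPU and list the largest tensors by storage size.
state = torch.load("model.pt", map_location="cpu")
if hasattr(state, "state_dict"):  # in case the whole module was pickled
    state = state.state_dict()

tensors = {k: v for k, v in state.items() if torch.is_tensor(v)}
for name, t in sorted(tensors.items(), key=lambda kv: kv[1].numel(), reverse=True)[:12]:
    mib = t.numel() * t.element_size() / 2**20
    print(f"{mib:8.1f} MiB  {tuple(t.shape)}  {name}")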

Running my model through (a slightly hacked) pytorch-model-summary:

For TrivialOnR2:

                 Layer (type)      Param #   Tr. Param #
========================================================
              UNetConvBlock-1        1,856         1,856
PointwiseMaxPoolAntialiased-2            0             0
              UNetConvBlock-3        9,856         9,856
PointwiseMaxPoolAntialiased-4            0             0
              UNetConvBlock-5       44,288        44,288
PointwiseMaxPoolAntialiased-6            0             0
              UNetConvBlock-7      186,880       186,880
PointwiseMaxPoolAntialiased-8            0             0
              UNetConvBlock-9      766,976       766,976
               UNetUpBlock-10      447,488       447,488
               UNetUpBlock-11      125,440       125,440
               UNetUpBlock-12       29,952        29,952
               UNetUpBlock-13        6,784         6,784
                    R2Conv-14          208           208

Total params: 1,619,728
Trainable params: 1,619,728
Non-trainable params: 0

For FlipRot(4):
In [4]: s = summary(model, dummy_input, show_input=True, print_summary=True, show_hierarchical=False)

                 Layer (type)      Param #   Tr. Param #
========================================================
              UNetConvBlock-1       12,608        12,608
PointwiseMaxPoolAntialiased-2            0             0
              UNetConvBlock-3       74,368        74,368
PointwiseMaxPoolAntialiased-4            0             0
              UNetConvBlock-5      345,344       345,344
PointwiseMaxPoolAntialiased-6            0             0
              UNetConvBlock-7    1,477,120     1,477,120
PointwiseMaxPoolAntialiased-8            0             0
              UNetConvBlock-9    6,099,968     6,099,968
               UNetUpBlock-10    3,558,400     3,558,400
               UNetUpBlock-11      992,768       992,768
               UNetUpBlock-12      234,240       234,240
               UNetUpBlock-13       51,584        51,584
                    R2Conv-14          208           208

Total params: 12,846,608
Trainable params: 12,846,608
Non-trainable params: 0

There actually are ~8X more trainable parameters than with TrivialOnR2, so I was probably making an unfair comparison; to be fair, my TrivialOnR2 model should get 8X more channels.

So the parameters only account for about 51 MB, leaving roughly 682 MB of the file unexplained. Do you have a guess what mistake I might be making to cause that bloat? Thanks!
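
(For reference, the arithmetic behind the 51 MB / 682 MB figures, assuming 4-byte float32 weights:)

params = 12_846_608
weight_mb = params * 4 / 1e6        # ≈ 51.4 MB of float32 parameters
print(weight_mb, 733 - weight_mb)   # ≈ 51 MB accounted for, ≈ 682 MB unexplained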

Hi @drewm1980 ,

would it be possible to share the code you used to define your models? That would make it easier for me to understand the architecture.

Be careful: if you preserve the number of fields in each layer (e.g. you always write something like ft = FieldType(gs, [gs.regular_repr]*8) regardless of which group is used in gs), the computational cost of your conv layers will grow quadratically with the group size.
This is because the size of each feature grows linearly with the group size, while the cost of a convolution/linear layer grows quadratically with the feature sizes.
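
To make the counting concrete: with the number of fields held fixed, the channels of each feature grow linearly with the group order, so the multiply-add cost grows quadratically, while the number of trainable parameters grows only roughly linearly (the kernel constraint ties weights together), which matches the ~8x you measured. A small sanity check you could run (the field counts here are made up, not your architecture):

from e2cnn import gspaces, nn

def r2conv_params(gs, fields_in=8, fields_out=8, k=3):
    # Same number of *fields* for every group; with regular representations
    # the channel count is fields * |G|, so it grows with the group order.
    ft_in = nn.FieldType(gs, [gs.regular_repr] * fields_in)
    ft_out = nn.FieldType(gs, [gs.regular_repr] * fields_out)
    conv = nn.R2Conv(ft_in, ft_out, kernel_size=k)
    return sum(p.numel() for p in conv.parameters() if p.requires_grad)

# |G| = 1 for the trivial group vs |G| = 2*4 = 8 for FlipRot2dOnR2(4):
# the second count should come out roughly 8 times the first.
print(r2conv_params(gspaces.TrivialOnR2()))
print(r2conv_params(gspaces.FlipRot2dOnR2(N=4)))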

The fact that you have ~8 times more params in the FlipRot2dOnR2(4) model makes me think your features are actually ~8 times larger.
Is that correct?

In general, it's usually not necessary to scale the feature size linearly with the group size.
Sometimes, if your group is not too large, it is not even necessary to scale up the feature size at all, which preserves the original computational cost.
To keep the number of channels more or less fixed, you can do something like this:

# "gs" is your gspace, e.g. FlipRotOnR2(4)

C = 64
c = int(C / gs.fibergroup.order())
ft = FieldType(gs, [gs.regular_repr]*c)
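
For reference, FieldType.size gives the total number of channels of the feature type; with this construction it stays at the original budget regardless of the group. A quick self-contained check:

from e2cnn import gspaces, nn

C = 64
for gs in (gspaces.TrivialOnR2(), gspaces.FlipRot2dOnR2(N=4)):
    c = int(C / gs.fibergroup.order())
    ft = nn.FieldType(gs, [gs.regular_repr] * c)
    # The total channel count stays at 64; only the number of fields changes.
    print(gs.fibergroup.order(), c, ft.size)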

I would try to start with a model where you don't scale up the number of channels, and then gradually increase it if necessary to get better performance.

Let me know if this helps

Best,
Gabriele

Thanks for the response! I'll try to work up a minimized example I can share, but it might be a week or two before I'm back on the relevant project. This may just be expected behavior; I'll try increasing the non-equivariant network's channels to match (8X everywhere) and re-compare the sizes. If they're comparable, then I just wasn't aware enough of the cost of scaling up the equivariance group.

Yeah, your explanation was spot on; I was indeed just counting incorrectly. Sorry for the slow reply!