yangsenius / TransPose

PyTorch Implementation for "TransPose: Keypoint localization via Transformer", ICCV 2021.

Paper: https://github.com/yangsenius/TransPose/releases/download/paper/transpose.pdf


Why does a model with a small number of parameters require so much memory?

FlyuZ opened this issue · comments

I found that this model has far fewer parameters and far fewer operations than HRNet, yet the memory it occupies during training is particularly large. Why is this? Is it a characteristic of ViT?
Thank you for your answer.

Hi, @FlyuZ. The number of parameters of this model is smaller than HRNet's, but its computation and memory usage are usually larger. You are right that this can be attributed to the characteristics of the Transformer. Self-attention computes pairwise inner products between all input positions, which needs only a few weight parameters but produces an N×N attention map (plus the corresponding intermediate activations) that must be kept in memory for the backward pass, so memory grows quadratically with the sequence length. A CNN mainly computes matrix multiplications between the input and the convolution kernel weights, so its activation memory grows only linearly with the input size.
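
For intuition, here is a minimal PyTorch sketch (not taken from this repository; the 64×48 feature-map size and `d_model = 96` are illustrative assumptions) comparing the parameter count of one self-attention layer with a 3×3 convolution, and the size of the N×N attention map that dominates memory:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: a 64x48 feature map flattened to N tokens of width d_model.
d_model = 96
N = 64 * 48                       # 3072 tokens
x = torch.randn(1, N, d_model)    # (batch, tokens, channels)

attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=1, batch_first=True)
conv = nn.Conv2d(d_model, d_model, kernel_size=3, padding=1)

def num_params(m):
    return sum(p.numel() for p in m.parameters())

# Weight memory: both are small and independent of N.
print("attention params:", num_params(attn))   # ~4 * d_model^2
print("conv params:     ", num_params(conv))   # ~9 * d_model^2

# Activation memory: the attention map alone is N x N per head per layer.
out, attn_map = attn(x, x, x, need_weights=True)
print("attention map shape:", attn_map.shape)  # (1, 3072, 3072) ~ 9.4M floats
print("conv-style activation size:", N * d_model)  # ~0.3M floats, linear in N
```

With these illustrative numbers, the attention layer holds fewer weights than the convolution, but a single attention map is roughly 30× larger than a convolutional activation of the same feature map, and this gap widens quadratically as the input resolution grows.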