microsoft / mup

maximal update parametrization (µP)

Home Page: https://arxiv.org/abs/2203.03466

Are parameters with no "infinite" dimensions allowed?

callumm-graphcore opened this issue

Hi,

Is it valid to have parameters that have no "infinite" dimensions? This line suggests that it is, but I can't find anything in the paper that explains how this case should be dealt with.

With thanks,
Callum

Hi Callum,

Yes, it's possible to have parameters with only finite dimensions. For example, given a finite output dimension d_out, the bias vector for the last layer will have dimension 1 x d_out.

Thanks Edward! Is there a part of the paper that explains what the correct scaling is in this case? Would this apply even if you had a linear layer where neither the input nor the output dimension was scaled?

The bias example I gave is covered under "input weights & biases" in Tables 3, 8, and 9, and it uses a constant init and LR (neither is scaled with width).

Yes, it also applies to a linear layer where neither the input nor the output dimension is scaled. We might not have talked about it specifically in the paper since it's less common, but you should use a constant init and LR there too.
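
To make the rule above concrete, here is a minimal sketch in plain Python. It does not use the mup package: the helper names (`infinite_dims`, `scaling_rule`) and the example shapes are hypothetical. It assumes a dimension counts as "infinite" when its size differs between two widths of the same model, and only encodes what was said in this thread: a parameter with no such dimension gets a constant init and LR.

```python
def infinite_dims(base_shape, wider_shape):
    """Return the indices of dimensions whose size changes with model width.

    Assumption for this sketch: a dimension is "infinite" exactly when its
    size differs between a base model and a wider copy of the same model.
    """
    if len(base_shape) != len(wider_shape):
        raise ValueError("shapes must have the same number of dimensions")
    return [i for i, (b, w) in enumerate(zip(base_shape, wider_shape)) if b != w]


def scaling_rule(base_shape, wider_shape):
    """Describe how a parameter should be treated, per the discussion above."""
    if not infinite_dims(base_shape, wider_shape):
        # No dimension grows with width: nothing to scale.
        return "constant init and LR"
    # Width-dependent cases are covered by the tables in the paper.
    return "width-dependent (see the paper's tables)"


# Hypothetical parameter shapes: (shape in base model, shape in wider model).
example_params = {
    "hidden.weight": ((256, 256), (512, 512)),    # both dimensions scale
    "readout.bias": ((10,), (10,)),               # finite d_out only
    "fixed_linear.weight": ((10, 10), (10, 10)),  # neither dimension scales
}

for name, (base, wider) in example_params.items():
    print(f"{name}: {scaling_rule(base, wider)}")
```

Both the last-layer bias and the fully fixed linear layer fall into the first branch, which matches the two cases asked about here.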

Ah, OK, I see now. Thank you very much!