microsoft / mup

maximal update parametrization (µP)

Home Page: https://arxiv.org/abs/2203.03466

interpreting coord checks

llucid-97 opened this issue

Hi there, I'm working on a Flax port of this, and I'm trying to use the coord check scripts on a variant of your MLP example to see if I've done it right. I'm struggling to interpret the results, though:

[Figure: coord check, standard parametrization (SP), MLP trained with SGD]
[Figure: coord check, μP, MLP trained with SGD]

The point I'm confused about is the green line in the μP graph at step 1: if I understood your paper correctly, this should be a flat line, right?
Looking through my code, I can't spot a mistake, so I must ask: is my assumption about step 1 of the coord check wrong?

Hi! Does the green curve correspond to the last layer? If so, this is expected. It is for a related reason that we recommend initializing the last layer weights to zero.

Aah, I see. Yes, it is the last layer. Thanks!
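
For anyone landing here with the same question, the exchange above can be illustrated with a small sketch. It is not the mup package's code or the coord check script. It assumes a μP-style readout whose weights are drawn with standard deviation 1/fan_in, and it uses i.i.d. standard Gaussian values as a stand-in for the last hidden layer's activations. The function name `readout_mean_abs` and its parameters are hypothetical. Under those assumptions, the average size of the readout output at initialization shrinks as width grows, so that curve is not flat at step 1, while a zero-initialized readout gives exactly zero at every width.

```python
import random


def readout_mean_abs(width, n_samples=200, seed=0, zero_init=False):
    """Average |output| of one readout unit at initialization.

    Assumptions (illustrative, not taken from the mup source):
    - hidden activations have Theta(1) coordinates, modeled as N(0, 1);
    - readout weights are drawn with std = 1 / fan_in, or set to zero.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Stand-in for the last hidden layer's activations.
        h = [rng.gauss(0.0, 1.0) for _ in range(width)]
        if zero_init:
            w = [0.0] * width
        else:
            # Assumed muP-style readout init: std scales like 1 / fan_in.
            w = [rng.gauss(0.0, 1.0 / width) for _ in range(width)]
        total += abs(sum(wi * hi for wi, hi in zip(w, h)))
    return total / n_samples


if __name__ == "__main__":
    for width in (64, 256, 1024):
        random_init = readout_mean_abs(width)
        zero = readout_mean_abs(width, zero_init=True)
        print(f"width={width:5d}  random init: {random_init:.4f}  zero init: {zero:.4f}")
```

With the random initialization, the printed value should fall by roughly a factor of two each time the width is quadrupled, which is the downward slope seen for the last layer at step 1. With zero initialization the value is the same at every width, which is one way to see why zero-initializing the last layer makes the step 1 curve flat.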