microsoft / mup

maximal update parametrization (µP)

Home Page: https://arxiv.org/abs/2203.03466

Once the best HPs have been found, does the final model have to be trained with `mup`, or can one just use the found HPs and train the model in a standard way?

ricomnl opened this issue