Ablate on initialization
mitchellnw opened this issue
Mitchell Wortsman commented
Interested in ablating:
A) changing `layer_id + 1` to `args.num_layers`.
B) removing the line `std = std / math.sqrt(2 * (layer_id + 1))`.
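
To make the two options concrete, here is a minimal sketch of how they compare. It is an illustration, not the repository's code: the function name `init_std`, the `variant` argument, and the example values (a base std of 0.02 and 24 layers) are hypothetical. Only the scaling expression and the names `layer_id` and `num_layers` come from this issue, and the sketch assumes option A refers to the `layer_id + 1` term in that same scaling line.

```python
import math


def init_std(base_std: float, layer_id: int, num_layers: int,
             variant: str = "baseline") -> float:
    """Return the init std for one layer under each ablation variant.

    Hypothetical helper for illustration only.
    """
    if variant == "baseline":
        # Current behavior: scale shrinks with the depth of this layer.
        return base_std / math.sqrt(2 * (layer_id + 1))
    if variant == "A":
        # Option A: same scale for every layer, set by total depth.
        return base_std / math.sqrt(2 * num_layers)
    if variant == "B":
        # Option B: no depth-dependent scaling at all.
        return base_std
    raise ValueError(f"unknown variant: {variant!r}")


if __name__ == "__main__":
    base_std, num_layers = 0.02, 24  # assumed example values
    for layer_id in (0, num_layers // 2, num_layers - 1):
        row = {v: init_std(base_std, layer_id, num_layers, v)
               for v in ("baseline", "A", "B")}
        print(layer_id, row)
```

Under these assumptions, the baseline and option A agree only at the last layer (`layer_id == num_layers - 1`). Option A gives every earlier layer a smaller std than the baseline does, while option B gives every layer a larger one.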
Achal Dave commented
related: #225
Achal Dave commented
We tested #225 at 1B scale, and unfortunately it seems to hurt downstream evals significantly.