Question: attn_head_scale with use_scalenorm
pfeatherstone opened this issue · comments
Am I right in thinking that using use_scalenorm == True together with attn_head_scale == True is pointless, since ScaleNorm will undo a learned multiplicative scalar parameter like the one attn_head_scale introduces?
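The invariance the question relies on can be checked directly: ScaleNorm rescales each vector to a fixed norm (times a learned scalar), so any positive scalar multiplier applied beforehand is absorbed. A minimal sketch, assuming a standard ScaleNorm definition (this `scale_norm` is illustrative, not the library's actual implementation):

```python
import numpy as np

def scale_norm(x, g=1.0, eps=1e-5):
    # ScaleNorm: normalize the last dim to unit L2 norm, times a learned scalar g
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return g * x / np.maximum(norm, eps)

x = np.random.randn(4, 16)

# a positive learned scalar multiplier is absorbed by the normalization
assert np.allclose(scale_norm(x), scale_norm(3.0 * x))
```

Note this shows invariance to a single global scalar; whether per-head scalars applied before the heads are merged are fully undone is exactly what the answer below addresses.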
yea, the attention head scaling actually came from the NormFormer paper, and is applied to each output head of attention, before the linear combination of the merged heads
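A minimal NumPy sketch of what that per-head scaling does, under assumed shapes (all names here are hypothetical, not the library's code): one learned scalar per head multiplies that head's output before the heads are merged for the output projection.

```python
import numpy as np

# hypothetical shapes: batch, heads, sequence length, dim per head
b, h, n, d = 2, 4, 8, 16
attn_out = np.random.randn(b, h, n, d)      # per-head attention output

# one learned scalar per head, typically initialized to 1,
# applied BEFORE the heads are merged and linearly combined
head_scale = np.ones((1, h, 1, 1))
scaled = attn_out * head_scale              # broadcast per-head scaling

# merge heads; the usual output projection would follow
merged = scaled.transpose(0, 2, 1, 3).reshape(b, n, h * d)
print(merged.shape)  # (2, 8, 64)
```

Because each head gets its own scalar before merging, this changes the relative weighting of heads, which a single post-merge norm does not trivially reduce to one global factor.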
i actually saw instabilities when i last tried it, and nobody else i know is using it. perhaps i should remove it. these days, i favor projecting the original input to those heads and gating the output that way
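A plausible sketch of the gating approach described above, with all shapes and parameter names hypothetical: the original layer input is projected to one gate value per head per position, and each head's output is multiplied by a sigmoid of that projection.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# hypothetical shapes: batch, heads, sequence length, dim per head
b, h, n, d = 2, 4, 8, 16
x = np.random.randn(b, n, h * d)            # original input to the layer
attn_out = np.random.randn(b, h, n, d)      # per-head attention output

# hypothetical learned projection from the input to one gate per head
w_gate = np.random.randn(h * d, h) * 0.02
gates = sigmoid(x @ w_gate)                 # (b, n, h), each in (0, 1)

# gate each head's output with its own input-dependent gate
gated = attn_out * gates.transpose(0, 2, 1)[..., None]
print(gated.shape)  # (2, 4, 8, 16)
```

Unlike a bare learned scalar, the gate here depends on the input, so it cannot be folded into a subsequent normalization.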