Question: attn_head_scale with use_scalenorm
pfeatherstone opened this issue · comments
Am I right in thinking that using use_scalenorm == True together with attn_head_scale == True is pointless, since ScaleNorm will undo a learned multiplicative scalar parameter like the one attn_head_scale introduces?
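The invariance the question relies on can be checked directly: ScaleNorm rescales each vector to a fixed norm (times a learned scalar), so any positive scalar multiplier applied beforehand is absorbed. A minimal sketch, assuming a standard ScaleNorm definition (this `scale_norm` is illustrative, not the library's actual implementation):

```python
import numpy as np

def scale_norm(x, g=1.0, eps=1e-5):
    # ScaleNorm: normalize the last dim to unit L2 norm, times a learned scalar g
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return g * x / np.maximum(norm, eps)

x = np.random.randn(4, 16)

# a positive learned scalar multiplier is absorbed by the normalization
assert np.allclose(scale_norm(x), scale_norm(3.0 * x))
```

Note this shows invariance to a single global scalar; whether per-head scalars applied before the heads are merged are fully undone is exactly what the answer below addresses.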
yea, the attention head scaling actually came from the NormFormer paper, and is applied to each output head of attention, before the linear combination of the merged heads
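A minimal NumPy sketch of what that per-head scaling does, under assumed shapes (all names here are hypothetical, not the library's code): one learned scalar per head multiplies that head's output before the heads are merged for the output projection.

```python
import numpy as np

# hypothetical shapes: batch, heads, sequence length, dim per head
b, h, n, d = 2, 4, 8, 16
attn_out = np.random.randn(b, h, n, d)      # per-head attention output

# one learned scalar per head, typically initialized to 1,
# applied BEFORE the heads are merged and linearly combined
head_scale = np.ones((1, h, 1, 1))
scaled = attn_out * head_scale              # broadcast per-head scaling

# merge heads; the usual output projection would follow
merged = scaled.transpose(0, 2, 1, 3).reshape(b, n, h * d)
print(merged.shape)  # (2, 8, 64)
```

Because each head gets its own scalar before merging, this changes the relative weighting of heads, which a single post-merge norm does not trivially reduce to one global factor.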
i actually saw instabilities when i last tried it, and nobody else i know is using it. perhaps i should remove it. these days, i favor projecting the original input to those heads and gating the output that way
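A plausible sketch of the gating approach described above, with all shapes and parameter names hypothetical: the original layer input is projected to one gate value per head per position, and each head's output is multiplied by a sigmoid of that projection.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# hypothetical shapes: batch, heads, sequence length, dim per head
b, h, n, d = 2, 4, 8, 16
x = np.random.randn(b, n, h * d)            # original input to the layer
attn_out = np.random.randn(b, h, n, d)      # per-head attention output

# hypothetical learned projection from the input to one gate per head
w_gate = np.random.randn(h * d, h) * 0.02
gates = sigmoid(x @ w_gate)                 # (b, n, h), each in (0, 1)

# gate each head's output with its own input-dependent gate
gated = attn_out * gates.transpose(0, 2, 1)[..., None]
print(gated.shape)  # (2, 4, 8, 16)
```

Unlike a bare learned scalar, the gate here depends on the input, so it cannot be folded into a subsequent normalization.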