LayerNorm after the SS2D
ydhongHIT opened this issue
Hi, is applying an LN layer after the SSM a default setting in Mamba? If not, are there any ablation experiments on the function of the LN layer?
Our original motivation for adding the LN was to avoid collapse during training. The output of S6 is often too large to be represented in float16 (it overflows to inf and eventually produces NaN).
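For reference, a minimal sketch (using numpy as a stand-in for the PyTorch op; the scale factor and shapes are illustrative, not from the actual model) of why normalizing the SSM output keeps it representable in float16:

```python
import numpy as np

np.random.seed(0)

def layer_norm(x, eps=1e-5):
    # Normalize over the channel (last) dimension, as an LN placed
    # after the SSM would.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Simulate an S6 output whose magnitude exceeds the float16 range (~65504).
x = np.random.randn(4, 8).astype(np.float32) * 1e5
assert np.isinf(x.astype(np.float16)).any()   # overflows without LN

y = layer_norm(x)
assert np.isfinite(y.astype(np.float16)).all()  # normalized values fit in fp16
```

After normalization the activations are on unit scale regardless of how large the raw SSM output grows, so the subsequent cast to half precision no longer overflows.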
Thank you for your reply. By the way, did you try the Layerscale? I guess it may help mitigate the overfitting of large models.
Thank you for your advice, we'll try it in the future.
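For anyone following along, a minimal sketch of the LayerScale idea mentioned above (per-channel learnable scaling of the residual branch, initialized small, as in CaiT; the dimensions and init value here are illustrative assumptions):

```python
import numpy as np

np.random.seed(0)

class LayerScale:
    """Per-channel learnable scale on the residual branch, initialized
    to a small value so each block starts close to the identity."""
    def __init__(self, dim, init_value=1e-5):
        self.gamma = np.full(dim, init_value, dtype=np.float32)

    def __call__(self, x):
        # Scale each channel of the branch output before the residual add.
        return x * self.gamma

dim = 8
ls = LayerScale(dim)
x = np.random.randn(4, dim).astype(np.float32)
branch = np.random.randn(4, dim).astype(np.float32)

out = x + ls(branch)
# With the small init, the branch contributes almost nothing at first.
assert np.allclose(out, x, atol=1e-3)
```

Because `gamma` is learnable, training can gradually turn each block's contribution up, which is the property that tends to stabilize very deep or large models.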