About the squash operator of long convolution
ahmed-fau opened this issue · comments
Hi,
Thanks for sharing this nice work. I have one question regarding the squashing operator in the forward method of the long convolution module:
safari/standalone_long_convs.py
Line 71 in 02220c6
Doesn't this lead to completely zero kernels as training continues over many steps? In other words, doesn't it keep eroding the kernel weights until they reach zero everywhere after, e.g., 1M training steps? Is this correct?
Thanks!
Great question!
The key is that this operator is applied during the forward pass, not iteratively during backprop.
So, for example, if your kernel weights are [1.0, 0.2, 0.1, 0.7] and your kernel_lam is 0.3, then your kernel is [0.7, 0.0, 0.0, 0.4] after the squash operator. If that kernel is good for your task (e.g., gives low loss), then great! The original weights won't need to move from there.
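The behavior described above can be sketched in plain Python. This is a minimal illustration, assuming the squash operator is soft-thresholding (shrink each weight's magnitude by the threshold, zeroing anything at or below it); the actual module applies the same idea to a torch tensor using kernel_lam:

```python
def squash(kernel, lam):
    """Soft-threshold each kernel weight in the forward pass:
    reduce its magnitude by lam, and zero it if |w| <= lam.
    Plain-Python sketch of the operator discussed above."""
    out = []
    for w in kernel:
        mag = abs(w) - lam
        out.append(0.0 if mag <= 0 else (mag if w > 0 else -mag))
    return out

# The example from above: weights [1.0, 0.2, 0.1, 0.7] with lam = 0.3
# give (approximately) [0.7, 0.0, 0.0, 0.4].
print(squash([1.0, 0.2, 0.1, 0.7], 0.3))
```

Note that the original weights are left untouched; only the value used inside the forward pass is thresholded, which is why repeated application doesn't keep eroding them.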
The weights would only go toward zero if, during training, the gradient encourages the original weights to become smaller. So in that way it works exactly as you would expect SGD to work!
Thanks again for the clear and prompt reply!