HazyResearch / safari

Convolutions for Sequence Modeling

About the squash operator of long convolution

ahmed-fau opened this issue · comments

Hi,
Thanks for sharing this nice work. I have one question regarding putting the squashing operator in the forward method of the long convolution module:

```python
k = F.relu(torch.abs(k) - self.kernel_lam) * torch.sign(k)
```

Doesn't this lead to completely zero kernels as training continues for many steps? In other words, don't you keep eroding the kernel weights until they reach zero everywhere after, e.g., 1M training steps? Is this correct?

Thanks!

Great question!

The key is that this operator is applied during the forward pass, not iteratively during backprop.

So, for example, if your kernel weights are [1.0, 0.2, 0.1, 0.7] and your kernel_lam is [0.3], then your kernel becomes [0.7, 0.0, 0.0, 0.4] after the squash operator. If that kernel is good for your task (e.g. low loss), then great! The original weights won't need to move away from those values.
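Here is a minimal sketch of that example (the values of `k` and `kernel_lam` are taken from the numbers above; the actual module stores `kernel_lam` internally, which is not shown here):

```python
import torch
import torch.nn.functional as F

# Stored kernel weights and the squash threshold from the example above.
k = torch.tensor([1.0, 0.2, 0.1, 0.7], requires_grad=True)
kernel_lam = 0.3

# Soft-thresholding ("squash") applied on the fly in the forward pass;
# the stored parameter k itself is never overwritten.
k_squashed = F.relu(torch.abs(k) - kernel_lam) * torch.sign(k)
print(k_squashed)  # tensor([0.7000, 0.0000, 0.0000, 0.4000], grad_fn=<MulBackward0>)
```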

The weights only move toward zero if, during training, the gradient encourages the original weights to get smaller. So in that way it works exactly like you would expect SGD to work!
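A small follow-up sketch (the toy loss and learning rate here are made up purely for illustration) showing that entries already below the threshold receive zero gradient through the inactive ReLU, so plain SGD leaves them alone rather than eroding them further:

```python
import torch
import torch.nn.functional as F

k = torch.tensor([1.0, 0.2, 0.1, 0.7], requires_grad=True)
kernel_lam = 0.3
opt = torch.optim.SGD([k], lr=0.1)

for step in range(3):
    opt.zero_grad()
    k_squashed = F.relu(torch.abs(k) - kernel_lam) * torch.sign(k)
    # Toy loss on the squashed kernel, standing in for the task loss.
    loss = (k_squashed ** 2).sum()
    loss.backward()
    opt.step()
    print(step, k.detach())

# The entries below the threshold (0.2 and 0.1) get zero gradient through the
# inactive ReLU, so SGD never pushes them further toward zero; only the
# above-threshold entries move, and only because the loss asks them to.
```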

Thanks again for the clear and prompt reply!