orobix / fwdgrad

Implementation of "Gradients without backpropagation" paper (https://arxiv.org/abs/2202.08587) using functorch

accuracy of forwardgrad isn't as good as regular backprop

ilonadem opened this issue

Hi, really cool implementation!

I noticed that when I run your examples, both models converge, but the accuracy of the forward grad method is always worse than that of regular backprop. The paper mentions that the accuracy of the forward gradient should be pretty comparable or identical to that of backpropagation. Is this behavior expected?

I was able to improve things marginally by having the model draw several random perturbations and averaging the resulting estimates for each parameter update (since the average is likelier to point in the direction of the true gradient), but I was never able to replicate backprop performance.

Hi @ilonadem, thank you for your kind words. Regarding your issue:

> I noticed that when I run your examples, both models converge, but the accuracy of the forward grad method is always worse than that of regular backprop. The paper mentions that the accuracy of the forward gradient should be pretty comparable or identical to that of backpropagation. Is this behavior expected?

Yes, they should be pretty comparable, although that is something we have never measured. To check this, we can add some functions that evaluate the trained model and report the results. If you already have something, you can also open a PR :)

> I was able to improve things marginally by having the model draw several random perturbations and averaging the resulting estimates for each parameter update (since the average is likelier to point in the direction of the true gradient), but I was never able to replicate backprop performance.

Yes, in our example we estimate the expected value with only one sample, and the more samples you use, the more precise the estimate becomes. This would also be useful to have in our examples: we could expose the number of samples used in the estimate as a hydra parameter. Again, feel free to open a PR :)
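For anyone who wants to see the effect of the sample count in isolation, here is a minimal sketch. It is not the code from this repo: it uses plain Python and a toy quadratic loss whose directional derivative is written out by hand, where the repo would obtain it from a forward-mode JVP. The names `loss_and_jvp`, `forward_gradient`, `mean_squared_error` and the parameter `num_samples` are made up for illustration. Each estimate is (∇f · v) v with v drawn from a standard normal, and `num_samples` such estimates are averaged.

```python
import random


def loss_and_jvp(theta, v, target):
    """Toy loss f(theta) = 0.5 * sum((theta_i - target_i)^2).

    Returns f(theta) and the directional derivative grad f(theta) . v,
    both accumulated in a single forward sweep (hand-written stand-in
    for a forward-mode JVP; an assumption of this sketch).
    """
    loss = 0.0
    dir_deriv = 0.0
    for th, vi, t in zip(theta, v, target):
        diff = th - t
        loss += 0.5 * diff * diff
        dir_deriv += diff * vi  # tangent of 0.5 * diff^2 along v
    return loss, dir_deriv


def forward_gradient(theta, target, num_samples, rng):
    """Average num_samples estimates (grad f . v) * v with v ~ N(0, I)."""
    estimate = [0.0] * len(theta)
    for _ in range(num_samples):
        v = [rng.gauss(0.0, 1.0) for _ in theta]
        _, d = loss_and_jvp(theta, v, target)
        for i, vi in enumerate(v):
            estimate[i] += d * vi / num_samples
    return estimate


def mean_squared_error(theta, target, num_samples, trials, rng):
    """Mean squared distance between the estimate and the exact gradient."""
    true_grad = [th - t for th, t in zip(theta, target)]
    total = 0.0
    for _ in range(trials):
        g = forward_gradient(theta, target, num_samples, rng)
        total += sum((gi - ti) ** 2 for gi, ti in zip(g, true_grad))
    return total / trials


if __name__ == "__main__":
    rng = random.Random(0)
    theta, target = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
    for k in (1, 4, 16):
        err = mean_squared_error(theta, target, k, 200, rng)
        print(f"num_samples={k:2d}  mean squared error={err:.3f}")
```

On this toy problem the error of the averaged estimate shrinks as `num_samples` grows, which matches the explanation above. It says nothing about how many samples would be needed to match backprop accuracy on the actual examples; that would still have to be measured.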