ganguli-lab / Synaptic-Flow


some questions regarding implementation + theory

zoharbarzelay opened this issue · comments

commented

Hi danielkunin!

general

Thanks a lot for making this code available. It is so neat, easy to run, and so easy to get around. That's really great work!

questions

I have a couple of questions; I'd appreciate it if you could help me out with them:

intuition for Eq. 5

Could you please provide some intuition regarding your paper's Eq. 5?
[Eq. 5: R_SF(θ) = 1^T ( ∏_{l=1}^{N} |θ^[l]| ) 1]

Specifically, you're doing the following steps (roughly sketched in code below):

  • you temporarily set all the network's weights to their absolute (positive) values
  • you feed an all-ones image into the network
  • you sum up the model's output and compute the gradient with respect to it
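As far as I understand, this corresponds roughly to the following PyTorch sketch (the function names and details here are mine for illustration, not the repo's exact implementation):

```python
import torch

@torch.no_grad()
def make_weights_positive(model):
    # Step 1: temporarily replace every weight with its absolute value,
    # remembering the original signs so they can be restored afterwards.
    signs = {}
    for name, param in model.state_dict().items():
        signs[name] = torch.sign(param)
        param.abs_()
    return signs

@torch.no_grad()
def restore_signs(model, signs):
    for name, param in model.state_dict().items():
        param.mul_(signs[name])

def synflow_scores(model, input_shape, device="cpu"):
    signs = make_weights_positive(model)

    # Step 2: feed an all-ones "image" through the now all-positive network.
    # (Very deep networks may need double precision here to avoid overflow.)
    ones = torch.ones([1] + list(input_shape), device=device)
    output = model(ones)

    # Step 3: the "loss" is the sum of the output; each parameter's saliency
    # is |(gradient of that loss w.r.t. the parameter) * parameter|.
    output.sum().backward()
    scores = {n: (p.grad * p).abs() for n, p in model.named_parameters()}

    restore_signs(model, signs)
    model.zero_grad()
    return scores
```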

My questions:

  • can you provide some intuition regarding this? Why all-ones? Why positive weights? Why is the loss the sum of the response?
  • previous methods use input data; why don't you use it too?

bug-fix in SNIP code

in your "snip with mask pruner" commit, you fixed the SNIP code to look at the gradients, and not at the weights' values.
My question:
did you re-evaluate your results? did it affect the superiority of your method?

Thanks!
Z.

Hi Z,

Intuition on Synaptic Flow Loss

The three steps you noticed in our code (positive weights, all-ones input, summed output) are done to exactly match the mathematical equation you posted above. Making the weights positive corresponds to the element-wise absolute value on the parameters, and the all-ones input and the summed output correspond to the inner and outer products with the all-ones vector.

The intuition for the equation is that it sums the value of every path from input pixel (i) to output class (j), where the value of a path is the product of the absolute values of the parameters along the path. Thus, a parameter with a higher synaptic flow score is included in more paths, in paths with higher values, or both. Intuitively, removing the parameters with the smallest scores should therefore have the least impact on the network's initialization and the functions it can learn.
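As a quick sanity check of this path-sum picture (a toy example for illustration, not code from our repo), here is a two-layer linear network where the autograd-computed score exactly equals the explicit sum over all paths through each parameter:

```python
import torch

torch.manual_seed(0)
W1 = torch.randn(3, 4, requires_grad=True)   # hidden x input
W2 = torch.randn(2, 3, requires_grad=True)   # output x hidden

# Synaptic flow loss: R = 1^T |W2| |W1| 1
R = (W2.abs() @ W1.abs() @ torch.ones(4)).sum()
R.backward()

# Synaptic saliency of W1[h, i] via autograd: |dR/dW1[h, i] * W1[h, i]|
score_autograd = (W1.grad * W1).abs()

# The same score as a sum over paths: every path through W1[h, i] continues
# to some output unit j, contributing |W2[j, h]| * |W1[h, i]|.
score_paths = W2.abs().sum(dim=0).unsqueeze(1) * W1.abs()

print(torch.allclose(score_autograd, score_paths))  # True
```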

Why we don't use input data

The main motivation for ignoring the data was that our analysis demonstrated that iterations are essential for a pruning algorithm. Thus, in order to avoid the multiplicative computational cost of going through the data on each iteration, we decided to construct a synaptic saliency score that is data-agnostic. The second motivation is that, by not looking at the data, SynFlow can be understood as a "sparse initialization" found via pruning. We think that the success of SynFlow points to a direction for future work on improving how we initialize neural networks.
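Concretely, each pruning round only ever needs a single forward/backward pass on an all-ones input, never a pass over the dataset. Here is a rough sketch of such a loop (it reuses a scoring function like the synflow_scores sketch in your post; the exponential schedule and names here are illustrative simplifications, not our exact code):

```python
import torch

def synflow_prune(model, input_shape, compression, n_iters=100):
    # Data-free iterative pruning sketch: every round costs one
    # forward/backward pass on an all-ones input, never a pass over data.
    params = dict(model.named_parameters())
    masks = {n: torch.ones_like(p) for n, p in params.items()}

    for k in range(1, n_iters + 1):
        # Zero out already-pruned weights before scoring the survivors.
        with torch.no_grad():
            for n, p in params.items():
                p.mul_(masks[n])

        scores = synflow_scores(model, input_shape)  # see sketch above
        flat = torch.cat([(scores[n] * masks[n]).flatten() for n in params])

        # Exponential schedule: keep a fraction compression**(-k / n_iters)
        # of all weights this round, ending at 1 / compression overall.
        keep = int(flat.numel() * compression ** (-k / n_iters))
        threshold = torch.kthvalue(flat, flat.numel() - keep + 1).values
        masks = {n: (scores[n] * masks[n] >= threshold).float() for n in params}

    return masks
```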

Implementation changes

Since we published the original pre-print there have been a couple of code changes. Most of these have been cosmetic, to make things cleaner. Two of the larger changes have been to the scoring functions for GraSP and SNIP, as you noticed. The change to GraSP was to resolve implementation discrepancies between our code base and the original implementation. The change to SNIP was so we could run an iterative version of SNIP correctly; for a single iteration the two implementations are identical. We have re-run the results, and neither update changed the overall empirical conclusions in our paper. We are currently reproducing the final empirical figure and will post an updated pre-print soon with a discussion of these changes.

commented

Thanks for taking the time to answer in so much detail; I appreciate the effort.
Again, congrats on the readable, easy-to-reproduce code, and on the great paper behind it!

@danielkunin This is a very interesting method. Have you tried it on GANs? Using the same Eq. 5, the input noise would be an all-ones vector, and we would sum all the values of the output image. If we want to apply this method to GANs, do you have any suggestions?

Thank you for your help.