jxbz / signSGD

Code for the signSGD paper

Home Page: https://arxiv.org/abs/1802.04434

Where is signSGD performed?

manishadubey91 opened this issue · comments

I am unable to figure out where exactly the sign of the gradient is taken into consideration (except in the toy example).

Hi @manishadubey91, sorry this is unclear. You have to pass in the optimiser as a command line argument. For example:

python train_resnet.py --optim signum --lr 0.0001 --wd 0.00001

This works because Signum was implemented in the MXNet deep learning framework (see this page). I can also share PyTorch code for the optimiser if that would help.
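For reference, here is a minimal PyTorch sketch of the Signum update (an exponential moving average of gradients, followed by a step in the sign of that average). The class and argument names are illustrative, not the code used for the paper:

import torch

class Signum(torch.optim.Optimizer):
    # Sketch of Signum: keep a momentum buffer and step in the direction of its sign.
    def __init__(self, params, lr=1e-4, momentum=0.9, weight_decay=0.0):
        defaults = dict(lr=lr, momentum=momentum, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr, beta, wd = group['lr'], group['momentum'], group['weight_decay']
            for p in group['params']:
                if p.grad is None:
                    continue
                g = p.grad
                if wd != 0:
                    g = g.add(p, alpha=wd)  # fold L2 weight decay into the gradient
                buf = self.state[p].setdefault('momentum_buffer', torch.zeros_like(p))
                buf.mul_(beta).add_(g, alpha=1 - beta)  # exponential moving average of gradients
                p.add_(torch.sign(buf), alpha=-lr)      # step in the sign of the momentum
                # note: torch.sign maps exact zeros to 0, i.e. the ternary behaviour discussed below

With this sketch, the command above would correspond roughly to Signum(model.parameters(), lr=0.0001, weight_decay=0.00001).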

This is the implementation you're referring to, right? https://github.com/apache/incubator-mxnet/blob/f70c7b7b1e246e32e322ba059f8bf0e5d01a22be/src/operator/optimizer_op-inl.h#L2303

It seems to be using 2 bits: (-1, 0, 1).

Hi @amitport, you're right and thanks for pointing this out. In this paper, we used an implementation of the sign function that quantised positive gradients to +1, negative gradients to -1, and zero gradients to 0. I think this was done at the time under the (naïve) assumption that a gradient component being exactly zero was unlikely to occur in practice. I'm planning to run some experiments to test if/how much this makes a difference to convergence, and will report back.
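Concretely, that ternary mapping is what torch.sign gives by default (an illustrative snippet, not the training code):

import torch

g = torch.tensor([-0.3, 0.0, 2.1])
print(torch.sign(g))  # tensor([-1., 0., 1.]) -- exact zeros stay at zero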

Hi @amitport, I tested the difference between the version that sends sign(0) --> 0 and the version that sends sign(0) --> ±1 at random. The tests and results are in this Jupyter notebook. At least for training ResNet-18 on CIFAR-10, there was little difference between the two implementations.
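The random-tie-breaking version used in that comparison can be written in a couple of lines (a sketch for illustration, not the notebook code):

import torch

def sign_random_zero(g):
    # sign(0) -> +1 or -1 with equal probability; nonzero entries behave like torch.sign
    random_signs = torch.where(torch.rand_like(g) < 0.5, -torch.ones_like(g), torch.ones_like(g))
    return torch.where(g == 0, random_signs, torch.sign(g))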

That being said, in the distributed experiments in the ICLR 2019 paper, we used an implementation of the sign function that maps sign(0) --> +1 deterministically. So if this issue still bothers you (it bothers me), then it's safer to look at the experimental results in that paper. The compression in that paper is carried out in bit2byte.cpp, which gets called by compressor.py.
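For completeness, that deterministic convention is just the following mapping; the actual bit-packing and compression live in bit2byte.cpp, this one-liner only illustrates the sign convention:

import torch

def sign_one_bit(g):
    # every entry maps to exactly +1 or -1, so one bit per component suffices
    return torch.where(g >= 0, torch.ones_like(g), -torch.ones_like(g))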

@jxbz thank you. I just wanted to make sure I understand what was used in the graphs, which I guess is the one-bit sign.

In any case, we can probably agree that the ternary sign {-1, 0, 1} is significantly better than the one-bit sign, so the distinction is meaningful. (And also that randomizing sign(0) is a big improvement over the plain one-bit sign.)