jxbz / signSGD

Code for the signSGD paper

Home Page: https://arxiv.org/abs/1802.04434

Where is signSGD performed?

manishadubey91 opened this issue · comments

I am unable to figure out where exactly the sign of the gradient is taken into consideration (except in the toy example).

Hi @manishadubey91, sorry this is unclear. You have to pass in the optimiser as a command line argument. For example:

python train_resnet.py --optim signum --lr 0.0001 --wd 0.00001

This works because Signum was implemented in the MXNet deep learning framework (see this page). I can also share PyTorch code for the optimiser if that would help.
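For reference, here is a minimal PyTorch sketch of the Signum update (an exponential moving average of gradients, followed by a step in the sign of that average). The class and argument names are illustrative, not the code used for the paper:

import torch

class Signum(torch.optim.Optimizer):
    # Sketch of Signum: keep a momentum buffer and step in the direction of its sign.
    def __init__(self, params, lr=1e-4, momentum=0.9, weight_decay=0.0):
        defaults = dict(lr=lr, momentum=momentum, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr, beta, wd = group['lr'], group['momentum'], group['weight_decay']
            for p in group['params']:
                if p.grad is None:
                    continue
                g = p.grad
                if wd != 0:
                    g = g.add(p, alpha=wd)  # fold L2 weight decay into the gradient
                buf = self.state[p].setdefault('momentum_buffer', torch.zeros_like(p))
                buf.mul_(beta).add_(g, alpha=1 - beta)  # exponential moving average of gradients
                p.add_(torch.sign(buf), alpha=-lr)      # step in the sign of the momentum
                # note: torch.sign maps exact zeros to 0, i.e. the ternary behaviour discussed below

With this sketch, the command above would correspond roughly to Signum(model.parameters(), lr=0.0001, weight_decay=0.00001).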

This is the implementation you're referring to, right? https://github.com/apache/incubator-mxnet/blob/f70c7b7b1e246e32e322ba059f8bf0e5d01a22be/src/operator/optimizer_op-inl.h#L2303

It seems to be using 2 bits: (-1, 0, 1).

Hi @amitport, you're right and thanks for pointing this out. In this paper, we used an implementation of the sign function that quantised positive gradients to +1, negative gradients to -1, and zero gradients to 0. I think this was done at the time under the (naïve) assumption that a gradient component being exactly zero was unlikely to occur in practice. I'm planning to run some experiments to test if/how much this makes a difference to convergence, and will report back.
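Concretely, that ternary mapping is what torch.sign gives by default (an illustrative snippet, not the training code):

import torch

g = torch.tensor([-0.3, 0.0, 2.1])
print(torch.sign(g))  # tensor([-1., 0., 1.]) -- exact zeros stay at zero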

Hi @amitport, I tested the difference between the version that sends sign(0) --> 0 and the version that sends sign(0) --> ±1 at random. The tests and results are in this Jupyter notebook. At least for training ResNet-18 on CIFAR-10, there was little difference between the two implementations.
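The random-tie-breaking version used in that comparison can be written in a couple of lines (a sketch for illustration, not the notebook code):

import torch

def sign_random_zero(g):
    # sign(0) -> +1 or -1 with equal probability; nonzero entries behave like torch.sign
    random_signs = torch.where(torch.rand_like(g) < 0.5, -torch.ones_like(g), torch.ones_like(g))
    return torch.where(g == 0, random_signs, torch.sign(g))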

That being said, in the distributed experiments in the ICLR 2019 paper, we used an implementation of the sign function that maps sign(0) --> +1 deterministically. So if this issue still bothers you (it bothers me), then it's safer to look at the experimental results in that paper. The compression in that paper is carried out in bit2byte.cpp, which gets called by compressor.py.
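For completeness, that deterministic convention is just the following mapping; the actual bit-packing and compression live in bit2byte.cpp, this one-liner only illustrates the sign convention:

import torch

def sign_one_bit(g):
    # every entry maps to exactly +1 or -1, so one bit per component suffices
    return torch.where(g >= 0, torch.ones_like(g), -torch.ones_like(g))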

@jxbz thank you. I just wanted to make sure I understand what was used in the graphs, which I guess is the one-bit sign.

In any case, we can probably agree that the ternary sign {-1, 0, 1} is significantly better than the one-bit sign, so the distinction is meaningful. (And also that randomizing sign(0) is a big improvement over the plain one-bit sign.)