epfml / powersgd

Practical low-rank gradient compression for distributed optimization: https://arxiv.org/abs/1905.13727

What's the difference between the paper code and the recent code?

jaewonalive opened this issue

Hi,

I'm impressed by the efficiency of PowerSGD.

Could you let me know the difference between the paper code and the recent code located at powersgd/powersgd.py?

Is there any performance difference?

I think the paper code and the recent code are not identical.

Does the recent code also converge well?

Could you explain the reason for the difference?

Hi @jaewonalive,

Thanks for your message. Indeed, they are not equivalent. The main difference is that the current algorithm looks a bit like Algorithm 2 in a follow-up paper.

We found that there are two ways to control the approximation quality in PowerSGD: the 'rank' of the approximation and the 'number of power iterations'. Because the cost of orthogonalisation grows as O(rank^2), increasing the rank becomes inefficient beyond a point, which leaves the number of iterations as the more practical knob.
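To make that cost concrete, here is a minimal sketch of column-wise Gram-Schmidt orthogonalisation (my own illustration, not necessarily the repo's exact code); the nested loop over earlier columns is what makes the work grow as O(rank^2):

```python
import torch

def orthogonalize(m: torch.Tensor) -> torch.Tensor:
    """Modified Gram-Schmidt over the columns of m, in place.

    Column i is projected against all i earlier columns, so the total
    work is proportional to rank^2 (times the column length)."""
    for i in range(m.shape[1]):
        col = m[:, i : i + 1]
        for j in range(i):  # O(i) projections here -> O(rank^2) overall
            prev = m[:, j : j + 1]
            col -= (prev.t() @ col) * prev
        col /= col.norm()
    return m
```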

In the original PowerSGD paper, using more iterations only improves the quality of the rank-k approximation, as the factors converge closer and closer to the best rank-k approximation. In the follow-up paper, the intermediate results of the iterations are all kept, so the effective rank grows with the number of iterations.
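Roughly, the two behaviours could be contrasted like this. This is a sketch under my own assumptions, not code from either paper: the names `refine` and `accumulate` are hypothetical, and the residual subtraction in `accumulate` is my guess at how the intermediate factors are made to span new directions.

```python
import torch

def refine(grad: torch.Tensor, q: torch.Tensor, num_iters: int):
    """Original-paper flavour: each extra iteration only sharpens the
    same rank-k pair (p, q) towards the best rank-k approximation.
    Assumes num_iters >= 1."""
    for _ in range(num_iters):
        p, _ = torch.linalg.qr(grad @ q)  # orthonormal left factor, m x k
        q = grad.t() @ p                  # right factor, n x k
    return p, q                           # approximation p @ q.t(), rank k

def accumulate(grad: torch.Tensor, q: torch.Tensor, num_iters: int):
    """Follow-up flavour (in the spirit of its Algorithm 2): keep the
    factors from every iteration, so the effective rank grows to
    k * num_iters."""
    ps, qs = [], []
    residual = grad
    for _ in range(num_iters):
        p, _ = torch.linalg.qr(residual @ q)
        q = residual.t() @ p
        ps.append(p)
        qs.append(q)
        residual = residual - p @ q.t()  # deflate what is already captured
    return torch.cat(ps, dim=1), torch.cat(qs, dim=1)
```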

In the original PowerSGD paper, we always used two iterations per SGD step (a left and a right iteration), as sketched below. In that setting, there is not much of a difference between the two variants; the difference appears only when you use more power-iteration steps per SGD step.
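For reference, here is a minimal sketch of what two iterations per SGD step looks like, with `q` warm-started across steps; the shapes and the random tensors standing in for gradients are assumptions for illustration only:

```python
import torch

m, n, rank, steps = 1024, 512, 4, 10  # assumed shapes for illustration

q = torch.randn(n, rank)  # right factor, warm-started across SGD steps
for _ in range(steps):
    grad = torch.randn(m, n)          # stand-in for this step's gradient
    p, _ = torch.linalg.qr(grad @ q)  # first iteration: orthonormal left factor
    q = grad.t() @ p                  # second iteration: update right factor
    approx = p @ q.t()                # decompressed low-rank gradient
```

Because `q` carries over from step to step, a single left/right pair per SGD step is enough in practice, even though a cold-started power iteration would need several passes to converge.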

Hope this explains it a bit. I'll also add this to the README. Don't hesitate to reach out if you have any other questions.