Gradient Accumulation / Rolling Gradients
Micky774 opened this issue · comments
Meekail Zain commented
This issue focuses on accumulating gradients, which we'll refer to as rolling. Concretely, we should add an argument --roll k that specifies a factor k by which to roll/accumulate gradients: gradients are accumulated across batches, and the optimizer step (and gradient zeroing) happens only every k batches. This yields an effective batch size of kNM, where N is the per-card batch size, M is the number of cards in use, and k is the roll factor.
As an example, running a per-card batch-size of 25 with a roll factor of 3 while training on 2 cards would result in an effective batch-size of 150.
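A minimal sketch of the proposed behavior, using plain numbers in place of real tensors to show the scheduling (the function name and structure here are illustrative, not an actual implementation; in a real PyTorch-style loop the accumulation happens implicitly because `loss.backward()` adds into the `.grad` buffers until `optimizer.zero_grad()` clears them):

```python
def train_with_roll(batch_grads, roll_k):
    """Accumulate per-batch gradients and 'step' only every roll_k batches."""
    accum = 0.0
    steps = []  # accumulated gradient applied at each optimizer step
    for i, g in enumerate(batch_grads, start=1):
        accum += g               # analogous to loss.backward() accumulating into .grad
        if i % roll_k == 0:      # step/zero only every roll_k batches
            steps.append(accum)  # analogous to optimizer.step()
            accum = 0.0          # analogous to optimizer.zero_grad()
    return steps

# 6 per-batch gradients with a roll factor of 3 -> 2 optimizer steps,
# each applying the sum of 3 batches' gradients.
print(train_with_roll([1, 2, 3, 4, 5, 6], 3))  # -> [6.0, 15.0]
```

Each optimizer step then sees k batches' worth of gradient, which is what makes the effective batch size kNM rather than NM.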