Gradient Accumulation / Rolling Gradients
Micky774 opened this issue · comments
Meekail Zain commented
This issue focuses on accumulating gradients, which we'll refer to as rolling. Concretely, we should add an argument --roll k that specifies a factor k by which to roll/accumulate gradients: gradients are accumulated across batches, and the optimizer step (and gradient zeroing) happens only every k batches. This yields an effective batch size of kNM, where N is the per-card batch size, M is the number of cards in use, and k is the roll factor.
As an example, running a per-card batch-size of 25 with a roll factor of 3 while training on 2 cards would result in an effective batch-size of 150.
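A minimal sketch of the proposed behavior, using plain numbers in place of real tensors to show the scheduling (the function name and structure here are illustrative, not an actual implementation; in a real PyTorch-style loop the accumulation happens implicitly because `loss.backward()` adds into the `.grad` buffers until `optimizer.zero_grad()` clears them):

```python
def train_with_roll(batch_grads, roll_k):
    """Accumulate per-batch gradients and 'step' only every roll_k batches."""
    accum = 0.0
    steps = []  # accumulated gradient applied at each optimizer step
    for i, g in enumerate(batch_grads, start=1):
        accum += g               # analogous to loss.backward() accumulating into .grad
        if i % roll_k == 0:      # step/zero only every roll_k batches
            steps.append(accum)  # analogous to optimizer.step()
            accum = 0.0          # analogous to optimizer.zero_grad()
    return steps

# 6 per-batch gradients with a roll factor of 3 -> 2 optimizer steps,
# each applying the sum of 3 batches' gradients.
print(train_with_roll([1, 2, 3, 4, 5, 6], 3))  # -> [6.0, 15.0]
```

Each optimizer step then sees k batches' worth of gradient, which is what makes the effective batch size kNM rather than NM.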