microsoft / infinibatch

Efficient, check-pointed data loading for deep learning with massive data sets.

A few questions on real-world usage of infinibatch while training models

sai-prasanna opened this issue · comments

sai-prasanna commented:

Hi, I have been experimenting with this awesome library, and I wrote a blog post about it: https://saiprasanna.in/posts/efficient-dynamic-batching-of-large-datasets-with-infinibatch/. Making dynamic batches based on tokens per batch rather than a fixed batch size has huge advantages in terms of reducing the total number of batches. I have a few questions regarding convergence in such a dynamic batching setting. I would be grateful if you could help me out with these.
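For reference, this is a minimal sketch of the kind of token-budget batching I mean, roughly following the pattern from the library's tutorial. The toy sentence data and the `tokens_per_batch` value are placeholders; the `batch_size` callable receives the longest item of the read-ahead bucket and returns how many items to pack into the batch.

```python
from infinibatch import iterators

# Toy in-memory "dataset": token-id sequences of varying length.
# In real use this would come from a chunked dataset iterator.
sentences = [[i] * length for i, length in enumerate([5, 30, 12, 80, 3, 45] * 100)]
source = iterators.InfinitePermutationSourceIterator(sentences, seed=42)

tokens_per_batch = 400  # placeholder budget; tuned to GPU memory in practice

# Bucket by length, then pack as many sequences as fit under the token budget.
# The batch_size callable gets the longest item of the bucket and returns the
# number of items to place in that batch.
batches = iterators.BucketedReadaheadBatchIterator(
    source_iterator=source,
    read_ahead=1000,                       # items to read ahead for bucketing
    key=lambda seq: len(seq),              # sort/bucket by sequence length
    batch_size=lambda longest: tokens_per_batch // (1 + len(longest)),
    seed=1,
)

for _ in range(3):
    batch = next(batches)
    print(len(batch), "sequences,", sum(len(s) for s in batch), "tokens")
```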

  • Is maximizing the tokens per batch and doing dynamic batching, with no limit on batch size other than GPU memory, OK for convergence? Will having a different batch size at each step (but a constant number of tokens per batch) affect convergence?
    The tokens within each instance of a batch are correlated, so would batches with few long instances be "noisy" in terms of the update they provide? Should we re-scale the losses or do something else to address this? Or should there be a cut-off on the maximum number of instances per batch (using the lambda we provide)?
  • When doing distributed data parallel training in torch with data loading from infinibatch, each GPU might see a different batch size (though the tokens per batch might be the same). Should we take the number of instances per batch into account when syncing gradients? (See the sketch after this list.)
  • Is there any rule of thumb regarding hyperparameters when doing dynamic batching?
  • In a transfer learning setting, would it be advisable to do this?
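For the distributed data parallel question, this is the kind of workaround I have in mind: a rough PyTorch sketch, not something infinibatch provides. The `model`, `batch["labels"]`, and `pad_id` names are placeholders, and I am assuming the model returns the per-token loss summed over the local batch. The idea is to normalize by the global token count obtained via an all_reduce, so uneven per-GPU batch sizes still yield an unbiased per-token gradient.

```python
import torch
import torch.distributed as dist

def ddp_token_normalized_backward(model, batch, pad_id):
    """Backward pass where the loss is normalized by the *global* token count
    across all DDP workers, so uneven per-GPU batch sizes still contribute
    proportionally to the gradient. Assumes `model(batch)` returns the
    cross-entropy loss summed over the local batch (illustrative only)."""
    # Local sums: total loss and number of non-padding target tokens.
    local_loss_sum = model(batch)                      # scalar: summed token losses
    local_tokens = (batch["labels"] != pad_id).sum()   # scalar: local token count

    # Global token count across workers (every worker gets the same value).
    global_tokens = local_tokens.to(local_loss_sum.device, torch.float)
    dist.all_reduce(global_tokens, op=dist.ReduceOp.SUM)

    world_size = dist.get_world_size()
    # DDP averages gradients over workers (divides by world_size), so multiply
    # back by world_size and divide by the global token count to recover a true
    # "sum of token losses / total tokens" gradient.
    loss = local_loss_sum * world_size / global_tokens
    loss.backward()
    return loss.detach()
```

Would something along these lines be the right way to handle the varying number of instances per GPU, or is plain per-worker averaging good enough in practice?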