about effective batch size
GoldExcalibur opened this issue
Thanks for your excellent work and the released code.
- I have a question about the effective batch size, which is batch size 128 * accumulated_grad_batch 16 = 2048. Does this mean the model sees 128 samples at a time, computes the gradient, and then sums the gradients over the 16 mini-batches? This kind of implementation differs from the usual notion of a batch size of 2048, where the model sees 2048 samples at once and the InfoNCE loss is computed over all 2048 samples rather than over only 128 (see the first sketch after this list).
- Besides, I noticed that the precision is set to 16 bit. Why is it necessary not to use 32 bit?
- In src/models/base_model.py, I find that warmup_epochs and max_epochs are rescaled by a factor of self.train_iters_per_epoch // self.config.num_of_mini_batch. Why is this rescaling necessary? If this factor is not equal to 1, the max_epochs in the learning rate scheduler does not equal the max_epochs in the pl trainer, which I think is not quite reasonable (see the second sketch after this list).
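To make the first question concrete, here is a toy sketch of what I think gradient accumulation does; the encoder, data, and `info_nce` below are placeholders I made up, not code from this repo:

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature=0.1):
    # q, k: (N, d) query/key embeddings; the positive of q[i] is k[i],
    # and the negatives are the other N - 1 keys in the same mini-batch.
    logits = F.normalize(q, dim=1) @ F.normalize(k, dim=1).t() / temperature
    return F.cross_entropy(logits, torch.arange(q.size(0)))

encoder = torch.nn.Linear(32, 16)                      # toy encoder
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.1)

accumulate_grad_batches, mini_batch = 16, 128
optimizer.zero_grad()
for _ in range(accumulate_grad_batches):
    x1, x2 = torch.randn(mini_batch, 32), torch.randn(mini_batch, 32)  # two views
    loss = info_nce(encoder(x1), encoder(x2))          # negatives come from these 128 samples only
    (loss / accumulate_grad_batches).backward()        # gradients are summed over the 16 calls
optimizer.step()                                       # one update per 16 * 128 = 2048 samples

# A "true" batch size of 2048 would instead be:
#   loss = info_nce(encoder(X1_2048), encoder(X2_2048))
# so each sample would see 2047 in-batch negatives rather than 127.
```

If my understanding above is right, the two setups give the same number of samples per optimizer step but not the same InfoNCE loss, which is why I am asking.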
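For the third question, here is how I read the rescaling, written out with made-up numbers (the cosine schedule and the variable values below are my guess at the intent, not the repo's exact code): if the scheduler is stepped once per optimizer step, the factor seems to convert epoch counts into optimizer-step counts. Please correct me if this reading is wrong.

```python
import math
import torch

# Hypothetical numbers, only to show the unit conversion:
train_iters_per_epoch = 1000        # mini-batches (of 128) per epoch
num_of_mini_batch = 16              # gradient accumulation factor
steps_per_epoch = train_iters_per_epoch // num_of_mini_batch  # optimizer steps per epoch

warmup_epochs, max_epochs = 10, 100  # as passed to the pl.Trainer
warmup_steps = warmup_epochs * steps_per_epoch
max_steps = max_epochs * steps_per_epoch

def warmup_cosine(step):
    # linear warmup over warmup_steps, then cosine decay until max_steps
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
# scheduler.step() would be called once per *optimizer step*, so the rescaled
# warmup/max values are step counts rather than Trainer epochs.
```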