ZJULearning / pixel_link

Implementation of our paper 'PixelLink: Detecting Scene Text via Instance Segmentation' in AAAI2018


tf.contrib.slim.learning.train() relationship between batch size, steps, and epochs?

mattroos opened this issue · comments

I'm having trouble finding answers to this in the TF documentation or code. What is the definition of a 'step' in the context of tf.contrib.slim.learning.train()? The ICDAR2015 dataset has 1000 training images. Does a step mean that 1000 images were processed (an 'epoch,' in most people's terminology)? Or that a single batch (e.g., 3*24 in the Pixel Link paper) was processed? Or something else?
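For reference, here is a minimal sketch of the epoch/step arithmetic I'm asking about, assuming a step means one processed batch (the helper name is mine, not from slim):

```python
def steps_per_epoch(dataset_size, batch_size):
    """Hypothetical helper: how many train() steps it would take to see
    every image once, IF a step equals one batch."""
    return dataset_size / batch_size

# ICDAR2015: 1000 training images; paper batch size = 3 GPUs * 24 images/GPU.
print(steps_per_epoch(1000, 3 * 24))  # ~13.9 steps per epoch under that assumption
```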

Relatedly, if I'm using a single GPU with IMG_PER_GPU set to 16, then the batch size will be 16. The pretrained model was first trained for 100 steps with a learning rate of 1e-3 and a batch size of 72. What should I set the number of steps to with my single GPU and batch size of 16, to train on the equivalent number of images during this initial learning-rate phase of the training?

Speculating on an answer to my own question, based on the docstring for the train() function, a step is a gradient step, e.g., one update of the parameters based on the loss for a batch. So to get training somewhat equivalent to 100 steps on 3 GPUs at 24 images per GPU, using only 1 GPU and 16 images per GPU instead, we'd need to execute (3*24)/(1*16)*100 = 450 steps. In that case we'll have trained on the same number of samples as for the 3 GPU case. Of course, results could be quite different since we'll have made 4.5x more gradient updates (steps), albeit with noisier gradients (in some sense, due to the smaller batch size).
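The scaling above can be sketched as a small helper (my own illustration, assuming a step is indeed one gradient update per batch):

```python
def equivalent_steps(ref_gpus, ref_imgs_per_gpu, ref_steps, gpus, imgs_per_gpu):
    """Steps needed on the new setup to process the same total number of
    images as the reference run (assumes 1 step = 1 batch = 1 gradient update)."""
    ref_images = ref_gpus * ref_imgs_per_gpu * ref_steps
    return ref_images // (gpus * imgs_per_gpu)

# Reference: 3 GPUs * 24 images/GPU for 100 steps; mine: 1 GPU * 16 images/GPU.
print(equivalent_steps(3, 24, 100, 1, 16))  # -> 450
```

Note this only equalizes the number of images seen; as mentioned, the 4.5x more (noisier) gradient updates mean the resulting model may still differ.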