ClementPinard / FlowNetPytorch

Pytorch implementation of FlowNet by Dosovitskiy et al.

How long to train the model?

cxy7tv opened this issue · comments

Hi, I'm trying to train the model from scratch on my device.
The args are as follows:
--model=FlowNetS --n_epoch=8 --batch_size=128 --num_workers 32
My device has four 2080 Ti GPUs.
I find that an epoch takes about 25 minutes, which means about 5 days to run the default 300 epochs.
I wonder how long the training of flownets_EPE1.951.pth.tar from the training results took.
Looking forward to your reply!

Maybe I can answer this. FlowNetC, 300 epochs with 16 cores on a single Nvidia Tesla, takes between 1 and 2 days.

Please take into account that the low endpoint error on FlyingChairs might be due to a little overfitting: when testing on Sintel I always obtain lower performance.
Secondly, I have noticed that larger linear displacements can lead to better generalization (tried on Sintel). However, this requires implementing a custom data augmentation.

The 300 epochs are set for a batch size of 8 images.

With your batch size of 128, to see the same amount of data you would only need about 20 epochs (300/16).
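The scaling above can be sketched with a tiny helper. This is illustrative only (the function name is mine, not from the repository), and it assumes the epoch size is measured in batches (iterations), so a larger batch means more samples seen per epoch:

```python
def equivalent_epochs(ref_epochs: int, ref_batch: int, new_batch: int) -> int:
    """Epochs needed with `new_batch` so the network sees roughly the same
    total number of samples as `ref_epochs` with `ref_batch`, assuming the
    epoch size counts batches (iterations), not samples."""
    return max(1, round(ref_epochs * ref_batch / new_batch))

print(equivalent_epochs(300, 8, 128))  # -> 19, i.e. roughly 20 epochs (300/16)
```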

But as @jeffbaena pointed out, training usually takes 1 or 2 days with regular parameters (personally tested with a 980 Ti).

Finally, if you want to apply heavier data augmentation, you can change the parameters here and use higher values. (But keep in mind that the rotation angle must stay low, because the code uses the approximation sin(x) ~ x.)
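To see why the rotation angle must stay low, here is a quick standalone check (not code from the repository) of how fast the relative error of the sin(x) ~ x approximation grows with the angle:

```python
import math

# Relative error of the small-angle approximation sin(x) ~ x.
# It is negligible for a few degrees but blows up for large rotations,
# which is why the augmentation's rotation range should stay small.
for deg in (1, 5, 10, 20, 45):
    x = math.radians(deg)
    rel_err = abs(math.sin(x) - x) / math.sin(x)
    print(f"{deg:>2} deg: relative error {rel_err:.4%}")
```

At a few degrees the error is a fraction of a percent, but at 45 degrees it exceeds 10%, so large rotation angles would make the augmented flow targets noticeably wrong.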

Thank you all for sharing your training experience and advice on data augmentation, @jeffbaena @ClementPinard.
It turns out I had set epoch-size to 2700, which is why each epoch took so long to train.
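For anyone hitting the same slowdown, the arithmetic is simple. This sketch assumes (as the repository's train loop appears to do) that --epoch-size counts batches per epoch, not samples:

```python
# Illustrative arithmetic only, not code from the repository.
epoch_size = 2700          # batches per epoch, as the reporter had set
batch_size = 128
samples_per_epoch = epoch_size * batch_size
print(samples_per_epoch)   # -> 345600
```

With a batch size of 128, an epoch-size of 2700 means 345,600 samples per "epoch", roughly 15 passes over the ~22k image pairs of FlyingChairs, which explains the 25-minute epochs.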