akamaster / pytorch_resnet_cifar10

Proper implementation of ResNet-s for CIFAR10/100 in pytorch that matches description of the original paper.

Epochs chosen differently than in the paper

PabloRR100 opened this issue

Hi @akamaster,

The train set has 45,000 images.
Taking into account that the batch size is 128, that yields 352 iterations per epoch.
In the paper they train the network for 64,000 iterations, which corresponds to roughly 181 epochs of training.
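
For reference, a quick sanity check of that arithmetic in plain Python (assuming the 45k split and batch size 128 from above):

```python
# Sanity check of the numbers above: 45k images, batch size 128, 64k iterations.
train_size, batch_size, paper_iters = 45_000, 128, 64_000

iters_per_epoch = -(-train_size // batch_size)  # ceil division -> 352 batches per epoch
epochs = paper_iters / iters_per_epoch          # ~181.8 -> ~181 full epochs of training

print(iters_per_epoch, epochs)
```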

Please let me know if you agree.

Yes, I agree, partially. In this code there is no train/val split, so the train set has 50k images => 390 iterations per epoch with batch size 128; therefore the total training needed to match the paper's 64k iterations is about 165 epochs, with milestones at epochs 81, 123, and 164. The pretrained networks in the repo were generated with a total of 200 epochs and milestones at 100, 150, and 200.
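
If it helps, here is a rough sketch of how such a schedule could be written with `torch.optim.lr_scheduler.MultiStepLR` (not the exact code of this repo; the model below is a stand-in and the epoch loop body is elided):

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be resnet20()/resnet32()/... from resnet.py.
model = nn.Conv2d(3, 16, kernel_size=3, padding=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

# Divide the LR by 10 at epochs 81 and 123 (roughly 32k and 48k iterations at
# ~390 iterations/epoch) and stop after 165 epochs (~64k iterations).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[81, 123], gamma=0.1)

for epoch in range(165):
    # ... one training pass over the 50k CIFAR-10 train images goes here ...
    scheduler.step()
```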

Thanks for the reply.

A few questions:

  • If you don't have a validation set, how are you making sure you are not overfitting at some point? (See the sketch after this list for the kind of monitoring I mean.)
    I have done a faithful implementation as well, but my problem is that my training accuracy reaches 100% too early (around 100 epochs), and I suspect there is not much room left for improvement, which could be why my best test error stalls at about 7%.

  • Do you have a similar thing for DenseNets? I have coded them to match the paper's suggestions as well, but the feature maps of DenseNets are so big that they require either a huge machine or a more careful implementation.
    In fact, the paper's authors have another paper on memory-efficient DenseNet implementations, with a PyTorch repo available here: https://github.com/gpleiss/efficient_densenet_pytorch. However, this still breaks for me.
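
For the first point, what I have in mind is something like the following sketch: evaluate on a held-out split every epoch and keep the best checkpoint rather than the last one (the names `train_one_epoch`, `train_loader`, and `val_loader` are hypothetical, just to show the pattern):

```python
import copy
import torch


@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    """Top-1 accuracy of `model` over `loader`."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        correct += (model(images).argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    return correct / total


def train_with_val_monitoring(model, optimizer, scheduler, train_one_epoch,
                              train_loader, val_loader, epochs=165):
    """Track held-out accuracy each epoch and keep the best checkpoint, not the last."""
    best_acc, best_state = 0.0, None
    for _ in range(epochs):
        train_one_epoch(model, optimizer, train_loader)  # caller-supplied training step
        scheduler.step()
        val_acc = accuracy(model, val_loader)
        if val_acc > best_acc:
            best_acc, best_state = val_acc, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return best_acc
```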

Thanks!

Hey, I have the same question as Pablo. So, are you using the test set as a validation set?
I was just looking at the paper and they state the following:

"We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split" --> I think this part is indeed lacking on your implementation. I would be happy to add that if you like :)

Cheers,
David

Most repos I've seen with pretrained models overfit the test set. That's another reason why the numbers look so good. At a minimum there should be a train/val/test split.

Hi @kirk86

Saying the model overfits the test set does not make sense, right?
Since the model is not "seeing" (or trying to fit) the test data, it cannot overfit it.

Cheers,
Pablo

Hi @PabloRR100,
True, the model is not trying to fit the test data directly, but think about why we use a validation set in the first place. IMHO the validation set is there to control the bias/variance tradeoff, and based on it you modify your model. Now, how exactly are you not overfitting the test set if you use the test set to modify your model based on that tradeoff? Again, IMHO the test set should stay untouched at all times and be exposed only once, at the very end, after the model has been trained, to evaluate its generalization capabilities.

Dear @PabloRR100 and @kirk86, you are both right. However, in current deep learning practice, even if you do use a validation set to control the bias/variance tradeoff, the fact that everyone publishes ever-better results implicitly means optimizing over (looking into) the test data. Clearly, if a model doesn't improve on the test set, no one publishes it; therefore, whenever something "better" appears, it is necessarily overfitting the test data.