titu1994 / keras-efficientnets

Keras Implementation of EfficientNets


Clarification on pre-processing and data augmentation for pre-trained ImageNet weights.

JossWhittle opened this issue

Hi, can I confirm the method used to obtain the pre-trained ImageNet weights for each of the models currently available? Are these weights ported from the official TPU repo checkpoint files, or are they from training the Keras port from random initialization? Are they from the standard pre-processing checkpoints or the AutoAugment checkpoints?

Similarly, what image pre-processing do these weights expect? This repository (https://github.com/titu1994/keras-efficientnets/blob/master/keras_efficientnets/efficientnet.py#L51) indicates that for the ImageNet weights we should use Torch-style pre-processing: keep RGB channel ordering, subtract the ImageNet training-set mean, and divide by the ImageNet training-set standard deviation. Is this correct?

I'll split the answer into sections as this is a bit difficult to explain otherwise.

  1. Yes, the weights are ported from the official TensorFlow repository; they were not trained by me.
  2. The weights are from the standard checkpoints, not the AutoAugment checkpoints.
  3. They expect the Keras "torch" mode preprocessing, as shown in the sketch below. This is evident from https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/efficientnet_builder.py#L29, and it is what "torch" mode does in Keras (https://github.com/keras-team/keras-applications/blob/master/keras_applications/imagenet_utils.py#L47). The default "caffe" mode instead converts RGB->BGR, which the TF implementation does not use.
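
For concreteness, here is a minimal sketch of what the Keras "torch" mode amounts to for channels-last RGB inputs (the mean/std constants are the standard ImageNet statistics used by keras_applications):

```python
import numpy as np

# Keras "torch" mode: scale to [0, 1], then normalize with the ImageNet
# train-set statistics. Channel order stays RGB (no RGB->BGR swap,
# unlike "caffe" mode).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def torch_mode_preprocess(x):
    """x: float array of RGB images in [0, 255], shape (..., H, W, 3)."""
    x = x / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD
```

Calling keras.applications.imagenet_utils.preprocess_input(x, mode='torch') should produce the same result.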

As to your earlier issue of EfficientNets getting poorer scores on CIFAR-10/100 with this normalization scheme, that is to be expected. Those mean and std values are for ImageNet; they must be computed separately for CIFAR-10 or CIFAR-100. I would instead suggest simply using "tf" mode normalization (divide by 127.5, then subtract 1) to get inputs in the [-1, 1] range for CIFAR.
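
In code, "tf" mode is just:

```python
def tf_mode_preprocess(x):
    """Rescale inputs from [0, 255] to [-1, 1] (Keras "tf" mode)."""
    return x / 127.5 - 1.0
```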

Thanks for the fast reply. I realized after I initially posted that I had mistakenly assumed the BGR->RGB conversion was triggered by mode 'torch', when it is not.

If using 'tf' mode to normalize [0, 255] to [-1, 1], does this not leave the pre-trained weights in the first layer incorrectly scaled? If training only a new classifier head on frozen pre-trained conv weights, this seems like it should hurt accuracy.

Would you recommend feeding 'tf'-normalized CIFAR-100 into the 'torch'-normalized pre-trained model and fine-tuning the new classifier plus conv weights with everything un-frozen at a low learning rate? Or training the new classifier head with frozen conv weights first, then un-freezing to fine-tune the whole model?
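
For reference, here is a minimal sketch of the two-stage schedule being described, assuming the builder signature from this repo's README (EfficientNetB0 with include_top and weights arguments) and an illustrative CIFAR-100 head; the learning rates are placeholders:

```python
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model
from keras.optimizers import Adam
from keras_efficientnets import EfficientNetB0  # assumed import path for this repo

# Stage 1: train only a new classifier head on frozen pretrained features.
base = EfficientNetB0((224, 224, 3), include_top=False, weights='imagenet')
for layer in base.layers:
    layer.trainable = False

x = GlobalAveragePooling2D()(base.output)
outputs = Dense(100, activation='softmax')(x)  # CIFAR-100 head
model = Model(base.input, outputs)
model.compile(Adam(lr=1e-3), 'categorical_crossentropy', metrics=['accuracy'])
# model.fit(...) on up-scaled, "torch"-normalized CIFAR-100 here

# Stage 2: unfreeze everything and fine-tune at a much lower learning rate.
for layer in base.layers:
    layer.trainable = True
model.compile(Adam(lr=1e-5), 'categorical_crossentropy', metrics=['accuracy'])
# model.fit(...) again
```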

Lastly, is it correct to interpret this repository and the paper as saying that the 32x32 CIFAR-10/100 images should be up-scaled to 224x224 resolution (via bilinear or bicubic interpolation) before being passed to the B0 model, and up to 600x600 resolution for B7?
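
For reference, a minimal sketch of that up-scaling, assuming TF 2.x's tf.image.resize (224 for B0, 600 for B7):

```python
import tensorflow as tf

def upscale_cifar(images, size=224):
    """Bicubic-upscale a batch of 32x32 CIFAR images to the model's input resolution."""
    return tf.image.resize(images, (size, size), method='bicubic')
```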

Thank you kindly for your time.

What I meant by using "tf" mode for CIFAR was that, if you were not using pretrained weights, it would be an easy alternative to computing the mean and std of CIFAR and normalizing with those. I should have made that clearer.

Also, I don't recommend using these EfficientNets for CIFAR at all. Upsampling from 32x32 to 224x224 or more via bilinear or bicubic interpolation makes the images extremely blurry.

As to whether up-scaling is needed, I would say it definitely is. These models have 5 reduction stages, reducing the spatial size by a factor of 2^5 = 32. At the end of the model, a 32x32 CIFAR input therefore has spatial dimensions 1x1xC, where C is the final number of channels. At such a small spatial resolution, most of the information useful to the model has already been lost.
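
The arithmetic makes this concrete:

```python
# Five stride-2 reduction stages divide spatial resolution by 2**5 = 32.
for size in (32, 224):
    print(size, '->', size // 2**5)
# 32 -> 1   (a CIFAR input collapses to a 1x1 feature map)
# 224 -> 7  (an ImageNet-sized input keeps a 7x7 feature map)
```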

Thanks again for the clarification. I should add that I am trying to replicate the SOTA CIFAR-100 accuracy. The details of how the transfer-learning experiments were structured are omitted from the paper and the official TPU repo, so I am experimenting to recreate those training configurations.

Blurring was my concern too at these large up-scaled resolutions, but I cannot see how else they would have achieved this. If the model is pre-trained on full-resolution 224x224 ImageNet images, then an object of interest (a cat or a car, for example) should take up roughly the same proportion of the up-scaled image as it does in the 32x32 CIFAR images, albeit with greatly reduced detail. That loss of detail will affect the quality of the activations most in the first few groups of layers, before the reductions make it more manageable.

When feeding small images into the model, the reduction steps will, as you say, produce degenerate 1x1 feature maps in the later layers, and I have observed this to cause NaNs and exploding gradients. But if the small CIFAR images are fed in at 32x32, the perceived pixel size of each object is also greatly reduced from what the pre-trained weights expect; e.g. a cat that was at least ~100 pixels across in ImageNet is now only ~12 pixels across in CIFAR.

I didn't know this paper ran CIFAR tests; that's interesting. If they have not specified the details in the paper, the next course of action would be to email the first author, cc'ing the rest, for additional details. You might get lucky and receive a reply soon enough.

In Table 5 they report transfer-learning results from ImageNet for CIFAR-100 and other datasets.

EfficientNet-B0 achieves 88.1% validation accuracy with 4M parameters, 21x fewer than NASNet-A, which achieves 87.5% with 85M parameters.

Similarly, EfficientNet-B7 achieves 91.7% validation accuracy with 64M parameters, compared to GPipe's 91.3% with 556M parameters: an 8.7x reduction.

I think yes, I will write to the authors after my next test runs. As a side note, are you accepting pull requests? I have been working on a modified version of the model compatible with TensorFlow 2.0, which required some changes. :)

Yes, I am accepting pull requests, but as TensorFlow 2 is still in beta, may I request that you duplicate whatever script you wish to edit and prepend tf_ to its name.

Certainly. I was wondering if a tf2 branch would be appropriate.

Yes that would be great as well.

@JossWhittle were you able to reproduce the CIFAR-10/100 results? If so, can you elaborate on the changes required?

@shairoz I was not able to. I got to within 3% on CIFAR-100, though.

Thank you @JossWhittle, do you mean within 3% of the reported value (i.e. ~87%)? If so, what changes did you make, and what optimizer/hyperparameters did you use?

About that accurate, yes. I used the hyperparameters originally reported for transfer learning from ImageNet weights to CIFAR-100 in the official repo accompanying the paper. Those were designed for a TPU pod, and I was running on a pair of P6000 GPUs.

I scaled the CIFAR-100 images up to the resolution the B7 model was trained on for ImageNet, using bicubic interpolation. I also experimented with whitening techniques such as ZCA whitening, which was used for the previous SOTA results on CIFAR-100 in the WideResNet paper.
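
For anyone trying to replicate this, Keras' ImageDataGenerator exposes ZCA whitening directly; a minimal sketch (whether this matches the exact whitening used in the WideResNet paper is an assumption):

```python
from keras.datasets import cifar100
from keras.preprocessing.image import ImageDataGenerator

(x_train, y_train), _ = cifar100.load_data()

# ZCA whitening requires fitting on the training data first; Keras computes
# the whitening matrix over the flattened 32*32*3 = 3072 pixel dimensions.
datagen = ImageDataGenerator(zca_whitening=True)
datagen.fit(x_train)

# Stream whitened batches, e.g.:
# model.fit_generator(datagen.flow(x_train, y_train, batch_size=64), ...)
```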