titu1994 / keras-efficientnets

Keras Implementation of EfficientNets


Clarification on pre-processing and data augmentation for pre-trained ImageNet weights.

JossWhittle opened this issue

Hi, can I confirm the method used to obtain the pre-trained ImageNet weights for each of the models currently available? Are these weights ported from the official TPU repo checkpoint files, or are they from training the Keras port from random initialization? Are they from the standard pre-processing checkpoints or the AutoAugment checkpoints?

Similarly, what image pre-processing do these weights expect? This repository (https://github.com/titu1994/keras-efficientnets/blob/master/keras_efficientnets/efficientnet.py#L51) indicates that for the ImageNet weights we should use Torch-style pre-processing: keep RGB channel ordering, subtract the ImageNet training-set mean, and divide by the ImageNet training-set standard deviation. Is this correct?

I'll split the answer into sections as this is a bit difficult to explain otherwise.

  1. Yes, the weights are ported from the official TensorFlow repository; they were not trained by me.
  2. The weights are from the standard checkpoints, not the AutoAugment checkpoints.
  3. They expect the Keras "torch" mode preprocessing, as shown in the sketch below. This is evident from https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/efficientnet_builder.py#L29, and it is what "torch" mode does in Keras (https://github.com/keras-team/keras-applications/blob/master/keras_applications/imagenet_utils.py#L47). The default "caffe" mode instead converts RGB->BGR, which the TF implementation does not use.
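
For concreteness, here is a minimal sketch of what the Keras "torch" mode amounts to for channels-last RGB inputs (the mean/std constants are the standard ImageNet statistics used by keras_applications):

```python
import numpy as np

# Keras "torch" mode: scale to [0, 1], then normalize with the ImageNet
# train-set statistics. Channel order stays RGB (no RGB->BGR swap,
# unlike "caffe" mode).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def torch_mode_preprocess(x):
    """x: float array of RGB images in [0, 255], shape (..., H, W, 3)."""
    x = x / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD
```

Calling keras.applications.imagenet_utils.preprocess_input(x, mode='torch') should produce the same result.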

As to your earlier issue of EfficientNets getting poorer scores on CIFAR-10/100 with this normalization scheme, that is to be expected. Those mean and std values are for ImageNet; they must be computed separately for CIFAR-10 or CIFAR-100. I would instead suggest simply using "tf" mode normalization (divide by 127.5, then subtract 1) to get inputs in the [-1, 1] range for CIFAR.
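
In code, "tf" mode is just:

```python
def tf_mode_preprocess(x):
    """Rescale inputs from [0, 255] to [-1, 1] (Keras "tf" mode)."""
    return x / 127.5 - 1.0
```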

Thanks for the fast reply. I realized after I initially posted that I had mistakenly assumed the BGR->RGB conversion was triggered by mode 'torch', when it is not.

If using 'tf' mode to normalize [0, 255] to [-1, 1], does this not leave the pre-trained weights in the first layer incorrectly scaled? If training only a new classifier head on frozen pre-trained conv weights, this seems like it should hurt accuracy.

Would you recommend feeding 'tf'-normalized CIFAR-100 into the 'torch'-normalized pre-trained model and fine-tuning the new classifier plus conv weights with everything un-frozen at a low learning rate? Or training the new classifier head with frozen conv weights first, then un-freezing to fine-tune the whole model?
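
For reference, here is a minimal sketch of the two-stage schedule being described, assuming the builder signature from this repo's README (EfficientNetB0 with include_top and weights arguments) and an illustrative CIFAR-100 head; the learning rates are placeholders:

```python
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model
from keras.optimizers import Adam
from keras_efficientnets import EfficientNetB0  # assumed import path for this repo

# Stage 1: train only a new classifier head on frozen pretrained features.
base = EfficientNetB0((224, 224, 3), include_top=False, weights='imagenet')
for layer in base.layers:
    layer.trainable = False

x = GlobalAveragePooling2D()(base.output)
outputs = Dense(100, activation='softmax')(x)  # CIFAR-100 head
model = Model(base.input, outputs)
model.compile(Adam(lr=1e-3), 'categorical_crossentropy', metrics=['accuracy'])
# model.fit(...) on up-scaled, "torch"-normalized CIFAR-100 here

# Stage 2: unfreeze everything and fine-tune at a much lower learning rate.
for layer in base.layers:
    layer.trainable = True
model.compile(Adam(lr=1e-5), 'categorical_crossentropy', metrics=['accuracy'])
# model.fit(...) again
```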

Lastly, is it correct to interpret this repository and the paper as saying that the 32x32 CIFAR-10/100 images should be up-scaled to 224x224 resolution (via bilinear or bicubic interpolation) before being passed to the B0 model, and up to 600x600 resolution for B7?
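
For reference, a minimal sketch of that up-scaling, assuming TF 2.x's tf.image.resize (224 for B0, 600 for B7):

```python
import tensorflow as tf

def upscale_cifar(images, size=224):
    """Bicubic-upscale a batch of 32x32 CIFAR images to the model's input resolution."""
    return tf.image.resize(images, (size, size), method='bicubic')
```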

Thank you kindly for your time.

What I meant by using "tf" mode for CIFAR was that, if you were not using pretrained weights, it would be an easy alternative to computing the mean and std of CIFAR and normalizing with those. I should have made that clearer.

Also, I don't recommend using these EfficientNets for CIFAR at all. Upsampling from 32x32 to 224x224 or more via bilinear or bicubic interpolation makes the images extremely blurry.

As to whether up-scaling is needed, I would say it definitely is. These models have 5 reduction stages, reducing the spatial size by a factor of 2^5 = 32. At the end of the model, a 32x32 CIFAR input therefore has spatial dimensions 1x1xC, where C is the final number of channels. At such a small spatial resolution, most of the information useful to the model has already been lost.
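
The arithmetic makes this concrete:

```python
# Five stride-2 reduction stages divide spatial resolution by 2**5 = 32.
for size in (32, 224):
    print(size, '->', size // 2**5)
# 32 -> 1   (a CIFAR input collapses to a 1x1 feature map)
# 224 -> 7  (an ImageNet-sized input keeps a 7x7 feature map)
```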

Thanks again for the clarification. I should add that I am trying to replicate the SOTA CIFAR-100 accuracy. The details of how the transfer-learning experiments were structured are omitted from the paper and the official TPU repo, so I am experimenting to recreate those training configurations.

Blurring was my concern too at these large up-scaled resolutions, but I cannot see how else they would have achieved this. If the model is pre-trained on full-resolution 224x224 ImageNet images, then an object of interest (a cat or a car, for example) should take up roughly the same proportion of the up-scaled image as it does in the 32x32 CIFAR images, albeit with greatly reduced detail. That loss of detail will affect the quality of the activations most in the first few groups of layers, before the reductions make it more manageable.

When feeding small images into the model, the reduction steps will, as you say, produce degenerate 1x1 feature maps in the later layers, and I have observed this to cause NaNs and exploding gradients. But if the small CIFAR images are fed in at 32x32, the perceived pixel size of each object is also greatly reduced from what the pre-trained weights expect; e.g. a cat that was at least ~100 pixels across in ImageNet is now only ~12 pixels across in CIFAR.

I didn't know this paper ran CIFAR tests; that's interesting. If they have not specified the details in the paper, the next course of action would be to email the first author, cc'ing the rest, for additional details. You might get lucky and receive a reply soon enough.

In Table 5 they report transfer-learning results from ImageNet for CIFAR-100 and other datasets.

EfficientNet-B0 achieves 88.1% validation accuracy with 4M parameters, 21x fewer than NASNet-A, which achieves 87.5% with 85M parameters.

Similarly, EfficientNet-B7 achieves 91.7% validation accuracy with 64M parameters, compared to GPipe's 91.3% with 556M parameters: an 8.7x reduction.

I think yes, I will write to the authors after my next test runs. As a side note, are you accepting pull requests? I have been working on a modified version of the model compatible with TensorFlow 2.0, which required some changes. :)

Yes, I am accepting pull requests, but as TensorFlow 2 is still in beta, may I request that you duplicate whatever script you wish to edit and prepend tf_ to its name.

Certainly. I was wondering if a tf2 branch would be appropriate.

Yes that would be great as well.

@JossWhittle were you able to reproduce the CIFAR-10/100 results? If so, can you elaborate on the changes required?

@shairoz I was not able to. I got to within 3% on CIFAR-100, though.

Thank you @JossWhittle, do you mean within 3% of the reported value (i.e. ~87%)? If so, what changes did you make, and what optimizer/hyperparameters did you use?

About that accurate, yes. I used the hyperparameters originally reported for transfer learning from ImageNet weights to CIFAR-100 in the official repo accompanying the paper. Those were designed for a TPU pod, and I was running on a pair of P6000 GPUs.

I scaled the CIFAR-100 images up to the resolution the B7 model was trained on for ImageNet, using bicubic interpolation. I also experimented with whitening techniques such as ZCA whitening, which was used for the previous SOTA results on CIFAR-100 in the WideResNet paper.
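
For anyone trying to replicate this, Keras' ImageDataGenerator exposes ZCA whitening directly; a minimal sketch (whether this matches the exact whitening used in the WideResNet paper is an assumption):

```python
from keras.datasets import cifar100
from keras.preprocessing.image import ImageDataGenerator

(x_train, y_train), _ = cifar100.load_data()

# ZCA whitening requires fitting on the training data first; Keras computes
# the whitening matrix over the flattened 32*32*3 = 3072 pixel dimensions.
datagen = ImageDataGenerator(zca_whitening=True)
datagen.fit(x_train)

# Stream whitened batches, e.g.:
# model.fit_generator(datagen.flow(x_train, y_train, batch_size=64), ...)
```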