weiaicunzai / pytorch-cifar100

Practice on cifar100 (ResNet, DenseNet, VGG, GoogleNet, InceptionV3, InceptionV4, Inception-ResNetv2, Xception, Resnet In Resnet, ResNext, ShuffleNet, ShuffleNetv2, MobileNet, MobileNetv2, SqueezeNet, NasNet, Residual Attention Network, SENet, WideResNet)

model better than torchvision model on cifar100

Shiro-LK opened this issue

Hi,

Thank you for your repo.
I tried to compare the accuracy of your resnet18 with the torchvision one, and I don't understand why, without pretraining, your model reaches 70+ accuracy while the torchvision one only reaches about 55. Have you implemented something specific for the CIFAR dataset (cifar100)?

I just spent the last few days debugging this before finding your post, and I also can't explain why there is such a difference in performance.

OK, I figured it out. The original resnet, as implemented by pytorch, uses a 7x7 convolution in the first layer followed by max pooling. This repository changes that to a 3x3 convolution and skips the max pooling altogether. My guess is that with 32x32 images, the 7x7 kernel with so much padding plus the max pooling are what hurt the result.
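For reference, here's a minimal sketch of that kind of modification applied on top of torchvision's resnet18 (this is not this repo's exact code, which defines its own ResNet classes):

```python
import torch.nn as nn
from torchvision.models import resnet18

# Start from torchvision's resnet18 without pretrained weights.
model = resnet18(num_classes=100)

# Replace the 7x7 / stride-2 stem convolution with a 3x3 / stride-1 one...
model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
# ...and skip max pooling altogether.
model.maxpool = nn.Identity()
```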

This is just an intuition. The torchvision models were made specifically for ImageNet images, which are 224x224. Normally all architectures designed for ImageNet follow the same pattern in the first two or three layers: a big kernel followed by max pooling. This is done, I guess, to drastically reduce the spatial resolution early on. As @Queuecumber pointed out, the images in cifar100 are 32x32, so those architectures don't require such aggressive downsampling with big kernels and pooling layers for this dataset, and that's why this repository, as @Queuecumber mentioned, skips those unnecessary first layers.

I've never trained torchvision's resnet18 on cifar100, but your question is very similar to this one #22 , and you can see my answer to that question: #22 (comment). Hope this helps.

I used ResNet50 to train my model, and I ran into the same situation. Just as the owner answered in #22, the maxpool is one reason. And if you print the two models, you will also find that the kernel size of the first convolutional layer is different: 7x7 is too big for 32x32 images.
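For example, printing torchvision's first conv layer shows the 7x7, stride-2 kernel (the attribute names below are torchvision's; how to get at this repo's first layer depends on its own model definition):

```python
from torchvision.models import resnet50

tv_model = resnet50(num_classes=100)
print(tv_model.conv1)
# Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)

# The CIFAR version in this repo uses a 3x3, stride-1 conv instead, e.g.
# (assuming the repo's models package is importable):
# from models.resnet import resnet50 as cifar_resnet50
# print(cifar_resnet50())
```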

My explanation:

7x7 kernels with stride 2 and MaxPool (mentioned in the original paper and implemented in the torchvision model) drastically reduce the effective number of features from the 32x32 CIFAR images. This leads to severe overfitting.

In the implementation provided in this repository, the initial conv layer (originally 7x7, stride 2) is replaced by a 3x3, stride-1 conv layer, and the MaxPool is removed. This allows the remaining residual blocks to leverage a larger number of features and thus reduces overfitting.
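To make the arithmetic concrete, here is a small standalone sketch (not repo code) of how much spatial resolution each stem leaves on a 32x32 input:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

# ImageNet-style stem: 7x7 conv, stride 2, then 3x3 max pool, stride 2.
imagenet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),  # 32 -> 16
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),                  # 16 -> 8
)

# CIFAR-style stem in the spirit of this repo: 3x3 conv, stride 1, no pooling.
cifar_stem = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)

print(imagenet_stem(x).shape)  # torch.Size([1, 64, 8, 8])   -> 1/16 of the positions
print(cifar_stem(x).shape)     # torch.Size([1, 64, 32, 32])
```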

It has one disadvantage though: it occupies nearly seven times the memory of the original model on a GPU (this was tested with 224-sized images; for 32-sized images the factor may be different).
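If you want to check the memory cost yourself, a rough (hypothetical) measurement with PyTorch's CUDA memory stats could look like this; exact numbers will vary with batch size and resolution:

```python
import torch
from torchvision.models import resnet18

def peak_memory_mb(model, image_size, batch_size=8):
    """Peak CUDA memory (MB) for one forward/backward pass."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    model = model.cuda()
    x = torch.randn(batch_size, 3, image_size, image_size, device="cuda")
    model(x).sum().backward()
    return torch.cuda.max_memory_allocated() / 2**20

# Compare e.g. the stock torchvision model against a 3x3-stem variant
# (built as in the earlier snippet) at the resolution you care about.
print(peak_memory_mb(resnet18(num_classes=100), 224))
```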