- Trained WRN-28-10 with batch size 64 (128 in paper).
- Trained DenseNet-BC-100 (k=12) with batch size 32 and initial learning rate 0.05 (batch size 64 and initial learning rate 0.1 in paper).
- Trained ResNeXt-29 4x64d on a single GPU with batch size 32 and initial learning rate 0.025 (8 GPUs, batch size 128, and initial learning rate 0.1 in paper).
- Trained shake-shake models on a single GPU (2 GPUs in paper).
- Trained shake-shake-26 2x64d (S-S-I) with batch size 64 and initial learning rate 0.1.
- The test errors reported above are those at the last epoch.
- The experiments with only 1 run were done on a different computer from the one used for the experiments with 3 runs.
- The results reported in the tables are the test errors at the last epoch.
- All models are trained using cosine annealing with an initial learning rate of 0.2.
- The following data augmentations are applied to the training data (see the sketch below):
  - Images are padded with 4 pixels on each side, and 28x28 patches are randomly cropped from the padded images.
  - Images are randomly flipped horizontally.
- A GeForce GTX 1080 Ti was used in these experiments.
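For concreteness, this augmentation pipeline can be written with torchvision transforms. The snippet below is a minimal sketch of the two operations above, not necessarily the exact pipeline used in this repository:

```python
import torchvision.transforms as T

# Pad 4 pixels on each side, randomly crop a 28x28 patch from the
# padded image, and randomly flip horizontally with probability 0.5.
train_transform = T.Compose([
    T.RandomCrop(28, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
```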
## Results on MNIST

| Model | Test Error (%, median of 3 runs) | # of Epochs | Training Time |
|:------|:--------------------------------:|:-----------:|:-------------:|
| ResNet-preact-20 | 0.38 | 40 | 9m |
| ResNet-preact-20, Cutout 6 | 0.40 | 40 | 9m |
| ResNet-preact-20, Cutout 8 | 0.32 | 40 | 9m |
| ResNet-preact-20, Cutout 10 | 0.34 | 40 | 9m |
| ResNet-preact-20, Cutout 12 | 0.30 | 40 | 9m |
| ResNet-preact-20, Cutout 14 | 0.34 | 40 | 9m |
| ResNet-preact-20, Cutout 16 | 0.35 | 40 | 9m |
| ResNet-preact-20, RandomErasing | 0.36 | 40 | 9m |
| ResNet-preact-20, Mixup (alpha=1) | 0.39 | 40 | 11m |
| ResNet-preact-20, Mixup (alpha=1) | 0.37 | 80 | 21m |
| ResNet-preact-20, Mixup (alpha=0.5) | 0.33 | 40 | 11m |
| ResNet-preact-20, Mixup (alpha=0.5) | 0.38 | 80 | 21m |
| ResNet-preact-20, widening factor 4, Cutout 12 | 0.29 | 40 | 40m |
| ResNet-preact-50 | 0.39 | 40 | 22m |
| ResNet-preact-50, Cutout 12 | 0.31 | 40 | 22m |
| ResNet-preact-50, widening factor 4, Cutout 12 | 0.29 (1 run) | 40 | 1h40m |
| shake-shake-26 2x32d (S-S-I), Cutout 12 | 0.29 | 100 | 1h48m |
### Note

- The results reported in the table are the test errors at the last epoch.
- All models are trained using cosine annealing with an initial learning rate of 0.2.
- A GeForce GTX 980 was used in these experiments.
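In the table above, the number after "Cutout" is the side length in pixels of the square mask that is zeroed out. The snippet below is a minimal sketch of such a transform applied to a CHW tensor image; the class name and details are illustrative, not this repository's actual implementation:

```python
import numpy as np
import torch


class Cutout:
    """Zero out a random square patch of a CHW tensor image."""

    def __init__(self, mask_size: int):
        self.mask_size = mask_size

    def __call__(self, image: torch.Tensor) -> torch.Tensor:
        _, h, w = image.shape
        # Sample the mask center uniformly; the mask may be
        # clipped at the image border.
        cy = np.random.randint(h)
        cx = np.random.randint(w)
        y0 = max(0, cy - self.mask_size // 2)
        x0 = max(0, cx - self.mask_size // 2)
        image[:, y0:y0 + self.mask_size, x0:x0 + self.mask_size] = 0.0
        return image
```

Random Erasing is similar in spirit, but it samples the patch's area and aspect ratio and fills it with random values; recent torchvision releases also ship a built-in `transforms.RandomErasing`.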
## Experiments

### Experiment on residual units, learning rate scheduling, and data augmentation

In this experiment, the effects of the following on classification accuracy are investigated:

- PyramidNet-like residual units (sketched below)
- Cosine annealing of the learning rate
- Cutout
- Random Erasing
- Mixup
- Preactivation of shortcuts after downsampling

In all runs, ResNet-preact-56 is trained on CIFAR-10 with an initial learning rate of 0.2.
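As a reference, a PyramidNet-like basic block (no ReLU before the first convolution, extra BN after the last convolution) could be written as follows. This is an illustrative sketch, not this repository's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidNetLikeBasicBlock(nn.Module):
    """Pre-activation basic block in BN-Conv-BN-ReLU-Conv-BN order."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_channels)  # no ReLU after this BN
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                               stride=1, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)  # extra BN after the last conv
        self.shortcut = nn.Sequential()  # identity by default
        if stride != 1 or in_channels != out_channels:
            # 1x1 projection shortcut at downsampling stages.
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1,
                                      stride=stride, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv1(self.bn1(x))  # first ReLU removed
        y = self.conv2(F.relu(self.bn2(y)))
        y = self.bn3(y)
        return y + self.shortcut(x)
```

"Preactivating the shortcut after downsampling" means additionally passing the shortcut input through the pre-activation (BN, and optionally ReLU) at the stages where the resolution changes, instead of projecting the raw input as above.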
#### Note

- The PyramidNet paper (arXiv:1610.02915) showed that removing the first ReLU in residual units and adding a BN after the last convolution of each residual unit both improve classification accuracy.
- The SGDR paper (arXiv:1608.03983) showed that cosine annealing improves classification accuracy even without restarts (see the sketch below).
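Cosine annealing without restarts is available as a built-in PyTorch scheduler. The snippet below is a minimal sketch; the model and epoch count are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.2,
                            momentum=0.9, weight_decay=1e-4)

# Anneal the learning rate from 0.2 toward 0 over the whole run,
# with no warm restarts.
num_epochs = 40
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                       T_max=num_epochs)

for epoch in range(num_epochs):
    # train_one_epoch(model, optimizer)  # training loop elided
    scheduler.step()
```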
#### Results

- PyramidNet-like units work.
- When using PyramidNet-like units, it may be better not to preactivate shortcuts after downsampling.
- Cosine annealing slightly improves accuracy.
- Cutout, Random Erasing, and Mixup all work great.
- Mixup needs longer training (see the sketch below).
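For reference, here is a minimal sketch of mixup following Zhang et al. (arXiv:1710.09412). The alpha values in the tables above are the Beta-distribution parameter used here:

```python
import numpy as np
import torch
import torch.nn.functional as F


def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float):
    """Mix a batch with a shuffled copy of itself; lambda ~ Beta(alpha, alpha)."""
    lam = float(np.random.beta(alpha, alpha))
    index = torch.randperm(x.size(0), device=x.device)
    mixed_x = lam * x + (1 - lam) * x[index]
    return mixed_x, y, y[index], lam


# Inside the training loop (illustrative):
#   mixed_x, y_a, y_b, lam = mixup(inputs, targets, alpha=1.0)
#   outputs = model(mixed_x)
#   loss = (lam * F.cross_entropy(outputs, y_a)
#           + (1 - lam) * F.cross_entropy(outputs, y_b))
```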
| Model | Test Error (%, median of 5 runs) | Training Time |
|:------|:--------------------------------:|:-------------:|
| w/ 1st ReLU, w/o last BN, preactivate shortcut after downsampling | 6.45 | 95 min |
| w/ 1st ReLU, w/o last BN | 6.47 | 95 min |
| w/o 1st ReLU, w/o last BN | 6.14 | 89 min |
| w/ 1st ReLU, w/ last BN | 6.43 | 104 min |
| w/o 1st ReLU, w/ last BN | 5.85 | 98 min |
| w/o 1st ReLU, w/ last BN, preactivate shortcut after downsampling | | |
## References

- He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep Residual Learning for Image Recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. arXiv:1512.03385
- He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Identity Mappings in Deep Residual Networks." In European Conference on Computer Vision (ECCV), 2016. arXiv:1603.05027
- Zagoruyko, Sergey, and Nikos Komodakis. "Wide Residual Networks." In Proceedings of the British Machine Vision Conference (BMVC), 2016. arXiv:1605.07146
- Loshchilov, Ilya, and Frank Hutter. "SGDR: Stochastic Gradient Descent with Warm Restarts." In International Conference on Learning Representations (ICLR), 2017. arXiv:1608.03983
- Huang, Gao, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. "Densely Connected Convolutional Networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. arXiv:1608.06993
- Xie, Saining, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. "Aggregated Residual Transformations for Deep Neural Networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. arXiv:1611.05431
- Gastaldi, Xavier. "Shake-Shake regularization of 3-branch residual networks." In International Conference on Learning Representations (ICLR) Workshop, 2017. arXiv:1705.07485
- DeVries, Terrance, and Graham W. Taylor. "Improved Regularization of Convolutional Neural Networks with Cutout." arXiv preprint arXiv:1708.04552 (2017).
- Zhong, Zhun, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. "Random Erasing Data Augmentation." arXiv preprint arXiv:1708.04896 (2017).
- Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-Excitation Networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132-7141. arXiv:1709.01507
- Zhang, Hongyi, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. "mixup: Beyond Empirical Risk Minimization." In International Conference on Learning Representations (ICLR), 2018. arXiv:1710.09412
- Recht, Benjamin, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. "Do CIFAR-10 Classifiers Generalize to CIFAR-10?" arXiv preprint arXiv:1806.00451 (2018).
- Takahashi, Ryo, Takashi Matsubara, and Kuniaki Uehara. "Data Augmentation using Random Image Cropping and Patching for Deep CNNs." In Proceedings of the 10th Asian Conference on Machine Learning (ACML), 2018. arXiv:1811.09030