liuzhuang13 / DenseNet

Densely Connected Convolutional Networks, In CVPR 2017 (Best Paper Award).


Parameters and computation

ibmua opened this issue · comments

Hi there and great work! I'd actually figured out the very same concept myself before finding out you guys had already tested and published it. ✍(◔◡◔) Some of the design decisions I made were different, so I'd like to compare.

Where you report results on the CIFARs, it would be highly beneficial if you could also add the number of parameters you are using and, possibly, an estimated amount of computation. That information is really necessary for serious comparisons and for perfecting even this very architecture. Training logs would also give great insight, if you could add those.
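Counting parameters in Torch is basically a one-liner; a minimal sketch, assuming `model` is whatever nn/cudnn container has been built:

```lua
require 'nn'

-- `model` is assumed to be an already-constructed network (e.g. loaded with torch.load)
local params, gradParams = model:getParameters()   -- flattens all learnable weights into one tensor
print(('#parameters: %d'):format(params:nElement()))
```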

As for measuring the amount of computation, that's quite hard to do exactly, so I'd recommend at least reporting training time. It's a very inexact measure, but it provides at least some insight.
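For a rough per-epoch timing number, something like this sketch would do (here `trainOneEpoch`, `model` and `trainData` are just placeholders for whatever training loop you have):

```lua
require 'cutorch'

-- crude per-epoch timing; trainOneEpoch() stands in for your own training loop
local timer = torch.Timer()
trainOneEpoch(model, trainData)
cutorch.synchronize()   -- wait for queued GPU work to finish before reading the clock
print(('epoch time: %.1f sec'):format(timer:time().real))
```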

I got 19.5% on CIFAR-100+ with mean and std not adjusted (the whole dataset just scaled to [0..1] values), with 24M params and forward+backward running at 220 sec/epoch on a GTX Titan X, using the best dense-type architecture I had designed earlier (I could only experiment on a single GTX Titan X; I don't really have a lot of computational resources). It didn't have preactivation. It would most likely at least match the results you've published for DenseNet (L=100, k=24) on CIFAR-100+ if I used the right dataset (with std and mean adjusted). My code: https://github.com/ibmua/Breaking-Cifar/blob/master/models/hoard-2-x.lua (uses 4-space tabs; to achieve that result I used depth=2, sequences=2, and here's a log of the end of training: https://github.com/ibmua/Breaking-Cifar/blob/master/logs/load_59251794/log.txt ). Mind that I used groups, which are only accessible via Soumith's "cudnn", so if you want to try this you probably want to clone the whole thing. Also, note that I didn't use any Dropout (haven't even tried).
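To be explicit about the preprocessing difference I mean, roughly this (a sketch; `data` is assumed to be the raw N x 3 x 32 x 32 byte tensor of the training set):

```lua
require 'torch'

-- plain [0..1] scaling (what I used)
local x = data:float():div(255)

-- per-channel mean/std normalization (the "std and mean adjusted" variant)
for c = 1, 3 do
   local mean = x[{ {}, c, {}, {} }]:mean()
   local std  = x[{ {}, c, {}, {} }]:std()
   x[{ {}, c, {}, {} }]:add(-mean):div(std)
end
```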

Oops, I see you've actually published the # of params inside your paper.
Mind, btw, that Zagoruyko's numbers are wrong; he messed up the dataset: szagoruyko/wide-residual-networks#17 (comment)

You can see that DenseNet with 7M params performed worse than WideResNet with 2.7M on SVHN. From your paper it looks like that's because Sergey didn't mess up the data in that one, though you both only did [0..1] scaling when std+mean normalization would probably have done better. Can't say for certain, but it's likely.

On the other hand, one can always just go ahead and test the speed themselves. =) Though that's a last-resort measure, considering that this info really needs to be in the tables/figures.

Mind, btw, that Zagoruyko's numbers are wrong. He messed up the dataset szagoruyko/wide-residual-networks#17 (comment)

Thanks for the reminder. Yeah, we read the wide-resnet paper and knew that wide-resnet's preprocessing (whitened instead of only normalized) and data augmentation (reflect-padding instead of zero-padding) are both different from and slightly heavier than ours. But our setting is more widely used (see the references in our paper); it follows most publications.

We cannot rerun every baseline method, and we think it's fair to compare our model with wide-resnet under the setting above.

We appreciate your reminder about #parameters and #computation. We'll consider including #computation in our next update.

For a training time reference, our DenseNet (L=40, k=12) with batch size 64 and 300 epochs takes about 7 hours to finish on one TITAN X GPU. This includes about 0.5 hours of test time.
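(That is roughly 6.5 hours of actual training over 300 epochs, so somewhere around 78 seconds per epoch.)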

Also, as you said, it's important to keep other settings the same when comparing different architectures. So we keep every hyperparameter and other setting the same as the official implementation of ResNet, and use the standard preprocessing and data augmentation. We'd be interested in a comparison of your architecture with DenseNets under the same setting. But note that it's possible for a set of hyperparameters and settings to be good for one architecture and bad for another.

Yes, of course you wouldn't rerun every model =) I'm just saying so that you know that some reevaluation of how DenseNet compares to ResNet is, or at least will soon be, needed. Zagoruyko is going to retest his model and, I'm guessing, update his paper accordingly.

Regarding mirroring/reflections, fb.resnet.torch uses those too: https://github.com/facebook/fb.resnet.torch/blob/master/datasets/cifar10.lua Actually, for images of this type it's the most standard kind of preprocessing, much more usual than zero-padding. For images of numbers and letters, on the other hand, it's of course generally not used at all, except maybe for hand-selected parts of a dataset.

our Densenet(L=40, k=12) with batch size 64 and 300 epochs takes about 7 hours to finish on one TITAN X GPU

Yes, I actually saw that info on the main page. =) It's just that it would be more interesting to see it in a figure with points plotted against accuracy and #parameters (and #computation in another one). I'll suggest that to Sergey too.

Selecting proper hyperparameters is indeed an important part, and one should select the best hyperparams when evaluating a model's performance. Though it's also part of a model's evaluation to tell how critically its performance depends on having some very exact hyperparameters.

for images of such types it's the most default kind of preprocessing, much more usual than zero-padding

It seems to me that fb.resnet.torch's code uses zero-padding instead of reflect-padding: https://github.com/facebook/fb.resnet.torch/blob/master/datasets/transforms.lua
And zero-padding is more usual than reflection padding (see our references).

For the SVHN dataset we followed wide resnet's preprocessing and data augmentation (i.e., no data augmentation).

Thanks for the information. Just to clarify, we didn't play any tricks and tried to keep the comparison as fair as possible.

https://github.com/facebook/fb.resnet.torch/blob/master/datasets/cifar10.lua
Notice
t.HorizontalFlip(0.5),
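For context, the training-time pipeline in that file looks roughly like this (quoted from memory, so treat it as a sketch and check the linked cifar10.lua / transforms.lua for the exact code; the mean/std values are the usual approximate CIFAR-10 statistics):

```lua
local t = require 'datasets/transforms'   -- fb.resnet.torch's transform helpers

local meanstd = {
   mean = { 125.3, 123.0, 113.9 },   -- approximate CIFAR-10 per-channel means
   std  = {  63.0,  62.1,  66.7 },   -- approximate CIFAR-10 per-channel stds
}

local trainTransform = t.Compose{
   t.ColorNormalize(meanstd),   -- per-channel mean/std normalization
   t.HorizontalFlip(0.5),       -- random mirroring half of the time
   t.RandomCrop(32, 4),         -- pad by 4 on each side (with zeros) and take a random 32x32 crop
}
```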

As for which method is more popular, to tell the truth, I honestly don't know. Reflections just seem a more natural thing for this type of image and for convolutional networks with a large pooling layer at the end, as in all modern architectures. You see, for some images the padded crop can remove an important part of the image, and the padding you add introduces some informational noise. I don't even understand how padding actually improves anything other than, maybe, giving more attention to the central part of the image and, maybe, helping the network learn to classify a cut-off part of an object as that object.

If you get a dataset where the object to classify is not located in the image center but is often near the edges, my guess is zero-padding will do you no good.

Meanwhile, reflections can only hurt in the very rare cases where the object is not symmetric and a mirror image of it should be classified as a different object. In real life that's almost completely restricted to symbols and characters. So it's the least risky and most obvious type of augmentation.

Scaling and rotation seem to me much more meaningful augmentations than zero-padding for convolutional networks, but I guess for CIFAR they don't work very well, as the images there are very tiny. They should work to some extent if you upscale the images first, though. But that would make the network much slower and would probably defeat the purpose, as CIFAR is more of a quick playground for testing ideas than actual data. If people restricted themselves to testing on purely mirrored CIFAR without zero-padding, that would be of great help, IMHO: there are plenty of ways to zero-pad the dataset, and doing so makes no sense at all when the only purpose of CIFAR is to compare ideas, not to win some competition. I'm saying we should keep reflections, though, because these definitely can't hurt and are a tiny, unambiguous type of augmentation that should help networks that would otherwise tend to overfit very fast but are not inherently bad, since actual datasets are usually much bigger and allow rotation, scaling and other, better augmentations. Meanwhile, my guess is that zero-padding helps them to a much smaller degree and is not a very fair type of augmentation, since on CIFAR, once you combine the cuts made by the zero-padding crops, it almost completely eliminates everything from the picture except the actual object to classify.

Okay, I just realized we were talking about different things =D Reflection-padding means padding with non-zero pixels. Now I get it. Yeah, reflection-padding is likely not a very good type of augmentation, IMO: you're providing the net with a lot of garbage data. With zero-padding you do that as well, but the net quickly learns that that data is garbage, while with this type of augmentation I don't think that's as easy.

Zero-padding means padding the image with zeros and then cropping it back to the original size. It's equivalent to a translation followed by padding the other side with zeros. I think from this view it makes more sense.
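In code, the idea is just this (a rough sketch on a single 3 x 32 x 32 image tensor `img`, not our exact implementation):

```lua
require 'nn'

local pad = 4
-- zero-pad the 32x32 image to 40x40; the reflection-padded variant would use
-- nn.SpatialReflectionPadding here instead
local padded = nn.SpatialZeroPadding(pad, pad, pad, pad):forward(img)

-- take a random 32x32 crop out of the padded image
local offH = torch.random(0, 2 * pad)
local offW = torch.random(0, 2 * pad)
local crop = padded:narrow(2, offH + 1, 32):narrow(3, offW + 1, 32)
-- picking an off-center crop is the same as translating the image and filling
-- the uncovered border with zeros
```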

I just confused this with image horizontal mirroring.

Yeah, I just found that in WRN, and hence in my code, it seems we used the reflection-padding type. Lol.

I'll continue my investigations into CNNs on CIFAR without using any type of padding.

As for how any type of zero-padding influences learning, I guess it tells the network that patterns from the cut-off parts should generally be ignored and that the most important part of the object lies in the patterns located in the middle. Quite a lot of additional info.

@ibmua regarding zero vs. reflection padding, you might also have a look at this opinion: https://twitter.com/karpathy/status/720622989289644033

As far as I see it, with zero-padding you're effectively adding some gray color in place of an unimportant part of the image, thereby making the classifier more indifferent to the cut-off part and to such large gray patches. With reflection-padding you're adding very risky info. You still reap the benefit of making the classifier more indifferent to the cut-off part, but the info you're adding will be impressed into the weights, likely much more so than with zero-padding. Yeah, maybe in some cases it works better, because it adds indifference to some background-ish patterns. But it's risky, IMHO.

Anyway, not cropping the image at all is more interesting to me from the perspective of evaluating NN quality. IMHO, cropping is probably only reasonable if you're cooking the final net for production. It's pretty orthogonal to the network model itself: it's just a way of using your knowledge about the exact dataset to make it universally easier to grasp for any type of regression model.