chenxin061 / pdarts

Code for our paper "Progressive Differentiable Architecture Search: Bridging the Depth Gap between Search and Evaluation"

Reproducing the Results, and Questions

drcdr opened this issue · comments

commented

I am trying to reproduce the results of PDARTS, which looks like it provides awesome performance, congratulations!

Everything here is CIFAR-10. I didn't make any significant source-code modifications; all other arguments are the defaults from the repository as of Apr 30. (I did hard-code directory names.)

Here are the runs, with the labels I use below (Windows 10, PyTorch nightly from 4/30/2019, 2x Titan XP):
1) PDARTS: Just train, rerunning the (default) PDARTS genotype in genotypes.py:

  • python train_cifar.py --cutout
  • Final validation error: 2.76% (best 2.69%, epoch 533) (3.43M parameters)
  • Runtime: about 3 days

2) pdarts-BS64: Search and train, but using Batch-Size=64 since TitanXP is memory-limited.

  • python train_search.py --add_layers 6 --add_layers 12 --dropout_rate 0.1 --dropout_rate 0.4 --dropout_rate 0.7 --batch_size 64
  • Runtime: about 25 hours
  • Place the resulting genotype into genotypes.py. I used the first one printed, not the 5 printed after the 'Restricting skipconnect...' message.
  • This was: pdarts64 = Genotype(normal=[('skip_connect', 0), ('sep_conv_3x3', 1), ('skip_connect', 0), ('sep_conv_3x3', 1), ('skip_connect', 0), ('sep_conv_3x3', 2), ('skip_connect', 0), ('dil_conv_5x5', 4)], normal_concat=range(2, 6), reduce=[('avg_pool_3x3', 0), ('avg_pool_3x3', 1), ('skip_connect', 1), ('sep_conv_5x5', 2), ('avg_pool_3x3', 0), ('dil_conv_3x3', 2), ('avg_pool_3x3', 0), ('dil_conv_3x3', 3)], reduce_concat=range(2, 6))
  • python train_cifar.py --arch=pdarts64 --cutout
  • Final validation error: 3.2% (best 3.15%, epoch 598) (2.8M parameters)
  • Runtime: about 3 days, 16 hours

Some Questions

  1. The difference between my 2.76% and your 2.5% seems significant. Any ideas why this might be? Are you reporting best-val-error, or val-error-epoch-599? Are you reporting the best error over multiple runs, or just one run; or, the mean over N runs (if so, what's N)?
  2. What is the idea behind 'Restricting skipconnect'?
  3. How do my timings compare with what you (or others) might be getting on TitanXP cards? Are there any optimizations you suggest? I may try train_cifar.py with only one GPU next.
  4. Are the 600 epochs and cosine annealing absolutely necessary to achieve the advertised CIFAR-10 performance? I see DARTS and derived papers use this. It's fantastic to now have fast NAS, but when the training is 3x the search...
  5. Do you expect BS>=128 (if memory was available) would improve PDARTS even more? I actually was able to run BS=96 on the first pass (the 'num_to_keep' loop), but not the second pass. What are your thoughts on having a different BS per pass?
  6. I see you say you got search to run in 12 hours on a 1080-Ti, BS=64. That's twice as fast as what I got. What was your command line?
  7. Just wondering, have you considered trying the Cosine Power Annealing approach? See https://arxiv.org/abs/1903.09900, equation 2, as well as the discussion there about the benefits.
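For concreteness, here is the kind of schedule setup I mean in questions 4 and 7 — just a sketch built on the standard torch.optim.lr_scheduler API with DARTS-style hyper-parameters; the 'power cosine' factor is my own illustrative warp of the cosine curve, not necessarily equation 2 of the paper linked above.

import math
import torch
from torch import optim

EPOCHS = 600

def power_cosine_factor(epoch, p=10.0, total=EPOCHS):
    # plain cosine decay, mapped to [0, 1] ...
    c = (1.0 + math.cos(math.pi * epoch / total)) / 2.0
    # ... then warped through an exponential so the LR stays high longer (p > 1);
    # illustrative only -- see eq. 2 of arXiv:1903.09900 for the exact formulation
    return (p ** c - 1.0) / (p - 1.0)

def make_scheduler(optimizer, kind='cosine'):
    if kind == 'cosine':
        # the DARTS-style full-length cosine schedule
        return optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
    # hypothetical power-cosine variant
    return optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=power_cosine_factor)

model = torch.nn.Linear(10, 10)  # stand-in for the real network
optimizer = optim.SGD(model.parameters(), lr=0.025, momentum=0.9, weight_decay=3e-4)
scheduler = make_scheduler(optimizer, kind='cosine')

for epoch in range(EPOCHS):
    # train_one_epoch(...)  # training loop omitted
    scheduler.step()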

Well, that's enough questions for now. I appreciate your time and consideration.

For reference, here is a plot of the learning rate and validation error for these two runs. The bold line is the result of filtfilt with a window filter of length 25.
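The smoothing was along these lines — a sketch, assuming a per-epoch error array and a simple length-25 moving-average window:

import numpy as np
from scipy.signal import filtfilt

val_err = np.loadtxt('val_error.txt')        # hypothetical file with per-epoch validation error
window = np.ones(25) / 25.0                  # length-25 moving-average window
smoothed = filtfilt(window, [1.0], val_err)  # zero-phase filtering, so no lag in the plot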

[Figure 1: learning-rate and validation-error curves for the two runs]

Hi @drcdr, thank you for your interest in our work and for so many good questions. I will try to answer a few of them, and Xin will comment on some technical details later.

1/3. These will be answered by Xin. The difference between 2.76% and 2.50% is indeed a bit significant.

  1. By restricting the number of skip-connects we can make the searched architecture more stable. This operation is parameter-free, so too many of them can have a negative effect in the real training stage. BTW, the architecture you searched has 5 skip-connects, which is the main reason for its weaker performance.
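As a quick sanity check, the skip-connects can be counted directly from the genotype — a minimal sketch, assuming the searched cell was registered in genotypes.py as pdarts64 as described above:

import genotypes

def count_skip_connects(genotype):
    # count 'skip_connect' operations in both the normal and reduction cells
    ops = [op for op, _ in genotype.normal] + [op for op, _ in genotype.reduce]
    return ops.count('skip_connect')

print(count_skip_connects(genotypes.pdarts64))  # prints 5 for the cell searched above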

5/6. Batch size seems to be an important factor for both speed and stability. We will try to run more experiments as soon as possible. Currently we have only run on our V100 GPUs and estimated the time on a 1080Ti, which seems less accurate. Also, we have seen some evidence that suggests the importance of batch size. We will provide some solutions for 12GB GPUs later.

4/7. For fair comparison, we did not change this setting. Another data point: we only need 1 day (1 V100) to train CIFAR10/100 on a searched architecture. Maybe Xin knows more about why you need 3 days.

Your further questions and comments are very welcome.

Hi @drcdr, thanks for the comments. Here are some technical details of our experiments.

  1. Thanks for noting the --cutout term. As mentioned in the README file, you should also add the --auxiliary term to enable the auxiliary loss tower. Besides, we use a single GPU for evaluation. As far as I know, you may get a different test accuracy when training with more than one GPU. Our 2.50% test error is an average over 3 runs, of which the best is 2.42%.
  2. About 40 hours on a single P100 and 24 hours on a single V100. My suggestion is to do the training on a single GPU.
  3. The 12-hour search time on a 1080-Ti was our estimation based on previous experiments. My colleague told me he finished the search process on a 1080-Ti within 7 hours, which has been updated in the README file. The command line is exactly the same as the one in the README file.
    I notice that your OS is Windows, which is a significant difference between your environment and ours.
  4. I did not try cosine power annealing, but I did try a 1000-epoch cosine annealing schedule, which further boosts performance. That may answer the question about the necessity of the 600-epoch schedule.
    Hope the above is helpful.
commented

Hi @198808xc, @chenxin061 - thanks for the great, detailed responses. Based on this, I'll do some more investigation on my side, it may take a week or two - then, I'll follow up with what I find. Thanks

commented

I have a related question: how are you calculating the number of trainable parameters in the model? I wrote a quick utility, and I matched your 3.4M number for PDARTS when --auxiliary is False, but I get a higher number (3.91M) when --auxiliary is True:

Arch=   RESNET-50   # Parameters = 25557032
Arch=      PDARTS   Auxiliary=False  #Parameters = 3433798
Arch=      PDARTS   Auxiliary= True  #Parameters = 3910224
Arch=    pdarts64   Auxiliary=False  #Parameters = 2800270
Arch=    pdarts64   Auxiliary= True  #Parameters = 3276696
import torch
from torch import nn
from torchvision import models
import genotypes
import collections
from model import NetworkCIFAR as Network

# https://stackoverflow.com/questions/49201236/check-the-total-number-of-parameters-in-a-pytorch-model
def count_parameters(model):
    return sum(p.numel() for p in model.parameters())  # all of the params
    #return sum(p.numel() for p in model.parameters() if p.requires_grad)  # the trainable params

# just for reference
a = models.resnet50(pretrained=False)
count = count_parameters(a)
print ('Arch=%12s   # Parameters = %d' % ('RESNET-50', count))

args = collections.namedtuple('Args', ['init_channels', 'layers', 'arch', 'auxiliary', 'save'])
args.init_channels = 36
args.layers = 20
args.save = '.'
CIFAR_CLASSES = 10

for arch in ['PDARTS', 'pdarts64']:
    for auxiliary in [False, True]:
        genotype = eval("genotypes.%s" % arch)
        model = Network(args.init_channels, CIFAR_CLASSES, args.layers, auxiliary, genotype)
        count = count_parameters(model)
        print ('Arch=%12s   Auxiliary=%5s  #Parameters = %d' % (arch, auxiliary, count))
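A variant that skips the auxiliary tower (any parameter whose name contains 'auxiliary') should reproduce the aux-free counts — a sketch, assuming the tower is stored as self.auxiliary_head as in the DARTS-style NetworkCIFAR:

def count_parameters_no_aux(model):
    # skip parameters that belong to the auxiliary tower
    # (names like 'auxiliary_head.*' in the DARTS-style model code)
    return sum(p.numel() for name, p in model.named_parameters()
               if 'auxiliary' not in name)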

@drcdr Yes, if the auxiliary tower is included, the parameter count will be larger. However, the auxiliary tower is used for network training rather than testing, so we do not count those extra parameters for the testing phase. Actually, you will get the same test accuracy without the --auxiliary term. You will need to modify a few lines to handle the absence of the --auxiliary term in the model-loading part.
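For example, one way to handle this in test.py is to drop the auxiliary-head entries from the saved state_dict before loading — a minimal sketch, assuming the checkpoint is a plain state_dict and the tower's weights are prefixed with 'auxiliary_head':

import torch

def load_without_aux(model, checkpoint_path, auxiliary):
    # load a plain state_dict checkpoint; if the evaluation model was built
    # without the auxiliary tower, drop its weights before loading
    state_dict = torch.load(checkpoint_path, map_location='cpu')
    if not auxiliary:
        state_dict = {k: v for k, v in state_dict.items()
                      if not k.startswith('auxiliary_head')}
    # note: a checkpoint saved from an nn.DataParallel model may also carry a
    # 'module.' prefix on every key, which would need stripping as well
    model.load_state_dict(state_dict)
    return model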

commented

Well, for some reason PyTorch crashed at iteration #551 with --auxiliary. I'm trying to figure out whether warm restarts can be easily implemented. It looks like just CosineAnnealingLR() and torch.optim.SGD() would be affected (as well as torch.load'ing the checkpoint and setting up the model from the state_dict)?

Yes, you can resume training from the checkpoint saved in the --save path, just as you described.
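A minimal sketch of the save/resume logic — the checkpoint layout and file name here are assumptions, not necessarily what train_cifar.py writes:

import torch

def save_checkpoint(path, epoch, model, optimizer, scheduler):
    # bundle everything needed for a warm restart: weights, SGD momentum
    # buffers, and the position in the cosine-annealing schedule
    torch.save({
        'epoch': epoch,
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict(),
    }, path)

def resume(path, model, optimizer, scheduler):
    ckpt = torch.load(path, map_location='cpu')
    model.load_state_dict(ckpt['model'])
    optimizer.load_state_dict(ckpt['optimizer'])
    scheduler.load_state_dict(ckpt['scheduler'])
    return ckpt['epoch'] + 1  # epoch to continue the training loop from

# the training loop then runs: for epoch in range(start_epoch, args.epochs): ...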

commented

@chenxin061 OK, here's an update (thanks for your feedback).
Modifications:

  • train_cifar.py: optionally resume from a checkpoint; added CosineAnnealing support for this too; added support for single-GPU training
  • test.py: modified to ignore auxiliary_head* weights, if auxiliary == false

Results:

  • With --auxiliary there was not enough memory for BS=128, so I dropped down to BS=96.
  • I got better results: 2.56% final test error, but not the 2.42% final test error that you got. It did reach 2.45% for a couple of epochs, but crept up at the end. I don't think this is due to the resume-from-checkpoint.
  • Gray shading means not run by me; white cells are results that I obtained.
  • I may add to this table later. Problems that I'd still like to figure out: why search seems so much slower on my machine; whether dropping the skip-connects will help, for the architectures I found. Might look into Cosine Power Annealing, too.

[Image: results table summarizing the search and evaluation runs]

@drcdr I think the experimental results you got for evaluation on CIFAR10 are acceptable.

  • For one thing, a different batch size may lead to a different set of optimal hyper-parameters, resulting in slightly different performance.
  • Besides, it is quite common for the test accuracy to fluctuate between different runs on CIFAR10. The average test error we got over 3 runs is 2.50%, and the 2.56% test error you got from a single run is quite close to ours. The checkpoint we released with 2.42% test error is one of our best models.

For search cost:

  • My colleague updated the search time from 7 hours to 11 hours on a single 1080Ti GPU. The result I gave in my previous reply was obtained with a different configuration.
  • The evaluation speed on a single 1080Ti GPU is about 300s/epoch, i.e., 300 s × 600 epochs ≈ 50 hours, or roughly 2 days in total. I guess the bottleneck in your system may be the memory or CPU, since the 1080Ti and the Titan XP should perform similarly according to previous reports.

Hi @chenxin061, I'm reproducing your ImageNet results. I trained your model based on the DARTS code; here are my training log and model file: https://drive.google.com/open?id=1br4IPnHCV-zUHJkEGXPwXnsl6288yhFy . The final accuracy is 73.92%. I double-checked our code; the difference is that you use a cosine-decayed LR scheduler, while I use StepLR following DARTS. I use a batch size of 256, a starting LR of 0.1, and 8 GPUs, while you use a batch size of 1024 and a starting LR of 0.5. Did you try training your model with the StepLR scheduler, and how is the performance?

commented

@D-X-Y I haven't tried ImageNet training yet. Am I reading/understanding this right: did your 250-epoch ImageNet training take 11 days using 8 GPUs?! Also, it looks like you used the PDARTS genotype, so I guess you were trying to see how your run compared to the 24.4% top-1 test error number? (Also, I guess your batch size per GPU was only 32?)

@D-X-Y We did not try the StepLR scheduler for the PDARTS genotype. The results reported in our paper were obtained with the linear scheduler, and we also obtained similar test accuracy with cosine scheduler. We are re-training the DARTS genotype with linear and cosine scheduler and will later report the test accuracy here and in the next version of our paper.
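For reference, the schedulers being compared here can all be expressed with torch.optim.lr_scheduler — a sketch with illustrative hyper-parameters (the step size, gamma, and epoch count are assumptions, not the values used in either paper):

from torch import nn, optim

EPOCHS = 250
model = nn.Linear(10, 10)  # stand-in for the real ImageNet network
optimizer = optim.SGD(model.parameters(), lr=0.5, momentum=0.9)

# DARTS-style step decay (illustrative step_size/gamma)
step_lr = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# cosine decay over the full schedule
cosine_lr = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
# linear decay of the LR to zero over all epochs
linear_lr = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda e: 1.0 - e / EPOCHS)
# (in a real script you would create and step only one of these per optimizer)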

@drcdr Yes, 8 GPUs, batch-size-per-GPU was 32. I'm trying to get 24.4% top-1 test error.

@chenxin061 Thanks for your reply and also look forward to your results. I will also try DARTS using your training strategy after NIPS ddl :)

@D-X-Y @drcdr Sorry for the late reply.
An update on the ImageNet results for the DARTS genotype:
Cosine scheduler: top-1/top-5 test error 25.3%/7.8%;
Linear scheduler: top-1/top-5 test error 25.4%/8.0%.

@chenxin061 Thanks for your results! I'm also training DARTS and other NAS models with cosine scheduler.

I got 95.95% accuracy using the PDARTS genotype in genotypes.py without changing anything. (GPU: Tesla V100-SXM2-32GB)

@Margrate Judging from that result, maybe you missed the option terms --cutout and/or --auxiliary.

I ran it again with the option terms --cutout and --auxiliary added. I got 97.01% accuracy.


I think there must be some hidden difference. The expected validation accuracy is about 97.50% with the correct settings. You can also refer to issue #9, where the retraining valid_acc reported in that issue reached 97.52% at epoch 557.

It seems the genotype PDARTS in this line is different from the one reported in Figure 3(c).

Can you confirm that the released genotype (above) was giving you 97.5%?

commented

@arash-vahdat For me, see the PDARTSAux96 line in the table above (from May 16). My final error there was 2.56%, and the genotype I was using was the following, which looks the same as what you are referencing:

PDARTS = Genotype(normal=[('skip_connect', 0), ('dil_conv_3x3', 1), ('skip_connect', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 1), ('sep_conv_3x3', 3), ('sep_conv_3x3', 0), ('dil_conv_5x5', 4)], normal_concat=range(2, 6), reduce=[('avg_pool_3x3', 0), ('sep_conv_5x5', 1), ('sep_conv_3x3', 0), ('dil_conv_5x5', 2), ('max_pool_3x3', 0), ('dil_conv_3x3', 1), ('dil_conv_3x3', 1), ('dil_conv_5x5', 3)], reduce_concat=range(2, 6))

@drcdr Thanks for the reproduction.
@arash-vahdat The genotype in Figure 3(c) is from another run for an ablation study and is not the same as the genotype in genotypes.py. In our experiments, the one in genotypes.py got an average test accuracy of 97.50% over 3 runs, while the best run reached 97.58%.