tianyic / only_train_once_personal_footprint

OTOv1-v3, NeurIPS, ICLR, TMLR, DNN Training, Compression, Structured Pruning, Erasing Operators, CNN, Diffusion, LLM


setting on imagenet

mountain111 opened this issue

Following the optimizer setting for resnet18 on cifar10 in the tutorials, I applied the setting below for resnet50 on imagenet:
optimizer = oto.dhspg(
    variant='sgd',
    lr=0.1,
    target_group_sparsity=0.4,
    weight_decay=1e-4,
    start_pruning_steps=50 * len(trainloader),  # start pruning after 50 epochs
    epsilon=0.95)

However, I only get an accuracy of about 68. What should I change for resnet50 on imagenet, e.g., epsilon or start_pruning_steps?

@mountain111 Thanks for reaching out. An accuracy of 68 is low; it seems like you started from scratch? If so, I recommend starting from the pretrained resnet50, which is a common setup for other pruning works as well. In general, OTOv2 is flexible enough to start either from scratch or from a pretrained checkpoint.

If training from scratch, I would first recommend making sure the training pipeline can reach the desired accuracy without pruning. Once the baseline full model can be trained well, you can then start another run from scratch with pruning engaged. Since you start pruning at a late stage, I suspect that your baseline full-model training may be far from the desired accuracy, i.e., 76% for the full resnet50 on imagenet, which will cause trouble and needs to be addressed first.

To train without pruning under DHSPG: since DHSPG is a hybrid optimizer, if no pruning is triggered it performs exactly the same as the baseline variant optimizer. One way to disable pruning is to set target_group_sparsity=0.0 or start_pruning_steps=MAXIMUM_EPOCH.
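
For example, a pruning-free baseline run could look like the following (a minimal sketch assuming the same oto.dhspg interface and trainloader as in the snippet above; MAXIMUM_EPOCH stands for your total number of training epochs):

optimizer = oto.dhspg(
    variant='sgd',
    lr=0.1,
    weight_decay=1e-4,
    target_group_sparsity=0.0)  # no sparsity target, so pruning is never triggered
# alternatively, keep the sparsity target but push the pruning start past the end of training:
# start_pruning_steps=MAXIMUM_EPOCH * len(trainloader)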

Meanwhile, epsilon does not matter much once the group sparsity has reached the desired level. start_pruning_steps should correspond to epoch 15 or 30, since the lr decay period for imagenet is 30 epochs.
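
Concretely, an imagenet run that starts pruning at epoch 30 could look like this (again a sketch under the same assumptions as above, with the pruning start expressed in steps via the dataloader length):

optimizer = oto.dhspg(
    variant='sgd',
    lr=0.1,
    target_group_sparsity=0.4,
    weight_decay=1e-4,
    start_pruning_steps=30 * len(trainloader),  # start pruning at epoch 30, aligned with the lr decay period
    epsilon=0.95)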

More detailed tutorials including this experiment will be released to the public this summer after we complete a few time-sensitive projects.

@tianyic Thanks for your response.

pretrained model with accuracy 76.15,
lr=0.1,
target_group_sparsity=0.4,
start_pruning_steps=30,
8 GPUs,
epochs=120.
With the above setup, an accuracy of 74.17 is attained, which is still lower than the reported value of around 75.5.

The reproduced result is confusing me. What should I do to improve it?

Thanks!

@mountain111

Glad that you have boosted the accuracy by starting from the pretrained checkpoint. Your reproduced result has about a 1.1% gap from the reported one, which could be recovered via more training tricks such as data augmentation, label smoothing, and a little randomness. ImageNet training indeed requires a fairly sophisticated pipeline to reach the desired 75%+ result. But from memory, the results under 60% and 70% group sparsity are not that sensitive to these tricks. You could try them and see the difference.
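
For example (a minimal sketch, not the exact pipeline used in the paper), label smoothing can be enabled directly in the loss in recent PyTorch versions:

import torch.nn as nn

# label_smoothing is supported by torch.nn.CrossEntropyLoss since PyTorch 1.10;
# 0.1 is a commonly used value, chosen here only for illustration.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)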

It is a bit hard to pinpoint where to start without the code. As mentioned, we will release more tutorials, including this experiment, once we are free from time-sensitive projects this summer. For now, you could share your training script with me at tiachen@microsoft.com; I will take a look if I have bandwidth, but frankly cannot guarantee it before early June.

@tianyic Thanks for your quick reply!

The code I use is based on https://github.com/pytorch/examples/blob/main/imagenet/main.py, where normalization, transforms.RandomResizedCrop(224), and transforms.RandomHorizontalFlip() are adopted.
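
For reference, the training transforms in that script look roughly as follows (the mean/std are the standard ImageNet statistics used there):

import torchvision.transforms as transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),   # random crop and resize to 224x224
    transforms.RandomHorizontalFlip(),   # horizontal flip with p=0.5
    transforms.ToTensor(),
    normalize,
])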

@tianyic Hi, Tianyi
Could you please tell me the exact accuracies for 50%/60%/70% sparsity with OTOv2?

Look forward to your reply.

Thanks!

@mountain111

Good question. The best results were 74.5, 72.3, and 70.3, which were actually reported in a table. But the table was later replaced by a figure for better visualization at a reviewer's suggestion.

Note that when we release the tutorials this summer, we will update the arXiv version with the table added back, based on the latest OTO library, for ease of comparison with future works. Since the OTO library is actively evolving and some hyper-parameters and mechanisms have been simplified for ease of use, the arXiv version may show negligible differences.

Before the coming tutorial release, I would recommend reading up on the training tricks for imagenet if you haven't. I have not gone through https://github.com/pytorch/examples/blob/main/imagenet/main.py in depth due to my recent research workload; while it indeed looks relatively simple, I am not sure you can use it to reproduce the 76.1 accuracy on the full resnet50.

Closed, as the tutorial has been uploaded. We will update the arXiv version with the concrete numbers for ease of comparison later this month.