tianyic / only_train_once

OTOv1-v3, NeurIPS, ICLR, TMLR, DNN Training, Compression, Structured Pruning, Erasing Operators, CNN, Diffusion, LLM

Home Page: https://openreview.net/pdf?id=7ynoX1ojPMt


optimizer

zhu011 opened this issue

Very good work! I have a question: can I use other optimizers, such as SGD?

Thanks for the question.

You can select another optimizer through the variant argument of DHSPG. We currently support "sgd", "adam", and "adamw", and will add more upon request and based on the popularity of other optimizers. These three should cover the majority of DNN training experiments.
For example,

optimizer = oto.dhspg(
        variant="adamw",
        lr=1e-3,                    # set the same as the baseline training
        weight_decay=1e-2,          # set the same as the baseline training
        first_momentum=0.9,         # set the same as the baseline training
        second_momentum=0.999,      # set the same as the baseline training
        dampening=0.0,              # set the same as the baseline training
        target_group_sparsity=0.7,  # choose based on how much you want to compress
        start_pruning_steps=X * len(trainloader),  # start pruning after X epochs; starting after 1/5 of the total epochs is typically fine
    )
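
For SGD specifically, the call looks the same except for the variant and the momentum-related arguments. Below is a minimal sketch based on the snippet above; the numeric values are placeholders and should be copied from your baseline SGD training recipe.

optimizer = oto.dhspg(
        variant="sgd",
        lr=0.1,                     # set the same as the baseline SGD training
        weight_decay=5e-4,          # set the same as the baseline SGD training
        first_momentum=0.9,         # SGD momentum, set the same as the baseline training
        dampening=0.0,              # set the same as the baseline training
        target_group_sparsity=0.7,  # choose based on how much you want to compress
        start_pruning_steps=X * len(trainloader),  # as above, start pruning after X epochs
    )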

@zhu011

Here is a short summary of DHSPG that I posted in another thread. I paste it here for elaboration.

  • What is DHSPG?

DHSPG is a hybrid optimizer. It applies the baseline optimizer over all variables before pruning starts, and over the variables considered potentially important during pruning. For the variables considered potentially redundant, a so-called Half-Space step is performed to project them onto zero. Once the group sparsity reaches the target, the optimizer behaves as the baseline optimizer until final convergence (see the schematic sketch below).

The final performance typically depends on (1) what the baseline model itself can achieve, (2) whether enough steps are given for warming up, and (3) whether sufficiently many steps are given after reaching the target group sparsity.
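
To make the control flow concrete, here is a toy, self-contained sketch of one training step. It is purely illustrative: the importance criterion and the actual Half-Space projection in DHSPG are more involved than the simple norm ranking and shrinkage used here.

# Toy sketch of the DHSPG control flow described above (illustration only,
# not the library implementation).
def group_sparsity(groups):
    # Fraction of groups whose entries are all (numerically) zero.
    return sum(all(abs(w) < 1e-12 for w in g) for g in groups) / len(groups)

def dhspg_like_step(step, groups, grads, lr, start_pruning_steps, target_group_sparsity):
    pruning = (step >= start_pruning_steps and
               group_sparsity(groups) < target_group_sparsity)
    # Rank groups by a toy importance score (their L2 norm).
    norms = [sum(w * w for w in g) ** 0.5 for g in groups]
    cutoff = sorted(norms)[min(int(target_group_sparsity * len(groups)), len(groups) - 1)]
    for g, grad, norm in zip(groups, grads, norms):
        if not pruning or norm >= cutoff:
            # Warm-up, post-pruning, or important group: plain baseline (SGD) update.
            for i in range(len(g)):
                g[i] -= lr * grad[i]
        else:
            # Half-Space-like step on a redundant group: shrink it toward zero,
            # then snap it to exact zero once it is small enough.
            for i in range(len(g)):
                g[i] *= 0.5
            if sum(w * w for w in g) ** 0.5 < 1e-3:
                for i in range(len(g)):
                    g[i] = 0.0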

  • Does DHSPG have computational overhead compared with a standard optimizer?

As a hybrid optimizer, DHSPG does incur some computational overhead during pruning (while the group sparsity is increasing). The overhead varies with the model and dataset. For the majority of models it is negligible, but for some it is not (in the worst case I have encountered, it roughly doubled the cost). Note, however, that the overhead is temporary and disappears once the group sparsity reaches the target value (afterwards DHSPG behaves the same as the baseline optimizer).

Therefore, to speed things up if needed, I would suggest shortening the pruning procedure, i.e., making the group sparsity increase faster toward the target value, which can typically be achieved by tuning the hyperparameters related to group sparsity exploration (see the sketch below). In fact, most of the experiments I conducted could shrink the pruning stage to just a few epochs, which largely mitigates the overhead.
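
As a rough illustration, a shorter pruning phase mostly means starting the sparsity ramp earlier and letting it finish within a few epochs. The snippet below reuses the arguments from the example above; the commented-out parameter name is purely hypothetical, so please check the repository's current API for the actual hyperparameters that control group sparsity exploration.

optimizer = oto.dhspg(
        variant="adamw",
        lr=1e-3,                    # set the same as the baseline training
        weight_decay=1e-2,          # set the same as the baseline training
        target_group_sparsity=0.7,
        start_pruning_steps=2 * len(trainloader),  # start pruning after only 2 epochs
        # To finish the sparsity ramp within a few epochs, also shorten the
        # exploration schedule. NOTE: the name below is a hypothetical placeholder,
        # not a confirmed argument of oto.dhspg; consult the repo for the real one.
        # pruning_steps=5 * len(trainloader),
    )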