anh-tong / hyper-opt

Bilevel Hyperparameter Optimization with Implicit Differentiation



Description

This repo implements methods for bilevel hyperparameter optimization with implicit differentiation (see the references listed below).

The motivation for reimplementing is to

  1. serve my own learning purpose
  2. have cleaner source code
  3. compare several approaches to the inverse Hessian-vector product

Implementation detail

Model

Suppose we have a model which is a subclass of nn.Module and contains all parameters. BaseHyperOptModel in model.py wraps this model and adds hyperparameters. BaseHyperOptModel manages and integrates all the hyperparameters of the main model, for example computing the train loss via the train_loss function and the validation loss via the validation_loss function. Currently, BaseHyperOptModel allows its subclasses to customize regularization and data augmentation.

Example

Let us define a logistic regression model for the L2 regularization problem:

import torch
import torch.nn as nn


class LogisticRegression(nn.Module):

    def __init__(self, input_dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn((input_dim, 1)))

    def forward(self, x):
        return x @ self.w

In this example, we will optimize a per-feature L2 regularization hyperparameter. The following class handles this hyperparameter:

from model import BaseHyperOptModel  # BaseHyperOptModel is defined in model.py


class L2RHyperOptModel(BaseHyperOptModel):

    def __init__(self, input_dim) -> None:
        network = LogisticRegression(input_dim)
        criterion = nn.BCEWithLogitsLoss()
        super().__init__(network, criterion)

        # declare hyperparameters
        self.hparams = nn.Parameter(torch.ones(input_dim, 1))

    @property
    def hyper_parameters(self):
        # return a list of hyperparameters
        return [self.hparams]

    def regularizer(self):
        # the regularizer is added to the train loss: 0.5 * w^T diag(lambda) w
        return 0.5 * (self.network.w.t() @ torch.diag(self.hparams.squeeze())) @ self.network.w
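
A minimal usage sketch, assuming the wrapper exposes train_loss and validation_loss methods that take a batch of inputs and targets and that train_loss also returns the train logits (these signatures are assumptions based on the description above, not the verified API):

# usage sketch; the train_loss / validation_loss signatures are assumed, not verified
model = L2RHyperOptModel(input_dim=20)

x_train, y_train = torch.randn(32, 20), torch.randint(0, 2, (32, 1)).float()
x_val, y_val = torch.randn(32, 20), torch.randint(0, 2, (32, 1)).float()

train_loss, train_logit = model.train_loss(x_train, y_train)  # criterion + regularizer()
val_loss = model.validation_loss(x_val, y_val)                # criterion only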

Optimizer

We introduce the BaseHyperOptimizer object, which computes the hypergradient of the hyperparameters via the implicit function theorem. A subclass extending this object should provide a way to approximate the inverse Hessian-vector product. The current implementation contains several approaches (a standalone sketch of the Neumann-series variant follows the list):

  1. Conjugate Gradient
  2. Neumann series expansion
  3. Fixed point
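
For reference, here is a standalone sketch of the Neumann-series approximation of the inverse Hessian-vector product, written with plain torch.autograd and independent of BaseHyperOptimizer: H^{-1} v is approximated by alpha * sum_{j=0}^{K} (I - alpha * H)^j v.

import torch

def neumann_ihvp(train_loss, params, v, num_iters=20, alpha=0.01):
    # Approximate H^{-1} v, where H is the Hessian of train_loss w.r.t. params,
    # using the truncated Neumann series alpha * sum_j (I - alpha * H)^j v.
    grads = torch.autograd.grad(train_loss, params, create_graph=True)
    p = [vi.clone() for vi in v]      # current term (I - alpha * H)^j v
    acc = [vi.clone() for vi in v]    # running sum, initialized with the j = 0 term
    for _ in range(num_iters):
        # Hessian-vector product via a second backward pass through the gradient
        hvp = torch.autograd.grad(grads, params, grad_outputs=p, retain_graph=True)
        p = [pi - alpha * hi for pi, hi in zip(p, hvp)]
        acc = [ai + pi for ai, pi in zip(acc, p)]
    return [alpha * ai for ai in acc]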

BaseHyperOptimizer lets you choose whether the hypergradient is computed over a single batch (set stochastic=False) or over multiple batches (set stochastic=True). Refer to the AISTATS paper (reference 5 below) for the stochastic version.

This optimizer also allows choosing between the Hessian matrix and the Gauss-Newton Hessian matrix; a sketch of a Gauss-Newton Hessian-vector product is given below.
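
To make the Gauss-Newton option concrete, here is a hedged sketch of a Gauss-Newton Hessian-vector product, Gv = J^T H_L J v, where J is the Jacobian of the train logits with respect to the parameters and H_L is the Hessian of the loss with respect to the logits. This is generic PyTorch double backward, not necessarily the repo's implementation.

def gauss_newton_hvp(loss, logits, params, v):
    # J v via the "double-backward" trick: g = J^T dummy, then d<g, v>/d dummy = J v
    dummy = torch.zeros_like(logits, requires_grad=True)
    g = torch.autograd.grad(logits, params, grad_outputs=dummy, create_graph=True)
    Jv = torch.autograd.grad(g, dummy, grad_outputs=v, retain_graph=True)[0]
    # H_L (J v): Hessian of the loss w.r.t. the logits, applied to Jv
    gl = torch.autograd.grad(loss, logits, create_graph=True)[0]
    HJv = torch.autograd.grad(gl, logits, grad_outputs=Jv, retain_graph=True)[0]
    # J^T (H_L J v): an ordinary vector-Jacobian product back to the parameters
    return torch.autograd.grad(logits, params, grad_outputs=HJv, retain_graph=True)

This is presumably why train_loss_func (described next) returns the train logits in addition to the loss: the Gauss-Newton product needs the Jacobian of the logits.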

In each optimizer step, BaseHyperOptimizer takes as inputs train_loss_func, a function returning two outputs (the train loss and the train logits), and val_loss, the validation loss.
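
A sketch of how these inputs might look for the L2 example above; only model.network and model.regularizer() come from the example, and the step call itself is schematic:

criterion = nn.BCEWithLogitsLoss()

def train_loss_func():
    logits = model.network(x_train)                          # model.network as in the example above
    loss = criterion(logits, y_train) + model.regularizer()  # train loss includes the regularizer
    return loss, logits                                      # (train loss, train logits)

def val_loss():
    return criterion(model.network(x_val), y_val)

# hyper_optimizer.step(train_loss_func, val_loss)            # schematic call into BaseHyperOptimizer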

Some useful references

  1. Hypertorch library: An excellent library from which this repo adopts many parts. However, it is a bit awkward to use with nn.Module.parameters.
  2. Gradient-Based Optimization of Hyperparameters: Hyperparameter optimization dates back to the year 2000 with this work by Bengio.
  3. Hyperparameter optimization with approximate gradient, ICML 2016: Perhaps the first work on hyperparameter optimization using implicit gradients. Here the approximation tool is the conjugate gradient method.
  4. On the Iteration Complexity of Hypergradient Computation, ICML 2020: An in-depth comparison (convergence and approximation error) between iterative differentiation (or unrolling) and approximate implicit differentiation. The approximation considers two cases: fixed point vs. conjugate gradient.
  5. Convergence Properties of Stochastic Hypergradients, AISTATS 2021: This work is quite important: previously, one might blindly train an implicit differentiation method with minibatches of data without knowing whether it really converges.
  6. Optimizing Millions of Hyperparameters by Implicit Differentiation: Approximate implicit differentiation with a Neumann series expansion.
  7. Efficient and Modular Implicit Differentiation: A recent work from Google describing a general approach that modularizes solvers and autodiff.
  8. Roger Grosse's course: Excellent material for beginners from basic optimization to bilevel optimization.
