GPflow / GPflowOpt

Bayesian Optimization using GPflow

Cholesky failures due to inappropriate initial hyperparameters

javdrher opened this issue · comments

As mentioned in #4, tests often fail (mostly on Python 2.7) due to Cholesky decomposition errors. At first I thought this was mostly caused by updating the data and calling optimize() again, but resetting the hyperparameters didn't work every time. Increasing the likelihood variance sometimes helps slightly but isn't very robust either.

Right now the tests specify lengthscales for the initial model and apply a hyperprior on the kernel variance. Each BO iteration, the hyperparameters supplied with the initial model are used as the starting point, and restarts are applied by randomizing the Params. This approach made things a lot more stable, but it isn't perfect yet: especially in longer BO runs, reverting to the supplied lengthscales each iteration ultimately causes crashes.

Some things we may consider:

  • Normalizing the input/output data. I tested this a bit; it didn't solve the issue. Additionally, the model hyperparameters lose some interpretability. Note that I think we will ultimately need this for PES anyway.
  • Adding a callback for re-configuring hyperparameters. Instead of reverting to the initially supplied hypers each iteration, this function is called and configures the initial state. I think this is ultimately required for more complex modeling approaches, but for simple scenarios with GPR this has to work automatically.
  • Applying hyperpriors is going to be important.

I'd love to hear thoughts on how to improve this.

This is not necessarily my area of expertise, but I assume the Cholesky is failing because the covariance matrix is not positive definite. Is that correct? If so, how could this be the case?

I understand Cholesky is preferable for inversion problems due to performance. It would be nice if GPflow had a more stable fallback option such as QR, but that's out of our hands I suppose.

My experience is that normalization of both inputs and outputs is helpful. I don't agree that we lose any interpretability of the lengthscales: we can always just rescale them. My suggestion is to normalize by default and allow experts to turn it off.

I'm curious to know where the numerical issues are coming from. Are they happening when there are very few points? Or when we have many close-together points? The first suggests that the optimizer is reaching a poor setting of the lengthscales, which can be fixed by sensible prior regularization (a sensible default is possible if the inputs are normalized). The second is quite common in BO, and the simplest solution is to increase the noise: perhaps the noise variance should not be allowed to be smaller than, say, 1e-3 (note that the scale of Y should be 1 if the data are normalized).
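
To make that concrete, here is a rough sketch of both mitigations against the GPflow 0.x API (the Gamma prior parameters are illustrative, and fixing the noise at the floor is a cruder stand-in for a proper lower bound):

import numpy as np
import GPflow

X = np.random.rand(20, 2)                                   # inputs assumed normalized to [0, 1]
Y = np.sin(3.0 * X[:, 0:1]) + 0.05 * np.random.randn(20, 1)

kernel = GPflow.kernels.RBF(2, ARD=True)
# Regularize the lengthscales towards sensible values on the unit cube.
kernel.lengthscales.prior = GPflow.priors.Gamma(3.0, 1.0)

model = GPflow.gpr.GPR(X, Y, kern=kernel)
# Keep the noise variance from collapsing to zero; fixing it at a small floor is
# the bluntest option (a lower-bounded transform or a prior would be alternatives).
model.likelihood.variance = 1e-3
model.likelihood.variance.fixed = True
model.optimize()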

I was indeed thinking towards auto-normalization; in Python, a wrapper class which automatically takes care of the scaling should be straightforward. However, I would like to keep the model as much of a "black box" as possible, so it remains possible to use other kernels and plug in different models without writing wrappers for all of them. That makes automatic rescaling of the hyperparameters a bit tricky under this setting, as it isn't clear which hyperparameters are being optimized.

For the tests, the instability mostly occurs on Python 2.7. The typical scenario prior to a crash is huge lengthscales and a large kernel variance. With 3.5 it seems to be a lot more stable. It occurs with grid-based designs of size 16 or 25; I'm thinking the grids might actually make the optimization of the lengthscales a bit binary?

As far as I know it occurs at the start of the BO, when little data is available. In that case the signal variance (and lengthscales) explodes while the noise variance stays at 1, causing instability.

Bounding the optimization is indeed one option. Alternatively (or rather, in addition), I was also wondering whether it is possible to catch this in GPflow and return gracefully, setting the model state to the last hyperparameters that still worked (if any). In my opinion, the problem will then fix itself when more data becomes available (optionally helped along by MCMC sampling of the hyps too).
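
A rough sketch of that idea, assuming GPflow's get_free_state / set_state methods and that TensorFlow raises InvalidArgumentError when the Cholesky fails (optimize_with_fallback is just an illustrative helper, not existing API):

import tensorflow as tf

def optimize_with_fallback(model, last_good_state=None):
    # Try to optimize; if the Cholesky blows up, restore the last hyperparameter
    # state that still worked (or the state we started from) and carry on.
    fallback = last_good_state if last_good_state is not None else model.get_free_state().copy()
    try:
        model.optimize()
        return model.get_free_state().copy()
    except tf.errors.InvalidArgumentError:
        model.set_state(fallback)
        return fallback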

Normalization is a given, but I was wondering whether this should be the responsibility of GPflowOpt or of the user. I'm inclined to leave it for the user to decide; it would reduce the complexity. For instance, predictive entropy relies directly on the hyperparameters, each of which would then need to be (un)scaled correctly.

For the normalization, I was thinking about the following wrapper for the model in the Acquisition class (not complete, and it has some issues):

# Imports assumed here for the GPflow 0.x layout this sketch was written against:
import numpy as np
from GPflow.model import Model
from GPflow.param import DataHolder, AutoFlow
from GPflow import settings
float_type = settings.dtypes.float_type


class Normalizer(Model):
    """Wraps a GPflow model: X and Y are exposed on the original scale, while
    the wrapped model only ever sees normalized data."""

    def __init__(self, model, normalize_input=True, normalize_output=True):
        assert (model is not None)
        self._wrapped = model
        super(Normalizer, self).__init__()

        # Identity scalings by default; updated whenever X / Y are assigned.
        self._input = normalize_input
        self.input_transl = DataHolder(np.zeros((model.X.shape[1],)))
        self.input_scale = DataHolder(np.ones((model.X.shape[1],)))

        self._output = normalize_output
        self.output_transl = DataHolder(np.zeros((model.Y.shape[1],)))
        self.output_scale = DataHolder(np.ones((model.Y.shape[1],)))

        self.X = model.X.value
        self.Y = model.Y.value

    def __getattr__(self, item):
        # Delegate everything we do not define ourselves to the wrapped model.
        return self.__getattribute__('_wrapped').__getattribute__(item)

    def __setattr__(self, key, value):
        if key == '_wrapped':
            object.__setattr__(self, key, value)
            value.__setattr__('_parent', self)
            return

        super(Normalizer, self).__setattr__(key, value)

    # Overrides: X and Y are presented on the original (unnormalized) scale.

    @property
    def X(self):
        return DataHolder(self._wrapped.X.value * self.input_scale.value + self.input_transl.value)

    @property
    def Y(self):
        return DataHolder(self._wrapped.Y.value * self.output_scale.value + self.output_transl.value)

    @X.setter
    def X(self, value):
        if self._input:
            self.input_transl = value.mean(axis=0)
            self.input_scale = value.std(axis=0)
        self._wrapped.X = (value - self.input_transl.value) / self.input_scale.value

    @Y.setter
    def Y(self, value):
        if self._output:
            self.output_transl = value.mean(axis=0)
            self.output_scale = value.std(axis=0)
        self._wrapped.Y = (value - self.output_transl.value) / self.output_scale.value

    @AutoFlow((float_type, [None, None]))
    def predict_f(self, X):
        # Normalize the query points, predict with the wrapped model, then map
        # the predictions back to the original output scale.
        pred_f, pred_var = self._wrapped.build_predict((X - self.input_transl) / self.input_scale)
        return pred_f * self.output_scale + self.output_transl, pred_var * (self.output_scale ** 2)

That code looks wonderful. I think normalization is helpful for sure. The interpretation of a prior's parameters may change if the data scaling changes, which needs to be made clear to the user I think.

As to the Cholesky issue, as long as the covariance matrix is positive definite and not too close to singular, the decomposition should work. The following things could help:

  • Make sure that the initial samples on X are sufficiently different. The initial sampling strategy needs to ensure enough distance between the samples, and normalization on X may help out here.
  • Outcomes on Y of the initial samples need to be sufficiently different too, otherwise the inferred lengthscales could end up being too large, making the covariance matrix nearly singular. Next to normalization on Y, adding a bit of jitter to the normalized Y values initially could help here.
  • A more straightforward approach would be to limit the size of the lengthscales, either by capping them at a certain value or by biasing them to be 'small' through suitable hyperpriors, e.g. an InvGamma with two very small parameters as a default.

Since the third option is ultimately in the hands of the user (if they don't use the default priors), options 1 and 2 are probably the best to go forward with, which your code effectively seems to do.
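
For concreteness, a small sketch of what options 1 and 2 could look like (check_initial_design is just an illustrative helper, and the thresholds are arbitrary):

import numpy as np

def check_initial_design(X, Y, min_dist=1e-3, jitter=1e-6):
    # Option 1: make sure no two initial samples (on normalized X) coincide.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)
    if dists.min() < min_dist:
        raise ValueError("initial samples are too close together")
    # Option 2: add a little jitter so the (normalized) outcomes are not identical.
    return X, Y + jitter * np.random.randn(*Y.shape)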

One simple strategy for the lengthscales would be to set them to (e.g.) 0.8 times the length of the domain until we have sufficient points to estimate them more reasonably than that. For example, if the number of data points is smaller than the dimension of the data, it's going to be impossible to estimate the lengthscales anyway, so I suggest sticking with the above estimate.
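
A minimal sketch of that heuristic (lower and upper stand in for the per-dimension bounds of the user-specified domain):

import numpy as np

def initial_lengthscales(lower, upper, n_data):
    # Until there are more data points than input dimensions, pin the
    # lengthscales to 0.8 * the size of the domain in each dimension.
    lower, upper = np.asarray(lower, dtype=float), np.asarray(upper, dtype=float)
    if n_data < lower.size:
        return 0.8 * (upper - lower)
    return None  # enough data: let the optimizer estimate them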

Two things with normalization:

  1. I suggest that inputs are normalized to [0, 1], based not on the data but on the domain that the user specifies.

  2. Be careful dividing by the std(), which is zero if n == 1.
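
A small sketch covering both points (lower and upper are the per-dimension domain bounds supplied by the user):

import numpy as np

def normalize_data(X, Y, lower, upper):
    lower, upper = np.asarray(lower, dtype=float), np.asarray(upper, dtype=float)
    X_unit = (X - lower) / (upper - lower)   # point 1: scale by the domain, not the data
    std = Y.std(axis=0)
    std[std == 0] = 1.0                      # point 2: guard against n == 1 / constant Y
    return X_unit, (Y - Y.mean(axis=0)) / std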

So, in summary we could do the following:

  • Provide an automated normalizer (which can be switched off). It scales output to N(0,1) and input to [0, 1].
  • Change _optimize_all in acquisition.py to something like this:
def _optimize_all(self):
    for model in self.models:
        runs = []
        self._hyper_callback(model)
        for i in range(self._optimize_restarts):
            ...

The callback is configurable and by default it implements the current behaviour. This allows a user to optionally control the initial hyperparameters manually each iteration. I'd pass the model with the normalized data here, so priors, fixes and transforms can be derived for the proper scale (a possible default for this callback is sketched below the list).

  • Specify clearly in the documentation that with normalization switched on, the hyperparameters are determined on the normalized data, and that it is up to the user to rescale them if needed, as the rescaling depends on which kernel / likelihood was used.
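
As a hypothetical default for that callback (assuming a GPR with an ARD stationary kernel; the name _hyper_callback comes from the sketch above):

import numpy as np

def default_hyper_callback(model):
    # Reset the model to a neutral starting point on the normalized scale
    # before the optimization restarts run.
    model.kern.lengthscales = np.ones(model.X.shape[1])
    model.kern.variance = 1.0
    model.likelihood.variance = 1.0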

@jameshensman, regarding setting the initial default lengthscales to 0.8 * the size of the domain: what would be your suggestion for finding the lengthscale Params in the model? Should we traverse the tree below model.kern and look for lengthscale Params?
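
Something like this naive traversal, perhaps (using the GPflow 0.x Param / Parameterized classes; purely a sketch):

from GPflow.param import Param, Parameterized

def find_lengthscale_params(node):
    # Recursively collect every Param named 'lengthscales' below a node,
    # e.g. find_lengthscale_params(model.kern) also covers sums/products of kernels.
    found = []
    for key, value in node.__dict__.items():
        if key == '_parent':
            continue
        if isinstance(value, Param) and key == 'lengthscales':
            found.append(value)
        elif isinstance(value, Parameterized):
            found.extend(find_lengthscale_params(value))
    return found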

I was testing a bit further last night and I think that for the tests it might be sufficient to get rid of the grid designs used now and use something better. With just random points it was much harder to reproduce the instability (although those aren't ideal for automated tests either).

Provide an automated normalizer (which can be switched off). It scales output to N(0,1) and input to [0, 1].

Agree, this makes sense and still makes sure the scale of the input and output is similar.

Change _optimize_all in acquisition.py to something like this:

  • maybe rename optimize_all to optimize_models (I have a thing for proper names) :)
  • we would need a separate callback for each model in the list?
  • how would the user know which normalization is used? Is this passed along with the callback?

maybe rename optimize_all to optimize_models (I have a thing for proper names) :)

makes sense

we would need a separate callback for each model in the list?

or pass the list, or the index. As this callback is intended to provide control for specific situations, any way to find out what the models are would do, I guess

how would the user know which normalization is used? Is this passed along with the callback?

I'd remove all scaling wrappers so in the callback the user has the GPflow model, with the normalized data in it. Based on that, priors, fixes, transforms and values can be adjusted.

The problem disappeared completely after switching from the grids to Latin hypercubes. I removed all hyperparameter initialization, the scalings and the hyperprior, and it works out of the box.
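
For reference, a minimal Latin hypercube sampler on the unit cube (just a sketch; GPflowOpt's own design classes may differ):

import numpy as np

def latin_hypercube(n, d, rng=np.random):
    # One stratified sample per interval in every dimension, independently permuted.
    samples = (np.arange(n)[:, None] + rng.rand(n, d)) / float(n)
    for j in range(d):
        samples[:, j] = rng.permutation(samples[:, j])
    return samples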

Nevertheless, supporting the scalings still seems useful.

Steps to finalize the solutions to this issue:

  • Normalization
  • Hyperparameter configuration callback (in progress, might involve a GPflow PR, see GPflow/GPflow#442)
  • Add a failsafe mode for BayesianOptimizer which saves the evaluated data in case of a catastrophic model failure. While testing we noticed that crashing runs are usually resolved by taking the data prior to the crash and modeling it directly with GPflow, so this seems a natural addition to the framework.

All steps were implemented, closing this issue for now.