scikit-learn / scikit-learn

scikit-learn: machine learning in Python

Home Page: https://scikit-learn.org


DotProduct Kernel not non-negative definite after rounding errors

chrishmorris opened this issue · comments

Description

GPR will fail in the Cholesky decomposition if the kernel matrix has negative eigenvalues. The Cholesky decomposition itself is numerically stable. Unfortunately, rounding errors in the kernel can cause this error, and this occurs with the DotProduct kernel.

Steps/Code to Reproduce

from sklearn import gaussian_process
from sklearn.gaussian_process import kernels

model = gaussian_process.GaussianProcessRegressor(
    kernel=kernels.DotProduct(),
    optimizer='fmin_l_bfgs_b',
    random_state=None)
model.fit(X, Y)  # X, Y: training inputs and targets; an example X is linked below

Expected Results

Mathematically speaking, Dot Product is a kernel, i.e. symmetric and non-negative definite, so this should succeed.

Actual Results

A LinAlgError may be raised in the Cholesky decomposition. The error is not in that routine; rather, the value of kernels.DotProduct()(X) can have small negative eigenvalues caused by rounding errors.

An example of values for which this fails is at http://pims.structuralbiology.eu/X.csv
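
For anyone without access to that file, here is a small self-contained illustration of the same effect (synthetic data standing in for the linked CSV, so the exact values are only an assumption):

import numpy as np

np.random.seed(0)
X = np.random.rand(150, 3)        # low-rank data, similar in shape to the linked CSV
K = np.dot(X, X.T)                # the Gram matrix underlying the DotProduct kernel
print(np.linalg.eigvalsh(K)[:5])  # the smallest eigenvalues are ~0 and may be slightly negative due to rounding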

Versions

Linux-3.10.0-514.6.1.el7.x86_64-x86_64-with-centos-7.3.1611-Core
('Python', '2.7.12 |Continuum Analytics, Inc.| (default, Jul 2 2016, 17:42:40) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]')
('NumPy', '1.11.2')
('SciPy', '0.18.1')
('Scikit-Learn', '0.18')

Thanks for reporting the issue and providing a dataset. A full code example that breaks really helps someone trying to fix it.

Here is code to reproduce. The result is:

LinAlgError: 114-th leading minor not positive definite

import urllib2, csv   # Python 2; on Python 3 use urllib.request instead
from sklearn import gaussian_process
import sklearn.gaussian_process.kernels as kernels

f = urllib2.urlopen('http://pims.structuralbiology.eu/X.csv')
X = [[float(i) for i in row] for row in csv.reader(f, delimiter=',')]
f.close()
model = gaussian_process.GaussianProcessRegressor(
    kernel=kernels.DotProduct(),
    optimizer='fmin_l_bfgs_b',
    random_state=None
)
model.fit(X, [0.0]*len(X))

I can reproduce with a conda environment using openblas (although I got a complaint about the 140-th leading minor, not the 114-th one). I cannot reproduce the error with a conda environment using mkl.

For the record, readability counts, a lot! Please use triple back-quotes, aka fenced code blocks, to format error messages and code snippets.

Also, it does not seem to matter in this particular case, but setting random_state to an integer (rather than None) will make the behaviour deterministic.

I think that the underlying problem is that cholesky is supposed to work for positive-definite matrices. In this case np.dot(X, X.T) is positive semi-definite (X has shape (150, 3), so np.dot(X, X.T) has rank 3 at most). I am not an expert on this, but I would think there are ways to tackle it that may already exist in the Scientific Python ecosystem.
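
A quick sanity check of the rank claim (a sketch with synthetic data of the same shape, not the reporter's dataset):

import numpy as np

X = np.random.rand(150, 3)
K = np.dot(X, X.T)
print(np.linalg.matrix_rank(K))   # 3: K is singular, hence only positive semi-definite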

Yes, you correctly say that the Cholesky decomposition assumes a positive definite matrix. Any valid kernel for GPR must be non-negative definite. From the point of view of abstract mathematics, DotProduct meets this criterion, so scikit-learn correctly offers it in the kernels package.

The defect here is that when I put these together as above, rounding errors create some negative eigenvalues. So a mathematically correct regression failed due to numerical issues.

The defect is not in the Cholesky decomposition. It is rounding errors in DotProduct. It is not sufficient for this kernel to calculate the individual dot products to the available precision. To be usable for GPR, it must also ensure that the resulting matrix as a whole is non-negative definite.

I appreciate that this might be difficult. There is a possible workaround, which is to supply an identity kernel, one that has a covariance of 0 between different entries and 1 on the diagonal. Then people who hit this issue could add in a small multiple of the identity kernel as follows:

model = gaussian_process.GaussianProcessRegressor(
    kernel=kernels.DotProduct() + 0.001*kernels.Identity(),  # kernels.Identity() is proposed here, not yet implemented
    optimizer='fmin_l_bfgs_b',
    random_state=None
)

The defect here is that when I put these together as above, rounding errors create some negative eigenvalues. So a mathematically correct regression failed due to numerical issues.

np.dot(X, X.T) has some zero eigenvalues even in exact arithmetic. Even if there were no floating point errors and the smallest eigenvalues were zero instead of negative, cholesky would still give an error, because np.dot(X, X.T) is not positive definite. Try this for example:

import numpy as np

np.linalg.cholesky([[1., 0.], [0., 0.]])  # LinAlgError: Matrix is not positive definite

There is a possible workaround, which is to supply an identity kernel, one that has a covariance of 0 between different entries and 1 on the diagonal. Then people who hit this issue could add in a small multiple of the identity kernel as follows:

This seems like a reasonable work-around although I am not an expert in Gaussian processes at all.

@lesteve has identified another defect. Kernels must be positive semi-definite, i.e. have no negative eigenvalues, but they can have zero eigenvalues. There are implementations of a Cholesky-style factorization that can handle this, and that is what is required for GPR.
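
For illustration, one such factorization that may fit is scipy.linalg.ldl, an LDL^T decomposition (note it requires a newer SciPy than the 0.18.1 listed above, and it is not what GaussianProcessRegressor currently uses):

import numpy as np
from scipy.linalg import ldl   # available in SciPy >= 1.0

K = np.array([[1., 0.], [0., 0.]])      # positive semi-definite; plain Cholesky rejects it
lu, d, perm = ldl(K)                    # LDL^T tolerates zero (and negative) pivots
print(np.allclose(np.dot(np.dot(lu, d), lu.T), K))   # True: the factors reproduce K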

Just to be clear, the Identity kernel I suggest is not implemented yet. This is a request to implement it. It would also supply a workaround for the issue @lesteve has identified. It is analogous to adding one to a series before taking the reciprocal, to avoid divide-by-zero errors.

Just to be clear, the Identity kernel I suggest is not implemented yet.

It seems like kernels.DotProduct() + kernels.WhiteKernel(noise_level=0.001) does what you want.
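
A minimal sketch of that suggestion in context (the noise_level value is just an example, as in the comment above):

from sklearn import gaussian_process
from sklearn.gaussian_process import kernels

model = gaussian_process.GaussianProcessRegressor(
    kernel=kernels.DotProduct() + kernels.WhiteKernel(noise_level=0.001),
    optimizer='fmin_l_bfgs_b',
    random_state=None
)
# The WhiteKernel adds noise_level to the diagonal of the training kernel matrix,
# which keeps it positive definite despite rounding errors.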

ping @jmetzen who reimplemented the gaussian_process subpackage for 0.18. He may have an informed opinion about this.

Thanks for reporting this. @lesteve is right that adding a WhiteKernel with a certain noise level would add a diagonal matrix to the kernel. However, there is also the parameter alpha of GaussianProcessRegressor, which does exactly this and is even easier to use (see the GaussianProcessRegressor documentation). It defaults to alpha=1e-10, which should in theory ensure that the kernel is positive definite. If you encounter numerical problems, increasing alpha by a few orders of magnitude should be sufficient. However, the current behavior of throwing a LinAlgError is clearly not very user-friendly. At the very least, a useful error with the suggestion of increasing alpha should be thrown. If someone volunteers, it should be relatively easy to add (just catch LinAlgError and throw an appropriate Exception with a useful error string).
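
A minimal sketch of using alpha for this (the value 1e-6 is just an illustration; tune it to the data):

from sklearn import gaussian_process
from sklearn.gaussian_process import kernels

model = gaussian_process.GaussianProcessRegressor(
    kernel=kernels.DotProduct(),
    alpha=1e-6,                      # added to the kernel diagonal during fitting (default 1e-10)
    optimizer='fmin_l_bfgs_b',
    random_state=None
)
# model.fit(X, y)  # may now succeed where the default alpha hit a LinAlgError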

At the very least, a useful error with the suggestion of increasing alpha should be thrown. If someone volunteers, it should be relatively easy to add (just catch LinAlgError and throw an appropriate Exception with a useful error string).

This would make sense indeed.
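
For illustration, a rough sketch of the kind of error handling being suggested (hypothetical code, not scikit-learn's actual implementation):

import numpy as np

def cholesky_with_hint(K):
    # Hypothetical helper: catch the low-level LinAlgError and re-raise
    # with a message pointing the user at the alpha parameter.
    try:
        return np.linalg.cholesky(K)
    except np.linalg.LinAlgError:
        raise ValueError(
            "The kernel matrix is not positive definite, probably because of "
            "rounding errors. Try increasing the 'alpha' parameter of "
            "GaussianProcessRegressor by a few orders of magnitude."
        )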

@chrishmorris does using alpha in GaussianProcessRegressor work for your use case? If so can you please close this issue, and I'll create another one about the confusing error message.

Yes, thank you!

OK great, thanks for following up. I opened #8384 to improve the error message.

I also suggest that in http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html where it says

alpha : float or array-like, optional (default: 1e-10)

    Value added to the diagonal of the kernel matrix during fitting. Larger values correspond to increased noise level in the observations and reduce potential numerical issue during fitting. If an array is passed, it must have the same number of entries as the data used for fitting and is used as datapoint-dependent noise level. Note that this is equivalent to adding a WhiteKernel with c=alpha. Allowing to specify the noise level directly as a parameter is mainly for convenience and for consistency with Ridge.

this is modified to read

alpha : float or array-like, optional (default: 1e-10)

    Value added to the diagonal of the kernel matrix during fitting. Larger values correspond to increased noise level in the observations. This can also prevent a potential numerical issue during fitting, by ensuring that the calculated values form a positive definite matrix. If an array is passed, it must have the same number of entries as the data used for fitting and is used as datapoint-dependent noise level. Note that this is equivalent to adding a WhiteKernel with c=alpha. Allowing to specify the noise level directly as a parameter is mainly for convenience and for consistency with Ridge.

PR welcome!

Your change amounts to adding this (quite hard to figure out visually; next time it is probably better to post a diff or directly open a PR), right?

by ensuring that the calculated values form a positive definite matrix

@jmetzen what do you think?

Yes, adding this to the docstring would make sense. It can be added in #8384.