Cholesky decomposition fails

Question

mccajm opened this issue 7 years ago

I receive the following error when running Bayesian optimisation over 2 dimensions, using a GPR model with an RBF ARD kernel and a Latin hypercube design of size 10. I assume this is because the matrix can't be decomposed? Can it be fixed by changing the design or adding priors?

Thanks

2017-07-20 01:50:18.494935: W tensorflow/core/framework/op_kernel.cc:1158] Internal: cuSolverDN call failed with status =7
Traceback (most recent call last):
File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1139, in _do_call
return fn(*args)
File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1121, in _run_fn
status, run_metadata)
File "/home/adathy/miniconda3/lib/python3.6/contextlib.py", line 89, in __exit__
next(self.gen)
File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: cuSolverDN call failed with status =7
[[Node: Cholesky_1 = CholeskyT=DT_DOUBLE, _device="/job:localhost/replica:0/task:0/gpu:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "t1-hyperparam.py", line 103, in <module>
optimizer.optimize(run_model, n_iter=10)
File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/bo.py", line 131, in optimize
File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/optim.py", line 79, in optimize
File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/bo.py", line 147, in _optimize
File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/bo.py", line 67, in _update_model_data
File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/acquisition.py", line 122, in set_data
File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/acquisition.py", line 254, in setup
File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflow-0.3.8-py3.6.egg/GPflow/param.py", line 569, in runnable
return storage['session'].run(storage['tf_result'], feed_dict=feed_dict)
File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 789, in run
run_metadata_ptr)
File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: cuSolverDN call failed with status =7
[[Node: Cholesky_1 = CholeskyT=DT_DOUBLE, _device="/job:localhost/replica:0/task:0/gpu:0"]]
Caused by op 'Cholesky_1', defined at:
File "t1-hyperparam.py", line 101, in <module>
acquisition = GPflowOpt.acquisition.ExpectedImprovement(model)
File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/acquisition.py", line 248, in __init__
self.setup()
File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/acquisition.py", line 254, in setup
samples_mean, _ = self.models[0].predict_f(feasible_samples)
File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflow-0.3.8-py3.6.egg/GPflow/param.py", line 561, in runnable
storage['tf_result'] = tf_method(instance, *storage['tf_args'])
File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflow-0.3.8-py3.6.egg/GPflow/model.py", line 373, in predict_f
return self.build_predict(Xnew)
File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/scaling.py", line 210, in build_predict
return self.output_transform.build_backward(f), self.output_transform.build_backward_variance(var)
File "/home/adathy/miniconda3/lib/python3.6/site-packages/GPflowOpt-pre_release-py3.6.egg/GPflowOpt/transforms.py", line 112, in build_backward
L = tf.cholesky(tf.transpose(self.A))
File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_linalg_ops.py", line 227, in cholesky
result = _op_def_lib.apply_op("Cholesky", input=input, name=name)
File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/adathy/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
self._traceback = _extract_stack()

InternalError (see above for traceback): cuSolverDN call failed with status =7
[[Node: Cholesky_1 = CholeskyT=DT_DOUBLE, _device="/job:localhost/replica:0/task:0/gpu:0"]]

Joachim van der Herten · Answer 1 · Sat Jul 22 2017 07:57:03 GMT+0800 (China Standard Time)

This issue is indeed caused by a Cholesky decomposition failure, which can have several causes.
Does this happen immediately after the initial 10 points, or after some iterations of the BayesianOptimizer? In the former case, first try to fit the points with the GPflow model on its own, then tune the initial hyperparameters or add a prior. In the latter case, inspect the data just before the crash: do you have duplicate points? If not, try fitting the model to that data again and tune the initial hyperparameters or priors.
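
As a rough illustration of the duplicate check, here is a minimal NumPy-only sketch. It does not use the GPflow or GPflowOpt API. The helper names (find_duplicate_rows, ard_rbf_kernel, cholesky_ok), the sample data, the lengthscales and the jitter value are all hypothetical and chosen for illustration only. It flags repeated rows in the design, builds an ARD RBF kernel matrix by hand, and tests whether that matrix can be factorised with and without a small diagonal jitter.

```python
import numpy as np


def find_duplicate_rows(X, decimals=8):
    """Return indices of rows that repeat an earlier row (after rounding)."""
    X = np.asarray(X, dtype=float)
    _, first_idx = np.unique(np.round(X, decimals), axis=0, return_index=True)
    return np.setdiff1d(np.arange(len(X)), first_idx)


def ard_rbf_kernel(X, lengthscales, variance=1.0):
    """Hand-written ARD squared-exponential kernel matrix (illustrative only)."""
    Z = np.asarray(X, dtype=float) / np.asarray(lengthscales, dtype=float)
    diff = Z[:, None, :] - Z[None, :, :]      # pairwise scaled differences
    d2 = np.sum(diff ** 2, axis=-1)           # squared distances
    return variance * np.exp(-0.5 * d2)


def cholesky_ok(K):
    """True if NumPy can compute a Cholesky factor of K."""
    try:
        np.linalg.cholesky(K)
        return True
    except np.linalg.LinAlgError:
        return False


# Hypothetical 2-D design in which row 2 repeats row 0.
X = np.array([[0.1, 0.2],
              [0.5, 0.9],
              [0.1, 0.2],
              [0.7, 0.3]])

dups = find_duplicate_rows(X)
K = ard_rbf_kernel(X, lengthscales=[0.5, 0.5])
jitter = 1e-6  # assumed value; a suitable size depends on the data scale

print("duplicate row indices:", dups)
print("condition number of K:", np.linalg.cond(K))
print("Cholesky without jitter succeeds:", cholesky_ok(K))
print("Cholesky with jitter succeeds:", cholesky_ok(K + jitter * np.eye(len(X))))
```

A duplicated point gives the kernel matrix two identical rows, so it is singular and the factorisation typically fails unless noise or jitter is added to the diagonal. A very large condition number without exact duplicates points more towards the hyperparameters (for example, lengthscales that are too long for the spacing of the data), which is where tuning the initial values or adding priors can help.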

I have also opened a PR (#40) which will make saving data in case of a crash easier. Just resolving some compatibility issues now.