Emulator

C.Coleman-Smith, cec24@phy.duke.edu,
H.Canary, <cs.unc.edu/~hal>

Gaussian Process regression in C with an R library too

Brief:

A project implementing scalar Gaussian process regression with a variety of covariance functions and an optional regression prior mean (up to 3rd order). Maximum-likelihood hyper-parameters can be estimated for a given set of training data. For a given model (estimated hyper-parameters and a training data set) the posterior mean and variance can be sampled at arbitrary locations in the parameter space.

The model parameter space can have any dimension (within the limits of the optimizer); the output must be scalar.

The primary user interface is the interactive_emulator program.

~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=

Depends:

cmake (http://www.cmake.org/cmake/resources/software.html)
gsl (http://www.gnu.org/software/gsl/)
R (http://www.r-project.org/) (for using the R lib)

~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=

Compiling:

./build.sh
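
The build is cmake based, so if build.sh does not suit your setup the usual out-of-source invocation should also work (a sketch, assuming the standard top-level CMakeLists.txt):

mkdir build && cd build
cmake ..
make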

On OS X you need to set the following environment variables (assuming the libraries are installed under ~/local/lib; note that a quoted "~" is not expanded by the shell, so use $HOME):

export DYLD_LIBRARY_PATH="$DYLD_LIBRARY_PATH:$HOME/local/lib"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$HOME/local/lib"

Otherwise R will complain when you try to load libRbind.dylib.

~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=

Interactive Emulator:

  usage: 
    interactive_emulator estimate_thetas INPUT_MODEL_FILE MODEL_SNAPSHOT_FILE [OPTIONS]
  or
    interactive_emulator interactive_mode MODEL_SNAPSHOT_FILE
  or 
    interactive_emulator print_thetas MODEL_SNAPSHOT_FILE

  estimate_thetas -> train an emulator on a given set of model outputs; all the
    information needed to use the emulator is written to MODEL_SNAPSHOT_FILE

  interactive_mode -> sample the emulator contained in MODEL_SNAPSHOT_FILE at a
    set of points given on STDIN; the emulated mean and variance are output
    to STDOUT

  print_thetas -> write out the covariance length scales for the emulator given
    by MODEL_SNAPSHOT_FILE

  INPUT_MODEL_FILE can be "-" to read from standard input.

  The input MODEL_SNAPSHOT_FILE for interactive_mode must match the format of
  the output MODEL_SNAPSHOT_FILE from estimate_thetas.

  Options for estimate_thetas can include:
    --regression_order=0
    --regression_order=1
    --regression_order=2
    --regression_order=3
    --covariance_fn=0 (POWER_EXPONENTIAL)
    --covariance_fn=1 (MATERN32)
    --covariance_fn=2 (MATERN52)
    (-v=FRAC) --pca_variance=FRAC : keep principal components of the PCA decomposition up to the variance fraction FRAC

  Options which influence interactive_mode:
    (-X) --covmat : output the covariance matrix
    (-q) --quiet : run without any extraneous output
    (-z) --pca_output : emulator output is left in the PCA space

  General options:
    -h, -? : print the help dialogue

  The defaults are regression_order=0 and covariance_fn=POWER_EXPONENTIAL.

  These options will be saved in MODEL_SNAPSHOT_FILE.
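
  For example, to train an emulator with a linear regression prior mean and the
  Matern 5/2 covariance, then sample it at points read from a file (the data
  file names here are illustrative):

    interactive_emulator estimate_thetas model-runs.dat model.snapshot --regression_order=1 --covariance_fn=2
    interactive_emulator interactive_mode model.snapshot < sample-locations.dat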

INPUT_MODEL_FILE FORMAT (with multivariate output):
  Models with multivariate output values y = (y_1, ..., y_t), which we think of
  as the rows of a matrix, are specified as follows (so number_outputs = t):
  BEGIN EXAMPLE
   number_outputs
   number_params 
   number_model_points
   X[0,0]
   ...
   X[0,number_params-1]
   X[1,0]
   ...
   X[number_model_points-1,number_params-1]
   Y[0, 0]
   Y[0, 1]
   ...
   Y[0, number_outputs-1]
   Y[1, 0]
   ... 
   Y[number_model_points-1, number_outputs-1]
  END EXAMPLE

  number_outputs, number_params and number_model_points should be
  positive integers.  X[i,j] and Y[i,j] will be read as
  double-precision floats.
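
  For concreteness, a hypothetical input file for a model with 2 outputs, a
  1-dimensional parameter space and 3 model points would look like this, one
  value per line (the "<- ..." annotations are explanatory and not part of
  the file):

    2      <- number_outputs
    1      <- number_params
    3      <- number_model_points
    0.0    <- X[0,0]
    0.5    <- X[1,0]
    1.0    <- X[2,0]
    1.3    <- Y[0,0]
    0.2    <- Y[0,1]
    1.6    <- Y[1,0]
    0.4    <- Y[1,1]
    1.9    <- Y[2,0]
    0.7    <- Y[2,1]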

~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=

Example Analysis Using Interactive Emulator: 
 Several example analysis setups can be found in the test directory:
 
 ./test/uni-simple -> a test model with a single parameter and a 1-dimensional output
 ./test/uni-2d-param -> a test model with a 2-dimensional parameter space and a 1-dimensional output
 ./test/multi-simple -> a test model with a 3-dimensional parameter space and a 5-dimensional output

 Each example in test has a script train-emulator.sh, which builds an emulator based on that particular model; the emulator can then be sampled using sample-emulator.sh, and figures can be built using the enclosed R files. In outline the workflow looks like the sketch below.
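
 Each pair of scripts drives interactive_emulator as described above, something
 like this (the exact file names used inside the scripts may differ):

   cd test/uni-simple
   ./train-emulator.sh     # runs estimate_thetas, writes a model snapshot
   ./sample-emulator.sh    # runs interactive_mode on the snapshot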
	

~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=

Example Usage (R):

The file examples/simple-regression.R generates a set of model data for a 1d model and then trains an emulator on this data. Build and install the code base, then fire up an R shell and load the file with source("simple-regression.R").

If everything works OK you should see some output in the R terminal and a plot of the emulator predictions (red) against the true model (black) and the training points (green triangles). The output should look something like examples/r-simple-plot.pdf.

It is interesting to try changing the number of model samples (the argument to make.model) and the cov.fn and reg.order used in estimate.model.

The file 2d-regression.R builds an emulator for a 2d model, makes predictions over a grid, and plots the predicted mean, the variance, the original model and the variance-weighted squared difference:
		D(u*) = sqrt( (m(u*) - yM(u*))^2 / cov(u*) ).
This is fun to play with.

~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=

Emulator Theory Summary:

Gaussian Process regression is a supervised learning method: a representative set of samples of a model's output can be used to build a predictive distribution for the output at untried locations in the model's parameter space. This code base has been developed to apply this process to the output of complex computer codes.

Suppose we have a model Y_m(u) which produces scalar output as a function of some set of input parameters u. A design is a set of points D = {u_1, ..., u_n} in the u space at which we will evaluate the model. A set of training outputs Y_t = {Y_m(u_1), ... Y_m(u_n)} is produced by evaluating the model at each location in D. The sets D and Y_t are then sufficient to 'train' a GP regression model or Emulator to produce predictions at some new locations u*. 

The training process conditions a Gaussian Process prior on the set {Y_t, D} such that functions drawn from the ensuing posterior always pass through the training points. A Gaussian Process (GP) is a stochastic process for which any finite set of samples has a multivariate normal distribution. In essence we force the Gaussian Process to generate only functions which interpolate our data set.

The GP is specified in terms of a prior mean, usually chosen to be zero or given by a simple regression model of the training data, and a covariance function over the parameter space. The covariance function specifies the correlation between pairs of parameter values u_i, u_j. A power exponential form, with a nugget, is the default. 

 C(u_i, u_j) = theta_0 * exp( -1/2 (u_i - u_j)^2 / theta_l^2 ) + theta_1 * delta_ij,

where delta_ij is the Kronecker delta, so the nugget theta_1 contributes only on the diagonal.

The covariance function itself is controlled by a set of hyper-parameters theta_i, which act as characteristic length scales in the parameter space and are a priori unknown. The training process consists of estimating the 'best' set of length scales for the given model data set. This estimation is done through numerical maximization of the posterior hyper-parameter likelihood; a fully Bayesian treatment is also possible.

Once the GP has been conditioned, and a best set of hyper-parameters Theta has been obtained, we have a posterior predictive distribution for the model output at a new location u*:

 P(Y_m(u*) | Y_t, D, Theta) = MVN(m(u*), cov(u*)),

where the posterior mean and covariance are (for a zero prior mean)

 m(u*) = C(u*)_(D)^t  C(D,D)^{-1} Y_t
 cov(u*) = C(u*,u*) - C(u*)_(D)^t C(D,D)^{-1} C(u*)_(D).

Here C(D,D) represents the matrix formed by evaluating the covariance function C(u_i, u_j) at each pair of locations in D, and the vector C(u*)_(D) = {C(u*, u_1), C(u*, u_2), ..., C(u*, u_n)} holds the covariance between the new point and each location in the design set.

These equations have a relatively straightforward interpretation. The prior mean (0) is modified by the covariance structure deduced from the data, C(u*)_(D)^t C(D,D)^{-1} Y_t; at u* = u_i, where u_i is in D, C(u_i)_(D) is the i-th row of C(D,D), so we recover m(u_i) = Y_m(u_i). The prior covariance at u*, C(u*,u*), is reduced by the training set through C(u*)_(D)^t C(D,D)^{-1} C(u*)_(D), and at u* = u_i for u_i in D this reduces to cov(u_i) = 0 (neglecting the nugget). As such we have constructed an interpolating function.

It is important to note that, unlike some other data interpolation/extrapolation methods, we end up with a probability distribution for the value of our model at an untried location; as the distribution is normal, the mean and covariance are sufficient to describe it. These can be obtained through the emulate_at_point or emulate_at_list functions.
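
By way of illustration, here is a minimal sketch of the zero-prior-mean posterior equations above for a 1-dimensional parameter space, written against GSL (already a dependency). This is not the library's internal API: the function names (cov_fn, gp_predict) and the ordering of the theta hyper-parameters are hypothetical.

  #include <math.h>
  #include <gsl/gsl_matrix.h>
  #include <gsl/gsl_vector.h>
  #include <gsl/gsl_linalg.h>
  #include <gsl/gsl_blas.h>

  /* power-exponential covariance: theta[0] is the overall scale, theta[1]
     the nugget (added only on the diagonal, here via a crude u == v test),
     theta[2] the length scale -- a hypothetical ordering */
  static double cov_fn(double u, double v, const double *theta) {
    double c = theta[0] * exp(-0.5 * (u - v) * (u - v) / (theta[2] * theta[2]));
    return (u == v) ? c + theta[1] : c;
  }

  /* posterior mean and variance at ustar, given a 1d design D[n] and
     training outputs Yt[n], for a zero prior mean */
  void gp_predict(const double *D, const double *Yt, int n, double ustar,
                  const double *theta, double *mean, double *var) {
    gsl_matrix *C = gsl_matrix_alloc(n, n);  /* C(D,D)     */
    gsl_vector *k = gsl_vector_alloc(n);     /* C(u*)_(D)  */
    gsl_vector *y = gsl_vector_alloc(n);
    gsl_vector *w = gsl_vector_alloc(n);
    double kCk;
    int i, j;

    for (i = 0; i < n; i++) {
      gsl_vector_set(k, i, cov_fn(ustar, D[i], theta));
      gsl_vector_set(y, i, Yt[i]);
      for (j = 0; j < n; j++)
        gsl_matrix_set(C, i, j, cov_fn(D[i], D[j], theta));
    }

    gsl_linalg_cholesky_decomp(C);       /* C(D,D) = L L^t, in place        */
    gsl_linalg_cholesky_solve(C, y, w);  /* w = C(D,D)^{-1} Y_t             */
    gsl_blas_ddot(k, w, mean);           /* m(u*) = C(u*)_(D)^t C(D,D)^{-1} Y_t */
    gsl_linalg_cholesky_solve(C, k, w);  /* w = C(D,D)^{-1} C(u*)_(D)       */
    gsl_blas_ddot(k, w, &kCk);
    *var = cov_fn(ustar, ustar, theta) - kCk;  /* cov(u*) */

    gsl_matrix_free(C);
    gsl_vector_free(k);
    gsl_vector_free(y);
    gsl_vector_free(w);
  }

The library itself generalizes this to multi-dimensional parameter spaces, the other covariance functions and a regression prior mean.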

It is straightforward to draw samples from the predictive distribution, although care must be taken to expand the above equations correctly for a set of m observation locations U* = {u*_1, ... u*_m}.
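
For reference, at a set of m locations U* = {u*_1, ..., u*_m} the same equations take the matrix form

 m(U*) = C(U*,D) C(D,D)^{-1} Y_t
 cov(U*) = C(U*,U*) - C(U*,D) C(D,D)^{-1} C(D,U*),

where C(U*,D) is the m x n matrix of covariances between the new locations and the design points; cov(U*) is the full m x m posterior covariance needed to draw joint samples.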

For more information see the documentation, the website http://www.gaussianprocess.org and the invaluable book "Gaussian Processes for Machine Learning" by Rasmussen and Williams.

License: MIT