GPDPQuantReg

Author

Carlos Omar Pardo Gomez (cop2108@columbia.edu)

Overview

R package for Bayesian nonparametric quantile regression, using a Gaussian process to model the trend and a Dirichlet process to model the error. An MCMC algorithm fits the model behind the scenes.

Model

Let y be a dependent random variable and x a vector of independent variables, so that P(y|x) is well defined. Let q_p be the function that returns the p-quantile of a random variable. For any probability p (between 0 and 1) at which you want to estimate the p-quantile of y|x, the model assumes that an observation y arises as the sum

$y = f_p(x) + \varepsilon_p$ ,

where f_p is the function that links x to y, and epsilon_p is a dispersion random variable whose p-quantile is 0. Given x, f_p(x) is a constant, and shifting a random variable by a constant shifts its quantiles by the same amount, so q_p(y|x) = f_p(x) + q_p(epsilon_p) = f_p(x). This package focuses on estimating f_p.
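The identity q_p(y|x) = f_p(x) can be checked numerically. The sketch below is only an illustration and not part of the package: `f_val` (standing in for f_p(x) at one fixed x) and the Gamma-distributed noise are assumptions chosen for the demo.

```r
set.seed(1)
p <- 0.25
f_val <- 3.7                    # hypothetical value of f_p(x) at a fixed x
raw_noise <- rgamma(1e5, 2, 1)  # any noise distribution would do here

# Shift the noise so that its empirical p-quantile is 0, as the model requires
eps_p <- raw_noise - quantile(raw_noise, p, names = FALSE)

# Observations follow y = f_p(x) + eps_p
y <- f_val + eps_p

q_eps <- quantile(eps_p, p, names = FALSE)  # numerically 0
q_y <- quantile(y, p, names = FALSE)        # numerically equal to f_val
```

Because the noise has p-quantile 0, the empirical p-quantile of `y` coincides with `f_val` up to floating-point error.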

We address this problem with the GPDP model, described below. (To follow it fully, some prior knowledge of Gaussian processes (Bagnell's lecture) and Dirichlet processes (Teh 2010), particularly the stick-breaking representation, is recommended.)

$\begin{aligned} y_i| f_p(x_i), z_i, \sigma_k^* &\sim AL_p({\varepsilon_p}_i = y_i - f_p(x_i) | \sigma_{z_i}^*), \\ f_p|m, k, \lambda &\sim \mathcal{GP}(m,k(\lambda)|\lambda), \\ \lambda &\sim GI(c_\lambda,d_\lambda), \\ z_i | \pi &\sim Mult_\infty(\pi), \\ \pi | \alpha &\sim GEM(\alpha), \\ \sigma_k^* | c_{DP}, d_{DP} &\sim GI(\sigma_k^*|c_{DP}, d_{DP}),\\ k(x_i, x_j | \lambda) &= \lambda \text{ } \exp\{-||x_i - x_j||_2\}. \end{aligned}$

Where:

p is the probability for which you want to estimate the quantile.
AL_p is the Asymmetric Laplace Distribution (ALD), with density function given by $w_p^{AL}(u|\sigma) = \frac{p(1-p)}{\sigma} \exp\left[ -\rho_p \left( \frac{u}{\sigma} \right) \right]$ , where $\rho_p(u) = u \times [pI_{(u>0)} - (1-p) I_{(u<0)}]$ .
GP is a Gaussian process, with mean (m) and covariance (k) functions.
GI is the Inverse-Gamma distribution.
Mult_inf is the Multinomial distribution, when the number of categories tends to infinity.
GEM (for Griffiths, Engen and McCloskey) is a distribution used in the Dirichlet process literature, as described in Teh (2010).
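The two closed-form ingredients above, the ALD density and the covariance function k, are easy to write down directly. This is a minimal sketch for intuition, not the package's internal code: the function names `rho_p`, `dald` and `gpdp_kernel`, and the values of `p`, `sigma` and `lambda`, are assumptions made for the example.

```r
# Check function: rho_p(u) = u * [p * I(u > 0) - (1 - p) * I(u < 0)]
rho_p <- function(u, p) u * (p * (u > 0) - (1 - p) * (u < 0))

# ALD density w_p^{AL}(u | sigma) as defined above
dald <- function(u, p, sigma) p * (1 - p) / sigma * exp(-rho_p(u / sigma, p))

p <- 0.25
sigma <- 2

# Probability mass on each side of 0, integrated separately because of the kink at 0
mass_below <- integrate(function(u) dald(u, p, sigma), -Inf, 0)$value
mass_above <- integrate(function(u) dald(u, p, sigma), 0, Inf)$value

# Covariance function k(x_i, x_j | lambda) = lambda * exp(-||x_i - x_j||_2)
# dist() returns pairwise Euclidean distances between points (rows of x)
gpdp_kernel <- function(x, lambda) lambda * exp(-as.matrix(dist(x)))

K <- gpdp_kernel(c(-1, 0, 2.5), lambda = 1.5)
```

Here `mass_below` comes out equal to p (up to integration error), which is exactly the statement that the p-quantile of the ALD is 0, and `K` is a symmetric matrix with lambda on its diagonal.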

Algorithm

An MCMC algorithm, specifically a Gibbs sampler, is used to simulate from the posterior distribution of f_p.

Since the theoretical model is nonparametric, it involves infinitely many parameters, particularly for the Dirichlet process. We obviously cannot estimate and store that many values, so this package uses the slice sampling algorithm proposed by Kalli et al. to truncate their number dynamically. The resulting draws approximately converge to those that the full, untruncated model would produce.
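To give a feel for how a slice variable can truncate the stick-breaking weights, here is a simplified sketch. It is not the package's implementation and not the full algorithm of Kalli et al.: the values of `alpha` and `n`, and the way the slice variables `u` are drawn, are assumptions made only to illustrate the idea.

```r
set.seed(42)
alpha <- 1                 # hypothetical concentration parameter of GEM(alpha)
n <- 20                    # number of observations
u <- runif(n, 0, 0.2)      # hypothetical slice variables, one per observation
u_min <- min(u)

weights <- numeric(0)      # stick-breaking weights pi_1, pi_2, ...
remaining <- 1             # length of the stick not yet broken

# Keep breaking the stick until the leftover mass is below the smallest slice.
# Every weight not yet generated is then smaller than every u_i, so no
# observation could be assigned to those components and they can be ignored.
while (remaining >= u_min) {
  v <- rbeta(1, 1, alpha)              # stick-breaking proportion
  weights <- c(weights, remaining * v)
  remaining <- remaining * (1 - v)
}

n_components <- length(weights)        # finite number of components kept
```

The number of components kept changes from iteration to iteration as the slice variables change, which is what makes the truncation dynamic.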

Example

First, install the package directly from GitHub. (I hope it will eventually be available from CRAN.)

library(devtools)
install_github("opardo/GPDPQuantReg")
library(GPDPQuantReg)

Then, you can create some artificial data to get familiar with how the package works. In this case, I'm using a complex trend function and a non-normal error, sampling ONLY 20 POINTS.

set.seed(201707)
f_x <- function(x) return(0.5 * x * cos(x) - exp(0.1 * x))
error <- function(m) rgamma(m, 2, 1)
m <- 20
x <- sort(sample(seq(-15, 15, 0.005), m))
sample_data <- data.frame(x = x, y = f_x(x) + error(m))

Now, it's time to fit the model with an MCMC algorithm for a specific probability p.

GPDP_MCMC <- GPDPQuantReg(y ~ x, sample_data, p = 0.250)

Since it is a nonparametric model, it is focused on prediction with a credible interval.

predictive_data <- data.frame(x = seq(-15, 15, 0.25))
credibility <- 0.90
prediction <- predict(GPDP_MCMC, predictive_data, credibility)
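For this simulated example the true quantile curve is known, which gives a reference to judge the prediction against. The error is Gamma(2, 1), whose p-quantile is not 0, so the true f_p(x) is the trend plus that quantile. The sketch below only computes this reference curve; the names `x_grid` and `true_quantile` are introduced here, and how to overlay the curve on the package's output depends on the structure of `prediction`, which is not shown here.

```r
# Trend function from the simulation above
f_x <- function(x) 0.5 * x * cos(x) - exp(0.1 * x)

p <- 0.25
x_grid <- seq(-15, 15, 0.25)   # same grid as predictive_data$x

# True p-quantile of y | x: trend plus the p-quantile of the Gamma(2, 1) error
true_quantile <- f_x(x_grid) + qgamma(p, shape = 2, rate = 1)
```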

And for a complex trend function, non-normal error and only 20 sampled points... we get AWESOME RESULTS!

Some diagnostics (ergodicity, autocorrelation, cross-correlation and traces) for the Markov chains are available too.

diagnose(GPDP_MCMC)

Test

TODO
