The Oumuamua
contains a parallel implementation of the Multivariate Adaptive Regression Splines algorithm suggested in Friedman (1991). The package can be installed from Github by calling
devtools::install_github("boennecd/Oumuamua")
There is yet no backronym for the package. The rest of this README contains simulation examples where this package is compared with the earth
package. The simulation examples are from Friedman (1991) and are shown to illustrate how to use the package. A discussion about implementation differences between the earth
package and this package is given at the end. Some comments about future tasks are also listed at the end.
We start of with a model with the additive model shown in Friedman (1991, 35). The model is
where follows a standard normal distribution and the s are uniformly distributed on . Moreover, we let the s be correlated and introduce a fixed number of noisy correlated variables.
We start by defining two function to simulate the covariates and outcomes.
# generates correlated uniform variables
get_covs <- function(N, p, rho = .5){
X <- matrix(rnorm(N * p), nc = p)
Sig <- diag(p)
Sig[lower.tri(Sig)] <- Sig[upper.tri(Sig)] <- rho
X <- X %*% chol(Sig)
apply(X, 2, pnorm)
}
# simulates data
additiv_sim <- function(N, p){
p <- max(5L, p)
x <- get_covs(N = N, p = p)
y <- .1 * exp(4 * x[, 1]) + 4 / (1 + exp(-20 * (x[, 2] - 1/2))) +
3 * x[, 3] + 2 * x[, 4] + x[, 5] + rnorm(N)
data.frame(y = y, x)
}
Then we run the simulations.
library(earth)
library(Oumuamua)
# parameters in simulation
test_size <- 10000
N <- c(100, 200, 500)
p <- 10
# returns minspan and endspan arguments. Similar to suggestion in Friedman
# (1991) though without adjusting N for number of striclty positive elements
# in the basis function.
get_spans <- function(N, p, alpha = .05){
Np <- N * p
minspan <- as.integer(ceiling(-log2(-1/Np * log1p(-alpha)) / 2.5))
endspan <- as.integer(ceiling(3 -log2(alpha / p)))
c(minspan = minspan, endspan = endspan)
}
# functions to fit models
earth_call <- function(sims){
spans <- get_spans(N = nrow(sims), p = p)
earth(y ~ ., data = sims, minspan = spans["minspan"],
endspan = spans["endspan"], degree = 1, penalty = 2)
}
oumua_call <- function(sims, n_threads = 5L){
spans <- get_spans(N = nrow(sims), p = p)
oumua(y ~ ., data = sims, control = oumua.control(
minspan = spans["minspan"], endspan = spans["endspan"], degree = 1L,
penalty = 2, lambda = 1, n_threads = n_threads))
}
# run simulations
set.seed(3779892)
res <- lapply(N, function(N_i){
# data used for validation
test_dat <- additiv_sim(test_size, p)
replicate(1000, {
# simulate
sims <- additiv_sim(N_i, p)
# fit models
efit <- earth_call(sims)
ofit <- oumua_call(sims)
# compute MSE and return
mse <- function(fit){
yhat <- predict(fit, newdata = test_dat)
mean((test_dat$y - yhat)^2)
}
c(earth = mse(efit), oumua = mse(ofit))
})
})
We make 1000 simulation above for the different sample sizes in the vector N
. We use p
covariates (10) of which only 5 are associated with the outcome. We only allow for an additive model by setting degree = 1
. The penalty
argument is parameter in the generalized cross validation criteria mentioned in Friedman (1991, 19–22). The lambda
parameter is the L2 penalty mentioned in Friedman (1991, 32). The mean squared errors for each sample size is shown below.
# stats for mean square error
names(res) <- N
lapply(res, function(x) apply(x, 1, function(z)
c(mean = mean(z), `standard error` = sd(z) / sqrt(length(z)))))
#> $`100`
#> earth oumua
#> mean 1.441382 1.441227
#> standard error 0.005773 0.004895
#>
#> $`200`
#> earth oumua
#> mean 1.207552 1.193977
#> standard error 0.002465 0.001961
#>
#> $`500`
#> earth oumua
#> mean 1.0826603 1.0777904
#> standard error 0.0008677 0.0007799
A comparison of computation times with both 1 and 5 threads is given below.
library(microbenchmark)
set.seed(17039344)
addi_runtimes <- local({
run_dat <- additiv_sim(10000, 10)
microbenchmark(
earth = earth_call(run_dat),
`oumua (1 thread) ` = oumua_call(run_dat, n_threads = 1L),
`oumua (5 threads)` = oumua_call(run_dat, n_threads = 5L),
times = 100)
})
addi_runtimes
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> earth 58.74 60.67 64.58 61.71 63.78 91.39 100
#> oumua (1 thread) 54.92 56.56 57.58 57.60 58.60 60.41 100
#> oumua (5 threads) 26.75 27.61 28.66 28.03 29.28 33.69 100
Next, we consider the non-additive model in Friedman (1991, 37). The true model is
We define a function to simulate the covariates and outcomes.
interact_sim <- function(N, p){
p <- max(5L, p)
x <- get_covs(N = N, p = p)
y <- 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 1/2)^2 +
10 * x[, 4] + 5 * x[, 5] + rnorm(N)
data.frame(y = y, x)
}
Then we perform the simulation.
# functions to fit models
earth_call <- function(sims){
spans <- get_spans(N = nrow(sims), p = p)
earth(y ~ ., data = sims, minspan = spans["minspan"],
endspan = spans["endspan"], degree = 3, penalty = 3, nk = 50,
fast.k = 20)
}
oumua_call <- function(sims, n_threads = 5L){
spans <- get_spans(N = nrow(sims), p = p)
oumua(y ~ ., data = sims, control = oumua.control(
minspan = spans["minspan"], endspan = spans["endspan"], degree = 3L,
penalty = 3, nk = 50L, lambda = 1, n_threads = n_threads, K = 20L))
}
# run simulations
set.seed(3779892)
res <- lapply(N, function(N_i){
# data used for validation
test_dat <- interact_sim(test_size, p)
replicate(1000, {
# simulate
sims <- interact_sim(N_i, p)
# fit models
efit <- earth_call(sims)
ofit <- oumua_call(sims)
# compute MSE and return
mse <- function(fit){
yhat <- predict(fit, newdata = test_dat)
mean((test_dat$y - yhat)^2)
}
c(earth = mse(efit), oumua = mse(ofit))
})
})
We have increased degree
to allow for interactions. We also increase penalty
as suggested in Friedman (1991) (though some further tuning might be needed). The fast.k
in the earth
and the K
is number of basis function that must be included before a queue is used as suggested in Friedman (1993). The mean squared error estimate is shown below.
# stats for mean square error
names(res) <- N
lapply(res, function(x) apply(x, 1, function(z)
c(mean = mean(z), `standard error` = sd(z) / sqrt(length(z)))))
#> $`100`
#> earth oumua
#> mean 2.4637 2.67555
#> standard error 0.0359 0.02785
#>
#> $`200`
#> earth oumua
#> mean 1.65725 1.485687
#> standard error 0.01222 0.005563
#>
#> $`500`
#> earth oumua
#> mean 1.251923 1.220515
#> standard error 0.002866 0.002431
A comparison of computation times with both 1 and 5 threads is given below.
set.seed(17039344)
inter_runtimes <- local({
run_dat <- interact_sim(10000, 10)
microbenchmark(
earth = earth_call(run_dat),
`oumua (1 thread) ` = oumua_call(run_dat, n_threads = 1L),
`oumua (5 threads)` = oumua_call(run_dat, n_threads = 5L),
times = 10)
})
inter_runtimes
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> earth 517.2 525.5 534.9 532.2 543.8 564.6 10
#> oumua (1 thread) 748.3 763.1 767.5 767.1 774.8 779.5 10
#> oumua (5 threads) 189.7 196.0 200.8 199.5 200.7 224.3 10
We add a dummy variable to the model from before in the last example. The code is very similar.
factor_sim <- function(N, p){
p <- max(5L, p)
x <- get_covs(N = N, p = p)
n_grp <- 5L
fac <- gl(n_grp, N / n_grp)
grp_effect <- seq(-3, 3, length.out = n_grp)
y <- 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 1/2)^2 +
10 * x[, 4] + 5 * x[, 5] + grp_effect[as.integer(fac)] + rnorm(N)
data.frame(y = y, x, fac = fac)
}
# functions to fit models
earth_call <- function(sims){
spans <- get_spans(N = nrow(sims), p = p)
earth(y ~ ., data = sims, minspan = spans["minspan"],
endspan = spans["endspan"], degree = 3, penalty = 3, nk = 50,
fast.k = 20)
}
oumua_call <- function(sims, lambda = 1, n_threads = 5L){
spans <- get_spans(N = nrow(sims), p = p)
oumua(y ~ ., data = sims, control = oumua.control(
minspan = spans["minspan"], endspan = spans["endspan"], degree = 3L,
penalty = 3, lambda = lambda, nk = 50L, n_threads = n_threads, K = 20L))
}
# run simulations
set.seed(3779892)
res <- lapply(N, function(N_i){
# data used for validation
test_dat <- factor_sim(test_size, p)
replicate(1000, {
# simulate
sims <- factor_sim(N_i, p)
# fit models
efit <- earth_call(sims)
ofit <- oumua_call(sims)
# compute MSE and return
mse <- function(fit){
yhat <- predict(fit, newdata = test_dat)
mean((test_dat$y - yhat)^2)
}
c(earth = mse(efit), oumua = mse(ofit))
})
})
The mean squared errors are given below.
# stats for mean square error
names(res) <- N
lapply(res, function(x) apply(x, 1, function(z)
c(mean = mean(z), `standard error` = sd(z) / sqrt(length(z)))))
#> $`100`
#> earth oumua
#> mean 3.20111 4.57737
#> standard error 0.04855 0.05963
#>
#> $`200`
#> earth oumua
#> mean 1.73350 1.633235
#> standard error 0.01344 0.008144
#>
#> $`500`
#> earth oumua
#> mean 1.306793 1.266576
#> standard error 0.003578 0.002915
A comparison of computation times with both 1 and 5 threads is given below.
set.seed(17039344)
factor_runtimes <- local({
run_dat <- factor_sim(10000, 10)
microbenchmark(
earth = earth_call(run_dat),
`oumua (1 thread) ` = oumua_call(run_dat, n_threads = 1L),
`oumua (5 threads)` = oumua_call(run_dat, n_threads = 5L),
times = 10)
})
factor_runtimes
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> earth 1017.6 1068.2 1083.3 1075.4 1102.8 1181.4 10
#> oumua (1 thread) 1034.9 1073.3 1105.9 1101.8 1138.1 1218.0 10
#> oumua (5 threads) 271.2 306.2 335.9 321.6 367.1 441.1 10
Below, we will check the result for the model
We only use one large sample to compare the performance for larger samples. The estimated models with have more basis functions due to the model and the larger sample. Thus, the suggestions in Friedman (1993) is more important.
interact_sim <- function(N, p){
p <- max(5L, p)
x <- get_covs(N = N, p = p)
y <- sin(pi * (x[, 1] + x[, 2] + x[, 3] )) +
sin(pi * ( x[, 2] + x[, 3] + x[, 4])) +
rnorm(N)
data.frame(y = y, x)
}
earth_call <- function(sims){
spans <- get_spans(N = nrow(sims), p = p)
earth(y ~ ., data = sims, minspan = spans["minspan"],
endspan = spans["endspan"], degree = 3, penalty = 3, nk = 100,
fast.k = 20)
}
oumua_call <- function(sims, lambda = 1, n_threads = 5L){
spans <- get_spans(N = nrow(sims), p = p)
oumua(y ~ ., data = sims, control = oumua.control(
minspan = spans["minspan"], endspan = spans["endspan"], degree = 3L,
penalty = 3, lambda = lambda, nk = 100L, n_threads = n_threads, K = 20L,
n_save = 3L))
}
# run simulations
set.seed(3779892)
res <- lapply(10000, function(N_i){
# data used for validation
test_dat <- interact_sim(test_size, p)
replicate(100, {
# simulate
sims <- interact_sim(N_i, p)
# fit models
efit <- earth_call(sims)
ofit <- oumua_call(sims)
# compute MSE and return
mse <- function(fit){
yhat <- predict(fit, newdata = test_dat)
mean((test_dat$y - yhat)^2)
}
c(earth = mse(efit), oumua = mse(ofit))
})
})
We also use the n_save
feature described in Friedman (1993, sec. 4). The mean squared errors are given below.
# stats for mean square error
lapply(res, function(x) apply(x, 1, function(z)
c(mean = mean(z), `standard error` = sd(z) / sqrt(length(z)))))
#> [[1]]
#> earth oumua
#> mean 1.0461485 1.0489324
#> standard error 0.0007171 0.0008054
A comparison of computation times with both 1 and 5 threads is given below.
set.seed(17039344)
deep_runtimes <- local({
run_dat <- interact_sim(10000, 10)
microbenchmark(
earth = earth_call(run_dat),
`oumua (1 thread) ` = oumua_call(run_dat, n_threads = 1L),
`oumua (5 threads)` = oumua_call(run_dat, n_threads = 5L),
times = 10)
})
deep_runtimes
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> earth 1567.2 1575.6 1615.5 1590.4 1614.7 1767.6 10
#> oumua (1 thread) 1539.8 1579.7 1594.6 1591.0 1602.4 1683.6 10
#> oumua (5 threads) 404.8 410.5 440.9 426.1 454.7 506.7 10
Some of the main differences between this package and the earth
package is
- the
earth
package has many more features! - the
earth
package is single threaded (as of this writing). - the
earth
package does not include an L2 penalty which simplifies the computation of the generalized cross validation criterion at each knot. - the
earth
package creates an orthogonal design matrix during the estimation which allows one (I think?) to skip some computations of the generalized cross validation criterion (a "full" forward and backward substitution). Put differently, there are no forward or backward substitutions in the C function that finds the knots in theearth
package.
Some computation can be skipped if one sets lambda
to zero (i.e., no L2 penalty). The following code blocks shows the impact.
set.seed(17039344)
factor_runtimes <- local({
run_dat <- factor_sim(10000, 10)
microbenchmark(
earth = earth_call(run_dat),
`oumua (1 thread) ` = oumua_call(run_dat, n_threads = 1L, lambda = 0),
`oumua (5 threads)` = oumua_call(run_dat, n_threads = 5L, lambda = 0),
times = 10)
})
factor_runtimes
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> earth 1114.3 1119.9 1145.4 1138.1 1147.5 1232.4 10
#> oumua (1 thread) 897.1 906.7 929.1 933.3 943.6 961.4 10
#> oumua (5 threads) 236.8 237.9 243.4 240.7 245.3 265.6 10
Settings lambda = 0
yields one less back substitution for each knot position. However, this is not preferred as the implementation is not numerical stable in some cases.
A final issue that still needs to be addressed is the L2 penalty in this package. When the knot position is found then the L2 penalty is applied to the coefficients in the model
where is the knot and are the other terms already included in the model. However, the final L2 penalty is applied to the coefficients in the model
which does not yield an equivalent model. The former is faster during the knot estimation while the latter is faster later as it yields a more sparse design matrix and thus faster computation times later.
Friedman, Jerome H. 1991. “Multivariate Adaptive Regression Splines.” Ann. Statist. 19 (1). The Institute of Mathematical Statistics: 1–67. https://doi.org/10.1214/aos/1176347963.
———. 1993. “Fast Mars.” Technical Report 110. Stanford University Department of Statistics.