predict.hdrda is slow for a large number of observations

Question

predict.hdrda is slow for a large number of observations

ramhiser opened this issue 9 years ago · comments

In some HDRDA sims, I noticed that the prediction of a large number of test observations was painfully slow after model selection. The model selection itself is quite fast, so it is counterintuitive that the easier prediction of test observations is the bottleneck.

John Ramey · Answer 1 · Mon Nov 02 2015 22:21:01 GMT+0800 (China Standard Time)

Reproducible example from sims:

sample_sizes <- rep(25, 3)
test_sizes <- rep(10000, 3)
i <- 1
p <- 100
mu <- 0.5
rho <- rep(0.1, 3)
contamination_prob <- 0
block_size <- 100
set.seed(i)
mu1 <- rep(0, p)
mu2 <- c(rep(mu, block_size), rep(0, p - block_size))
mu3 <- -mu2
num_blocks <- p / block_size
train_data <- generate_contaminated(n=sample_sizes,
                                    mu=cbind(mu1, mu2, mu3),
                                    block_size=block_size,
                                    num_blocks=num_blocks,
                                    rho=rho,
                                    uncontaminated_var=uncontaminated_var,
                                    contaminated_var=contaminated_var,
                                    contamination_prob=contamination_prob)


test_data <- generate_contaminated(n=test_sizes,
                                   mu=cbind(mu1, mu2, mu3),
                                   block_size=block_size,
                                   num_blocks=num_blocks,
                                   rho=autocorrelations,
                                   uncontaminated_var=uncontaminated_var,
                                   contaminated_var=contaminated_var,
                                   contamination_prob=contamination_prob)
test_x <- test_data$x
test_y <- test_data$y

num_classes <- nlevels(train_y)
prior_probs <- rep(1, num_classes) / num_classes

hdrda_ridge_errors <- try({
  cv_out <- hdrda_cv(x = train_x,
                     y = train_y,
                     prior = prior_probs)
  hdrda_ridge <- list(lambda = cv_out$lambda, gamma = cv_out$gamma)
  mean(predict(cv_out, test_x)$class != test_y)
})

John Ramey · Answer 2 · Mon Nov 02 2015 22:33:35 GMT+0800 (China Standard Time)

It turns out the culprit is in sparsediscrim:::predict.hdrda. The line quad_forms <- diag(quadform(class_est$W_inv, U1_x)) is slow because a 30K x 30K matrix is being constructed before the diagonal elements are extracted. 30K corresponds to the 30,000 test observations.

When stepping through predict.hdrda, I computed the following:

 Browse[2]> dim(U1_x)
 [1]    72 30000
 Browse[2]> dim(class_est$W_inv)
 [1] 72 72
 Browse[2]> quad_forms <- diag(quadform(class_est$W_inv, U1_x))
 Browse[2]> foo <- apply(U1_x, 2, function(z) quadform(class_est$W_inv, z))
!Browse[2]> summary(quad_forms - foo)
       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
 -2.132e-14 -1.776e-15  0.000e+00 -4.460e-18  1.776e-15  3.553e-14

Practically speaking, there is no difference between computing the quadratic form of the entire matrix vs doing this for each vector in turn. The latter is substantially faster though. While stepping through I ran a quick benchmark:

library(microbenchmark)

!Browse[2]> microbenchmark(matrix=diag(quadform(class_est$W_inv, U1_x)), vec=apply(U1_x, 2, function(z) quadform(class_est$W_inv, z)), times=20)
 Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval cld
  matrix 3924.7077 4400.3033 4499.3932 4433.4784 4668.4003 4856.3770    20   b
     vec  267.7087  278.6358  302.1803  300.0553  316.4827  369.2671    20  a

The vector approach is roughly 15 times faster.