ramhiser / sparsediscrim

Sparse and Regularized Discriminant Analysis in R

predict.hdrda is slow for a large number of observations

ramhiser opened this issue

In some HDRDA sims, I noticed that the prediction of a large number of test observations was painfully slow after model selection. The model selection itself is quite fast, so it is counterintuitive that the easier prediction of test observations is the bottleneck.

Reproducible example from sims:

sample_sizes <- rep(25, 3)
test_sizes <- rep(10000, 3)
i <- 1
p <- 100
mu <- 0.5
rho <- rep(0.1, 3)
contamination_prob <- 0
block_size <- 100
mu1 <- rep(0, p)
mu2 <- c(rep(mu, block_size), rep(0, p - block_size))
mu3 <- -mu2
num_blocks <- p / block_size
train_data <- generate_contaminated(n=sample_sizes,
                                    mu=cbind(mu1, mu2, mu3),

test_data <- generate_contaminated(n=test_sizes,
                                   mu=cbind(mu1, mu2, mu3),
test_x <- test_data$x
test_y <- test_data$y

num_classes <- nlevels(train_y)
prior_probs <- rep(1, num_classes) / num_classes

hdrda_ridge_errors <- try({
  cv_out <- hdrda_cv(x = train_x,
                     y = train_y,
                     prior = prior_probs)
  hdrda_ridge <- list(lambda = cv_out$lambda, gamma = cv_out$gamma)
  mean(predict(cv_out, test_x)$class != test_y)
})

It turns out the culprit is in sparsediscrim:::predict.hdrda. The line quad_forms <- diag(quadform(class_est$W_inv, U1_x)) is slow because it constructs a 30K x 30K matrix only to extract its diagonal elements. The 30K corresponds to the 30,000 test observations (3 classes x 10,000 each).
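For a sense of scale, here is a rough back-of-the-envelope sketch of that intermediate matrix's footprint. It assumes the matrix is stored densely in double precision (8 bytes per entry); the variable names are illustrative only.

```r
n_test <- 30000                  # total number of test observations

# Dense n_test x n_test matrix of doubles, in gigabytes (assumed storage)
full_gb <- n_test^2 * 8 / 1e9

# Only the n_test diagonal entries that are actually needed, in kilobytes
diag_kb <- n_test * 8 / 1e3

c(full_gb = full_gb, diag_kb = diag_kb)
```

Under that assumption the full matrix takes about 7.2 GB, while the diagonal that is actually kept takes about 240 KB.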

When stepping through predict.hdrda, I computed the following:

 Browse[2]> dim(U1_x)
 [1]    72 30000
 Browse[2]> dim(class_est$W_inv)
 [1] 72 72
 Browse[2]> quad_forms <- diag(quadform(class_est$W_inv, U1_x))
 Browse[2]> foo <- apply(U1_x, 2, function(z) quadform(class_est$W_inv, z))
 Browse[2]> summary(quad_forms - foo)
       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
 -2.132e-14 -1.776e-15  0.000e+00 -4.460e-18  1.776e-15  3.553e-14

Practically speaking, there is no difference between computing the quadratic form of the entire matrix and computing it for each column vector in turn: the discrepancies are on the order of 1e-14, which is floating-point rounding. The latter is substantially faster, though. While stepping through, I ran a quick benchmark:


 Browse[2]> microbenchmark(matrix=diag(quadform(class_est$W_inv, U1_x)), vec=apply(U1_x, 2, function(z) quadform(class_est$W_inv, z)), times=20)
 Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval cld
  matrix 3924.7077 4400.3033 4499.3932 4433.4784 4668.4003 4856.3770    20   b
     vec  267.7087  278.6358  302.1803  300.0553  316.4827  369.2671    20  a

The vector approach is roughly 15 times faster, based on the median timings (about 4.4 s versus 0.3 s).
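One further option that might be worth trying avoids both the large matrix and the per-column apply loop: each diagonal entry is the column-wise sum of U1_x multiplied elementwise by W_inv %*% U1_x. The sketch below is a minimal illustration on simulated data, not the package's implementation. It assumes quadform(A, x) computes t(x) %*% A %*% x; quadform_assumed is a hypothetical stand-in for it, A is a simulated stand-in for class_est$W_inv, and the number of columns is reduced so the full-matrix version stays cheap. It has not been benchmarked here.

```r
set.seed(42)
q <- 72          # rows of U1_x, matching the dims reported above
n_test <- 500    # far fewer columns than the 30,000 in the issue

# Simulated stand-ins (assumptions, not the real model objects)
U1_x <- matrix(rnorm(q * n_test), nrow = q)
A <- crossprod(matrix(rnorm(q * q), nrow = q))  # symmetric q x q matrix

# Hypothetical stand-in for quadform(): t(x) %*% A %*% x
quadform_assumed <- function(A, x) drop(crossprod(x, A %*% x))

# Builds an n_test x n_test matrix, then keeps only its diagonal
full <- diag(quadform_assumed(A, U1_x))

# One quadratic form per column, as in the benchmark's "vec" expression
by_col <- apply(U1_x, 2, function(z) quadform_assumed(A, z))

# Column-wise sums of an elementwise product; largest intermediate is q x n_test
vectorized <- colSums(U1_x * (A %*% U1_x))

all.equal(full, by_col)
all.equal(full, vectorized)
```

If the assumption about quadform holds, all three results agree up to floating-point rounding, and the colSums version never allocates anything larger than a q x n_test matrix.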