Slow Convergence Times For Local Gradient Based Algos
jjoba opened this issue · comments
Hi! I've been using nloptr quite a bit recently and have noticed that convergence for local gradient-based algorithms is slower than I would expect. For example, in the code below, `NLOPT_LD_LBFGS` takes 185 iterations to converge, whereas `L-BFGS-B` in base R takes only 27. I also tried `NLOPT_LD_SLSQP` and `NLOPT_LD_MMA`, and both seem to get "stuck" or approach the optimum very slowly (if they ever reach it). In addition, SLSQP will occasionally give radically different solutions on sequential runs (I can't reproduce this on demand). SLSQP and MMA are both appealing because they can handle nonlinear constraints.
Below is a reproducible example.
If there is a parameter I'm missing that would help or if you need additional information, please let me know. Thanks in advance for looking into this.
```r
# R version 4.2.2 (64-bit) on Windows 11
library(nloptr) # 2.0.3
set.seed(1234)
n <- 300 # arbitrarily large number for testing
x <- exp(rnorm(n))
b <- exp(rnorm(n)) # beta/coef
# Objective function to minimize
f_obj <- function(x, n, b) {
  -sum(b * log(x) - x)
}
# Gradient of the objective function
f_grad <- function(x, n, b) {
  n - ((b * n) / x)
}
# Benchmark using the base R implementation; takes 27 iterations to converge
optim(rep(2, n), f_obj, gr = f_grad, method = "L-BFGS-B", n = n, b = b,
      lower = rep(1, n))
# Same as above using nloptr; takes 185 iterations to converge
nloptr(
  x0 = rep(2, n),
  eval_f = f_obj,
  eval_grad_f = f_grad,
  lb = rep(1, n),
  opts = list(
    "algorithm" = "NLOPT_LD_LBFGS",
    "xtol_rel" = 1e-08, # set to mimic optim's tolerance, roughly equivalent to factr = 1e7
    "maxeval" = 200),   # set to enable convergence
  n = n,
  b = b
)
# Optimal value of objective function = ~46.07

# MMA makes progress in the first few iterations, then gets stuck at ~70
# and terminates at iteration 54
nloptr(
  x0 = rep(2, n),
  eval_f = f_obj,
  eval_grad_f = f_grad,
  lb = rep(1, n),
  opts = list(
    "algorithm" = "NLOPT_LD_MMA",
    "xtol_rel" = 1e-08,
    "maxeval" = 200,
    "print_level" = 3), # to show the optimization path
  n = n,
  b = b
)
# SLSQP makes inefficient choices along the way -- each step is typically
# worse, not better, but it does eventually get close to the optimum
nloptr(
  x0 = rep(2, n),
  eval_f = f_obj,
  eval_grad_f = f_grad,
  lb = rep(1, n),
  opts = list(
    "algorithm" = "NLOPT_LD_SLSQP",
    "xtol_rel" = 1e-08,
    "maxeval" = 200, # at 10K it will get down to 47, seemingly by luck
    "print_level" = 3),
  n = n,
  b = b
)
```
Are you certain your gradient is correct? Shouldn't it be `1 - b / x` for each element?
In case it's helpful, I was just able to get MMA to converge by running it in a loop and passing the ending parameters of the prior iteration to the next.
```r
best_solution <- rep(2, n)
best_obj <- Inf
i <- 1

while (TRUE) {
  print(paste("Iteration:", i, "Obj:", best_obj))
  res <- nloptr(
    x0 = best_solution,
    eval_f = f_obj,
    eval_grad_f = f_grad,
    lb = rep(1, n),
    opts = list(
      "algorithm" = "NLOPT_LD_MMA",
      "ftol_rel" = 1.0e-16,
      "print_level" = 0,
      "maxeval" = 1000),
    n = n,
    b = b
  )
  if (res$objective < best_obj) {
    best_obj <- res$objective
    best_solution <- res$solution
    i <- i + 1
  } else {
    break
  }
}
```
I just double checked on Wolfram Alpha and got `n - (b * n) / x`.
I think the reason for the difference is that the gradient supplied is the derivative of the whole sum treated as a scalar function, whereas nloptr needs the partial derivative w.r.t. each element in the vector x.
I think that there may be a mistake, as you're summing over all of the elements. Consider the function on an individual observation: `-(b * log(x) - x)`. The derivative of that is, if I remember my calculus from 30-odd years ago, `1 - b / x`.
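A quick finite-difference check (a hedged sketch; the setup mirrors the original example at a smaller `n`) agrees with that derivative and shows the originally supplied gradient is n times too large:

```r
# Compare the originally supplied gradient, n - (b * n) / x, against the
# per-element derivative 1 - b / x, using central finite differences.
set.seed(1234)
n <- 5
x <- exp(rnorm(n))
b <- exp(rnorm(n))

f_obj <- function(x, b) -sum(b * log(x) - x)

# central-difference approximation of each partial derivative
h <- 1e-6
num_grad <- sapply(seq_len(n), function(i) {
  e <- replace(rep(0, n), i, h)
  (f_obj(x + e, b) - f_obj(x - e, b)) / (2 * h)
})

supplied <- n - (b * n) / x  # gradient from the original post
correct  <- 1 - b / x        # per-element derivative

max(abs(num_grad - correct))   # tiny: matches the numeric gradient
max(abs(num_grad - supplied))  # large: supplied version is n times too big
```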
Furthermore, I think you're using the optimization incorrectly.
First, what precisely are you trying to optimize? Is that a loss distribution? Is that a cost function which is always non-negative? If you can describe your objective, that would help. Looking at the log-space form, the "normal space" form would need to be proportional to `x^b * exp(-x)`.
What it looks like you are doing from your `f_obj` is trying to solve for a separate `x[i]` for every element of `b`. Perhaps you meant to find a SINGLE `x`?
So forgive me, but I'm struggling to understand just what is being minimized and if it even can be. I admit, although I use calculus every now and then, my last rigorous exposure was about 30 years ago, so I may be rusty.
Thanks for the follow up!
Descriptively, here's what I'm trying to solve:
This is inspired by Media Mix Modelling (or, more generally, budget allocation). The goal is to find how much to "spend" (x) in each media channel (i) to maximize net revenue (return minus channel spend). Each channel has diminishing returns as spend increases (modeled by ln(x)), but the slope/effectiveness of each channel is different (b; I'm using exponentiated normals here just so that all of the coefficients are positive).
Ultimately, the goal is to maximize net revenue: `sum(b * log(x) - x)`.
You're absolutely right about the derivative of each element. Interestingly, when I supply the gradient you derived to L-BFGS-B, I get an `abnormal termination in lnsrch` error.
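For what it's worth, the two gradients differ only by a constant factor, which may be why L-BFGS still crawled toward the optimum rather than diverging; a small sketch (assuming the same setup as the original example):

```r
# The originally supplied gradient is exactly n times the per-element
# derivative: n - (b * n) / x == n * (1 - b / x). A gradient scaled by a
# constant still points the right way, but it distorts the line search.
set.seed(1234)
n <- 300
x <- exp(rnorm(n))
b <- exp(rnorm(n))

supplied <- n - (b * n) / x
correct  <- 1 - b / x

max(abs(supplied - n * correct))  # zero up to floating-point rounding
```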
So your spends are the `x[i]`, and the problem is: maximize `sum(b * log(x) - x)` where each `x[i] >= 1`.
My suggestion is, if you can, start with very few variables. Even better, start with a restricted set for which you can analytically solve, if possible, and see if the algorithm converges. Only then scale up.
Also, given that I understand the problem a bit better: if you are trying to maximize `sum(b * log(x) - x)`, note that the terms are separable, so each element can be optimized on its own and the problem has a closed-form solution to check against.
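Following that suggestion, here is a hedged sketch of the analytically solvable case: setting the per-element derivative `b/x - 1` to zero gives `x = b`, so with the lower bound of 1 the constrained optimum is `pmax(b, 1)`. (The original poster reported an `lnsrch` error with the corrected gradient at `n = 300`, so a small instance like this is worth trying first.)

```r
# Closed-form check: maximizing b*log(x) - x elementwise gives x = b;
# with the lower bound of 1, the constrained optimum is pmax(b, 1).
set.seed(1234)
n <- 10
b <- exp(rnorm(n))

f_obj  <- function(x, b) -sum(b * log(x) - x)
f_grad <- function(x, b) 1 - b / x  # corrected per-element gradient

res <- optim(rep(2, n), f_obj, gr = f_grad, method = "L-BFGS-B",
             b = b, lower = rep(1, n))

analytic <- pmax(b, 1)
max(abs(res$par - analytic))  # should be very small
```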
Thanks for all of the suggestions. I just had a new baby, so it will be quite a while before I'm able to follow up here on this again.
Based on where we're landing, this isn't truly an issue with nloptr and is more an issue with problem formulation, so I'm closing the issue so it doesn't remain open for too long. I'll follow back up if I'm still encountering issues after I get back to the project.