odow / SDDP.jl

Stochastic Dual Dynamic Programming in Julia

Home Page: https://sddp.dev

Gurobi out of memory error on HPC

pauleseifert opened this issue · comments

Hi!

I run into the following problem when trying to solve my model on an HPC with parallelisation. The error appears on the HPC only; on my PC it runs fine, although it is constrained to too few iterations by the available working memory. The error is:

ERROR: LoadError: Gurobi Error 10001:
Stacktrace:
[1] _check_ret
@ ~/.julia/packages/Gurobi/EKa6j/src/MOI_wrapper/MOI_wrapper.jl:400 [inlined]
[2] Gurobi.Env(; output_flag::Int64, memory_limit::Nothing, started::Bool)
@ Gurobi ~/.julia/packages/Gurobi/EKa6j/src/MOI_wrapper/MOI_wrapper.jl:110
[3] Env
@ ~/.julia/packages/Gurobi/EKa6j/src/MOI_wrapper/MOI_wrapper.jl:102 [inlined]
[4] Gurobi.Optimizer(env::Nothing; enable_interrupts::Bool)
@ Gurobi ~/.julia/packages/Gurobi/EKa6j/src/MOI_wrapper/MOI_wrapper.jl:331
[5] Optimizer
@ ~/.julia/packages/Gurobi/EKa6j/src/MOI_wrapper/MOI_wrapper.jl:325 [inlined]
[6] Gurobi.Optimizer()
@ Gurobi ~/.julia/packages/Gurobi/EKa6j/src/MOI_wrapper/MOI_wrapper.jl:325
[7] _instantiate_and_check(optimizer_constructor::Any)
@ MathOptInterface ~/.julia/packages/MathOptInterface/864xP/src/instantiate.jl:94
[8] instantiate(optimizer_constructor::Any; with_bridge_type::Type{Float64}, with_cache_type::Nothing)
@ MathOptInterface ~/.julia/packages/MathOptInterface/864xP/src/instantiate.jl:175
[9] set_optimizer(model::Model, optimizer_constructor::Any; add_bridges::Bool)
@ JuMP ~/.julia/packages/JuMP/H2SWp/src/optimizer_interface.jl:361
[10] set_optimizer
@ ~/.julia/packages/JuMP/H2SWp/src/optimizer_interface.jl:354 [inlined]
[11] _initialize_solver(node::SDDP.Node{Tuple{Int64, Int64}}; throw_error::Bool)
@ SDDP ~/.julia/packages/SDDP/PZElX/src/algorithm.jl:325
[12] _initialize_solver
@ ~/.julia/packages/SDDP/PZElX/src/algorithm.jl:308 [inlined]
[13] _initialize_solver(model::SDDP.PolicyGraph{Tuple{Int64, Int64}}; throw_error::Bool)
@ SDDP ~/.julia/packages/SDDP/PZElX/src/algorithm.jl:343
[14] _initialize_solver
@ ~/.julia/packages/SDDP/PZElX/src/algorithm.jl:341 [inlined]
[15] master_loop(async::SDDP.Asynchronous, model::SDDP.PolicyGraph{Tuple{Int64, Int64}}, options::SDDP.Options{Tuple{Int64, Int64}})
@ SDDP ~/.julia/packages/SDDP/PZElX/src/plugins/parallel_schemes.jl:238
[16] train(model::SDDP.PolicyGraph{Tuple{Int64, Int64}}; iteration_limit::Int64, time_limit::Nothing, print_level::Int64, log_file::String, log_frequency::Int64, log_every_seconds::Float64, run_numerical_stability_report::Bool, stopping_rules::Vector{SDDP.AbstractStoppingRule}, risk_measure::SDDP.Expectation, sampling_scheme::SDDP.InSampleMonteCarlo, cut_type::SDDP.CutType, cycle_discretization_delta::Float64, refine_at_similar_nodes::Bool, cut_deletion_minimum::Int64, backward_sampling_scheme::SDDP.CompleteSampler, dashboard::Bool, parallel_scheme::SDDP.Asynchronous, forward_pass::SDDP.DefaultForwardPass, forward_pass_resampling_probability::Nothing, add_to_existing_cuts::Bool, duality_handler::SDDP.ContinuousConicDuality, forward_pass_callback::SDDP.var"#97#104", post_iteration_callback::SDDP.var"#98#105")
@ SDDP ~/.julia/packages/SDDP/PZElX/src/algorithm.jl:1100
[17] top-level scope
@ ~/SDDP/Versjon_Paul_other_prices.jl:295
in expression starting at /home/paules/SDDP/Versjon_Paul_other_prices.jl:295
From worker 16: Set parameter TokenServer to value "10.1.1.1"
┌ Warning: Forcibly interrupting busy workers
│ exception = rmprocs: pids [15, 16] not terminated after 5.0 seconds.
└ @ Distributed /share/apps/Julia/1.9.2-linux-x86_64/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:1253
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed /share/apps/Julia/1.9.2-linux-x86_64/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:1049

The HPC runs on Rocks 7.0 and has 384 GB of RAM on the instance I'm running the code on.
Julia@1.9.2
Gurobi@10.0.2
SDDP@1.6.0

The problem persists across different Gurobi and Julia versions, and different machines with the same operating system throw the same error message. A serial version of the problem runs, but it takes very long to reach the iteration limit.
I call the training function with:

SDDP.train(
    model,
    iteration_limit = 3000,
    add_to_existing_cuts = true,
    parallel_scheme = SDDP.Asynchronous() do m::SDDP.PolicyGraph
        env = Gurobi.Env()
        # OutputFlag and LogToConsole are integer parameters, so use the
        # integer setter; add_to_existing_cuts is a keyword of SDDP.train.
        GRBsetintparam(env, "OutputFlag", 0)
        GRBsetintparam(env, "LogToConsole", 0)
        set_optimizer(m, () -> Gurobi.Optimizer(env))
    end,
)

Parameters are global and I load the additional workers at the beginning of the file:

using Distributed
Distributed.addprocs(5)
@everywhere using Gurobi
@everywhere using SDDP

Any ideas on how to fix this?

These things are very hard to debug. I don't have any real suggestions. Improving the parallel support is on my TODO list: #599.

  • Do you have the header from a serial solve?
  • Have you verified that you actually have access to all 384 GB of RAM? You didn't start a job with a limited resource capacity?
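
A quick way to check the second point from within Julia (not from the original thread; it only uses the standard Distributed and Sys functions, and is a sketch rather than a definitive diagnostic):

using Distributed

# Report total and free memory as seen by the main process and every worker.
@everywhere using Distributed
@everywhere report_memory() = (
    id = myid(),
    total_GB = Sys.total_memory() / 2^30,
    free_GB = Sys.free_memory() / 2^30,
)
for id in procs()
    println(remotecall_fetch(report_memory, id))
end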

I even have the header from the Asynchronous solve.

-------------------------------------------------------------------
         SDDP.jl (c) Oscar Dowson and contributors, 2017-23
-------------------------------------------------------------------
problem
nodes           : 2521
state variables : 75
scenarios       : 1.19768e+29
existing cuts   : false
options
solver          : Asynchronous mode with 5 workers.
risk measure    : SDDP.Expectation()
sampling scheme : SDDP.InSampleMonteCarlo
subproblem structure
VariableRef                             : [168, 168]
AffExpr in MOI.EqualTo{Float64}         : [13, 18]
AffExpr in MOI.LessThan{Float64}        : [44, 46]
VariableRef in MOI.EqualTo{Float64}     : [32, 77]
VariableRef in MOI.GreaterThan{Float64} : [21, 60]
VariableRef in MOI.LessThan{Float64}    : [17, 56]
numerical stability report
matrix range     [1e+00, 1e+00]
objective range  [1e+00, 1e+04]
bounds range     [1e+01, 2e+07]
rhs range        [5e+00, 1e+01]
WARNING: numerical stability issues detected
- bounds range contains large coefficients
Very large or small absolute values of coefficients
can cause numerical stability issues. Consider
reformulating the model.

The problem starts after the Gurobi instances have been initialised (I get quite a few licence messages that I cannot mute). It seems to be tied to Linux only. Do you know of any additional verbose options for Gurobi for troubleshooting? I tried to alter the Gurobi environment memory limit with GRBsetdblparam(env, "MemLimit", 15.0), but this didn't help.
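
For reference, the stack trace above shows that Gurobi.Env in this version of Gurobi.jl accepts output_flag and memory_limit keyword arguments, so the same settings can be applied when the environment is created. A minimal sketch (the values are illustrative only; Gurobi's MemLimit is measured in GB):

using Gurobi

# Silence solver output and apply a memory limit (in GB) to every model that
# is attached to this environment. The 1.0 GB value is a placeholder.
env = Gurobi.Env(output_flag = 0, memory_limit = 1.0)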

There is no batch scheduler installed and I should have full access to the server.
julia> Sys.free_memory() / 2^20
353859.84375

Happy to share more info if useful!

So the problem is this:

Your model has 2521 nodes, and you are running with five workers. Since SDDP.jl doesn't share models between workers, SDDP.jl is going to create 2521 * 5 = 12,605 Gurobi models. That gives you ~30 MB of memory for each model, and some of that is taken up by the Gurobi model itself and some by SDDP.jl-related data. So it's plausible that you really are running out of memory with this problem.
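
Spelling out that arithmetic (a back-of-the-envelope sketch, not output from the thread):

nodes   = 2521   # nodes in the policy graph, from the solve header above
workers = 5      # SDDP.Asynchronous() workers
ram_gb  = 384    # RAM on the HPC instance
models  = nodes * workers                # 12,605 Gurobi models in memory at once
mb_per_model = ram_gb * 1024 / models    # ~31 MB of RAM available per model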

Why do you have so many nodes in the graph? Is it a Markovian graph?

Yes, the problem is Markovian and has 42+1 stages in which consecutive decisions can be made. The whole point of the exercise is to try out something new that is larger than existing applications. However, it runs with the same parameters on my MBP with 32 GB of physical RAM and little to no caching. I don't expect macOS to do any magic in memory management.
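
For context, that node count is consistent with a single first-stage node followed by 42 stages of 60 Markov states each, since 1 + 42 * 60 = 2521. A hypothetical sketch of a graph with that shape (the uniform transition probabilities are placeholders, not the actual model):

using SDDP

transition_matrices = vcat(
    [ones(1, 1)],                          # root -> the single stage-1 node
    [fill(1 / 60, 1, 60)],                 # stage 1 -> 60 Markov states in stage 2
    [fill(1 / 60, 60, 60) for _ in 3:43],  # 60 -> 60 states for stages 3 to 43
)
graph = SDDP.MarkovianGraph(transition_matrices)  # 1 + 42 * 60 = 2521 nodes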

However, it runs with the same parameters

Including 5 parallel threads?

and little to no caching. I don't expect macOS to do any magic in memory management

If it works on your Mac and not on the HPC, then I don't know if this is easy for me to test or debug. Have you looked at actual RAM usage with top on your big machine when running?

Just let it run serially and wait a bit longer. The 60 Markov states don't slow it down too much, because the backward pass implements a trick that updates every node in the stage at each iteration.
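
A serial run just uses the default parallel scheme; a minimal sketch (SDDP.Serial() is the default in SDDP.jl and is only spelled out here for clarity):

SDDP.train(
    model;
    iteration_limit = 3000,
    parallel_scheme = SDDP.Serial(),  # the default scheme, shown explicitly
)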

Without a reproducible example, I don't know if there's much we can do here. I'm tempted to close this issue and mark it as a request for #599.

The problem was caused by the file system on the cluster. Moving to another drive solved the issue. Nothing wrong with your code :)