Performance of `phi!` on diagonal matrices
miromarszal opened this issue · comments
Hi, I'm on Julia 1.8.2 and I'm trying to use exponential solvers on a moderately-sized problem. When I construct the linear operator via Diagonal
, I notice some performance issues. I reported this in the past in SciML/OrdinaryDiffEq.jl#1193. It's been a while, but now I have some renewed interest, so I decided to dig a little deeper and managed to localize the problem here.
This is the benchmark I used in the past, where the nonlinear part is zero so that I can test against the exact solution. The accuracy problems are gone, but performance issues remain.
using LinearAlgebra
using DifferentialEquations
using DiffEqOperators
using OrdinaryDiffEq
using ExponentialUtilities
using BenchmarkTools
using Profile
N = 256^2
Arr = rand(N)
A = DiffEqArrayOperator(Diagonal(Arr))
Amat = convert(AbstractMatrix, A)
f!(du, u, p, t) = nothing
t1 = 1.0
u0 = ones(eltype(Arr), N)
u1 = exp.(Arr*t1) .* u0 # Exact solution.
prob! = SplitODEProblem(A, f!, u0, t1, nothing)
def_kwargs = Dict(:save_everystep=>false, :save_start=>false, :dt=>0.05)
ϵ(u_sol) = √sum(abs2.(u_sol - u1)) # RMS error to check the accuracy.
function test(prob, alg; kwargs...)
sol = solve(prob, alg; kwargs...)
@show ϵ(sol[end])
@benchmark solve($prob, $alg; $kwargs...)
end;
Here's a comparison between LawsonEuler
and ETDRK4
.
test(prob!, LawsonEuler(); def_kwargs...)
ϵ(sol[end]) = 5.769346133408569e-13
BenchmarkTools.Trial: 573 samples with 1 evaluation.
Range (min … max): 6.919 ms … 19.627 ms ┊ GC (min … max): 0.00% … 61.09%
Time (median): 8.102 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.711 ms ± 2.148 ms ┊ GC (mean ± σ): 5.39% ± 11.74%
▂▇██▇▅▄▄▂▂▁
▄▆████████████▆▇▆▇▄▅▄▄▆▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▁▆█▅▆▄▄▄▁▄▄ ▇
6.92 ms Histogram: log(frequency) by time 18.6 ms <
Memory estimate: 6.50 MiB, allocs estimate: 69.
test(prob!, ETDRK4(); def_kwargs...)
ϵ(sol[end]) = 1.8824825842372246e-13
BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min … max): 4.813 s … 4.831 s ┊ GC (min … max): 0.33% … 0.37%
Time (median): 4.822 s ┊ GC (median): 0.35%
Time (mean ± σ): 4.822 s ± 12.724 ms ┊ GC (mean ± σ): 0.35% ± 0.03%
█ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
4.81 s Histogram: frequency by time 4.83 s <
Memory estimate: 171.01 MiB, allocs estimate: 1310838.
So LawsonEuler
wins by orders of magnitude, both in memory allocations and time of execution. I know ETDRK4
is a much more complex method, but I wouldn't expect such a gap in performance.
I managed to identify a part of the problem with memory. It turns out that while scalar phi
and exponential!
can use caches to save memory, these caches are reallocated for each matrix element. This can be fixed by creating the caches already in phi!
. So I monkey-patched these two functions
function ExponentialUtilities.phi(z::T, k::Integer; cache = nothing,
expmethod = ExponentialUtilities.ExpMethodHigham2005(), cache2 = nothing) where {T <: Number}
# Construct the matrix
if cache == nothing
cache = fill(zero(T), k + 1, k + 1)
else
fill!(cache, zero(T))
end
cache[1, 1] = z
for i in 1:k
cache[i, i + 1] = one(T)
end
if cache2 == nothing
cache2 = ExponentialUtilities.alloc_mem(cache, expmethod)
end
P = exponential!(cache, expmethod, cache2)
return P[1, :]
end
function ExponentialUtilities.phi!(out::Vector{Diagonal{T, V}}, A::Diagonal{T, V}, k::Integer;
caches = nothing, expmethod = ExponentialUtilities.ExpMethodHigham2005()) where {T <: Number, V <: AbstractVector{T}}
cache = fill(zero(T), k + 1, k + 1)
cache2 = ExponentialUtilities.alloc_mem(cache, expmethod)
for i in 1:size(A, 1)
phiz = phi(A[i, i], k; cache = cache, expmethod = expmethod, cache2 = cache2)
for j in 1:(k + 1)
out[j][i, i] = phiz[j]
end
end
return out
end
and got the following result
test(prob!, ETDRK4(); def_kwargs...)
ϵ(sol[end]) = 1.8832000766134335e-13
BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min … max): 4.923 s … 5.065 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.994 s ┊ GC (median): 0.00%
Time (mean ± σ): 4.994 s ± 100.383 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
4.92 s Histogram: frequency by time 5.06 s <
Memory estimate: 40.01 MiB, allocs estimate: 262278.
Memory usage went down noticeably, but the method is as slow as before. Running the profiler on phi!
I found that most of the time is spent on LU decompositions in ldiv_for_generated!
, this line specifically:
At this point I'm not familiar enough with the package to continue. Can anything be done to speed up this method? Perhaps the comment about unavoidable LU is not true anymore? Or have I run into the limits of performance of exponential solvers and they simply do not scale to problems of this size? Should I use another method? LawsonEuler
, which seems to work fine in this simple toy example, unfortunately isn't accurate enough in the real problem. I also tried ETDRK4(krylov=true)
, but it's only 50% faster and seems to allocate slightly more memory, which maybe points to issues with caching there as well.
Sorry for such a long post, I kinda want this to work :)
At this point I'm not familiar enough with the package to continue. Can anything be done to speed up this method? Perhaps the comment about unavoidable LU is not true anymore?
You can use LinearSolve.jl here and see if there's a faster LU for your computer.
Diagonal matrices could probably have a specialization that avoid most of the computation since diagonal matrices should do all of this element-wise, so it could have a very quick specialization.