SciML / ExponentialUtilities.jl

Fast and differentiable implementations of matrix exponentials, Krylov exponential matrix-vector multiplications ("expmv"), KIOPS, ExpoKit functions, and more. All your exponential needs in SciML form.

Home Page: https://docs.sciml.ai/ExponentialUtilities/stable/

Performance of `phi!` on diagonal matrices

miromarszal opened this issue

Hi, I'm on Julia 1.8.2, trying to use exponential solvers on a moderately sized problem. When I construct the linear operator via `Diagonal`, I notice some performance issues. I reported this in the past in SciML/OrdinaryDiffEq.jl#1193. It's been a while, but I now have renewed interest in this, so I dug a little deeper and managed to localize the problem to this package.

This is the benchmark I used in the past, where the nonlinear part is zero so that I can test against the exact solution. The accuracy problems are gone, but performance issues remain.

using LinearAlgebra
using DifferentialEquations
using DiffEqOperators
using OrdinaryDiffEq
using ExponentialUtilities
using BenchmarkTools
using Profile


N = 256^2
Arr = rand(N)
A = DiffEqArrayOperator(Diagonal(Arr))
Amat = convert(AbstractMatrix, A)
f!(du, u, p, t) = nothing
t1 = 1.0
u0 = ones(eltype(Arr), N)
u1 = exp.(Arr*t1) .* u0  # Exact solution.
prob! = SplitODEProblem(A, f!, u0, t1, nothing)
def_kwargs = Dict(:save_everystep=>false, :save_start=>false, :dt=>0.05)

ϵ(u_sol) = sum(abs2.(u_sol - u1))  # Sum of squared errors, to check the accuracy.

function test(prob, alg; kwargs...)
    sol = solve(prob, alg; kwargs...)
    @show ϵ(sol[end])
    @benchmark solve($prob, $alg; $kwargs...)
end; 

Here's a comparison between LawsonEuler and ETDRK4.

test(prob!, LawsonEuler(); def_kwargs...)
ϵ(sol[end]) = 5.769346133408569e-13

BenchmarkTools.Trial: 573 samples with 1 evaluation.
 Range (min … max):  6.919 ms … 19.627 ms  ┊ GC (min … max): 0.00% … 61.09%
 Time  (median):     8.102 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.711 ms ±  2.148 ms  ┊ GC (mean ± σ):  5.39% ± 11.74%

    ▂▇██▇▅▄▄▂▂▁                                               
  ▄▆████████████▆▇▆▇▄▅▄▄▆▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▁▆█▅▆▄▄▄▁▄▄ ▇
  6.92 ms      Histogram: log(frequency) by time     18.6 ms <

 Memory estimate: 6.50 MiB, allocs estimate: 69.

test(prob!, ETDRK4(); def_kwargs...)
ϵ(sol[end]) = 1.8824825842372246e-13

BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  4.813 s …   4.831 s  ┊ GC (min … max): 0.33% … 0.37%
 Time  (median):     4.822 s              ┊ GC (median):    0.35%
 Time  (mean ± σ):   4.822 s ± 12.724 ms  ┊ GC (mean ± σ):  0.35% ± 0.03%

  █                                                       █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  4.81 s         Histogram: frequency by time        4.83 s <

 Memory estimate: 171.01 MiB, allocs estimate: 1310838.

So LawsonEuler wins by orders of magnitude in both memory allocations and execution time. I know ETDRK4 is a much more complex method, but I wouldn't expect such a large gap in performance.

I managed to identify part of the memory problem. It turns out that while the scalar `phi` and `exponential!` can use caches to save memory, these caches are reallocated for each matrix element. This can be fixed by creating the caches up front in `phi!`, so I monkey-patched these two functions:

function ExponentialUtilities.phi(z::T, k::Integer; cache = nothing,
             expmethod = ExponentialUtilities.ExpMethodHigham2005(), cache2 = nothing) where {T <: Number}
    # Construct the (k + 1) x (k + 1) matrix whose exponential holds the
    # phi functions in its first row, reusing a caller-provided cache.
    if cache === nothing
        cache = fill(zero(T), k + 1, k + 1)
    else
        fill!(cache, zero(T))
    end
    cache[1, 1] = z
    for i in 1:k
        cache[i, i + 1] = one(T)
    end
    if cache2 === nothing
        cache2 = ExponentialUtilities.alloc_mem(cache, expmethod)
    end
    P = exponential!(cache, expmethod, cache2)
    return P[1, :]
end

function ExponentialUtilities.phi!(out::Vector{Diagonal{T, V}}, A::Diagonal{T, V}, k::Integer;
              caches = nothing, expmethod = ExponentialUtilities.ExpMethodHigham2005()) where {T <: Number, V <: AbstractVector{T}}
    # Allocate the caches once and reuse them for every diagonal element.
    cache = fill(zero(T), k + 1, k + 1)
    cache2 = ExponentialUtilities.alloc_mem(cache, expmethod)
    for i in 1:size(A, 1)
        phiz = phi(A[i, i], k; cache = cache, expmethod = expmethod, cache2 = cache2)
        for j in 1:(k + 1)
            out[j][i, i] = phiz[j]
        end
    end
    return out
end

and got the following result:

test(prob!, ETDRK4(); def_kwargs...)
ϵ(sol[end]) = 1.8832000766134335e-13

BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  4.923 s …    5.065 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.994 s               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.994 s ± 100.383 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                        █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  4.92 s         Histogram: frequency by time         5.06 s <

 Memory estimate: 40.01 MiB, allocs estimate: 262278.

Memory usage went down noticeably, but the method is as slow as before. Running the profiler on `phi!`, I found that most of the time is spent on LU decompositions in `ldiv_for_generated!`, specifically this line:

F = lu!(A) # This allocation is unavoidable, due to the interface of LinearAlgebra
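
For reference, the patched `phi!` can be profiled or benchmarked in isolation, outside the ODE solve; a minimal sketch, where the 1024-element size and order k = 4 are arbitrary illustrative choices:

using LinearAlgebra, ExponentialUtilities, BenchmarkTools

D = Diagonal(rand(1024))
k = 4
out = [Diagonal(zeros(1024)) for _ in 1:(k + 1)]  # holds φ_0 … φ_k
@btime ExponentialUtilities.phi!($out, $D, $k);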

At this point I'm not familiar enough with the package to continue. Can anything be done to speed up this method? Perhaps the comment about the unavoidable LU is no longer true? Or have I run into the limits of exponential solvers, and they simply do not scale to problems of this size? Should I use another method? LawsonEuler, which works fine in this toy example, unfortunately isn't accurate enough on the real problem. I also tried ETDRK4(krylov=true), but it's only about 50% faster and seems to allocate slightly more memory, which may point to caching issues there as well.

Sorry for such a long post, I kinda want this to work :)

> At this point I'm not familiar enough with the package to continue. Can anything be done to speed up this method? Perhaps the comment about the unavoidable LU is no longer true?

You can use LinearSolve.jl here and see if there's a faster LU for your computer.
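
For example, a minimal sketch comparing LU variants and reusing a factorization across right-hand sides via LinearSolve.jl's caching interface. The 64×64 size here is an arbitrary stand-in, and RFLUFactorization may require RecursiveFactorization.jl to be loaded, depending on the LinearSolve.jl version:

using LinearAlgebra, LinearSolve, BenchmarkTools

n = 64
A = rand(n, n); b1 = rand(n); b2 = rand(n)
prob = LinearProblem(A, b1)

# Compare LU-based algorithms; the fastest one depends on the machine.
@btime solve($prob, LUFactorization());
@btime solve($prob, RFLUFactorization());

# Reuse a factorization across right-hand sides with the caching interface.
linsolve = init(prob, LUFactorization())
solve!(linsolve)
linsolve.b = b2  # new right-hand side; the existing factorization is reused
solve!(linsolve)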

Diagonal matrices could probably have a specialization that avoids most of this computation: for a diagonal operator everything is element-wise, so the φ functions can be computed entry by entry very cheaply; see the sketch below.
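
For example, a minimal sketch of such a specialization (hypothetical, not part of the package), using φ_0(z) = e^z and the scalar recurrence φ_j(z) = (φ_{j-1}(z) - 1/(j-1)!)/z, so that no per-element matrix exponential or LU is needed. Note that the recurrence suffers catastrophic cancellation for |z| near zero, so a robust version would need a truncated-series fallback there:

using LinearAlgebra

# Hypothetical element-wise phi! for Diagonal matrices; not the package API.
function phi_diagonal!(out::Vector{<:Diagonal}, A::Diagonal, k::Integer)
    for i in 1:size(A, 1)
        z = A.diag[i]
        phiz = exp(z)          # φ_0(z) = e^z
        out[1].diag[i] = phiz
        for j in 1:k
            # φ_j(z) = (φ_{j-1}(z) - 1/(j-1)!) / z; unstable for |z| ≈ 0.
            phiz = (phiz - 1 / factorial(j - 1)) / z
            out[j + 1].diag[i] = phiz
        end
    end
    return out
end

This replaces the (k + 1) × (k + 1) exponential plus LU per diagonal entry with O(k) scalar operations.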