tpapp / LogDensityProblemsAD.jl

AD backends for LogDensityProblems.jl.

AD with ForwardDiff is not thread-safe

sethaxen opened this issue

When using multiple threads, the gradient computations disagree with the gradients one gets when using a single thread. Here's an MWE:

using ForwardDiff, LinearAlgebra, LogDensityProblems, LogDensityProblemsAD

struct LogDensityFunction{F}
    logp::F
    dim::Int
end
function LogDensityProblems.capabilities(::Type{<:LogDensityFunction})
    return LogDensityProblems.LogDensityOrder{0}()
end
LogDensityProblems.dimension(ℓ::LogDensityFunction) = ℓ.dim
LogDensityProblems.logdensity(ℓ::LogDensityFunction, x) = ℓ.logp(x)

dim = 10
n = 10^6
P = LogDensityFunction(x -> -sum(abs2, x) / 2, dim)
ℓ = ADgradient(:ForwardDiff, P)
x = randn(dim, n)

results_single = zeros(dim, n)
results_threaded = zeros(dim, n)

for j in 1:n
    results_single[:, j] = LogDensityProblems.logdensity_and_gradient(ℓ, x[:, j])[2]
end

Threads.@threads for j in 1:n
    results_threaded[:, j] = LogDensityProblems.logdensity_and_gradient(ℓ, x[:, j])[2]
end

On my machine (using 4 threads), I get the following result:

julia> norm(results_single .- results_threaded)
2588.220046954836

This appears to be the same issue as JuliaDiff/ForwardDiff.jl#573, where the signature ForwardDiff.gradient(f, x, cfg), which is used here, is shown not to be thread-safe. By comparison, ReverseDiff seems to be fine.

I suspect the issue is with a shared GradientConfig. AbstractDifferentiation creates a GradientConfig inside its gradient function and I don't think it suffers from this issue: https://github.com/JuliaDiff/AbstractDifferentiation.jl/blob/eb5d913b9e4cbd31465af4ee2a75acb7f69ded91/src/forwarddiff.jl#LL47C57-L47C57
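For reference, here is a minimal sketch of the per-call-config pattern AbstractDifferentiation uses, reusing x, n, and results_threaded from the MWE above; gradient_fresh_config is just an illustrative helper, not part of either package:

using ForwardDiff

function gradient_fresh_config(f, x)
    # build fresh work buffers for this call only, so nothing is shared between threads
    cfg = ForwardDiff.GradientConfig(f, x)
    return ForwardDiff.gradient(f, x, cfg)
end

# safe to run concurrently, at the cost of allocating a new config on every call
Threads.@threads for j in 1:n
    results_threaded[:, j] = gradient_fresh_config(x -> -sum(abs2, x) / 2, x[:, j])
end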

IMO that seems to be a ForwardDiff issue? At least it seems somewhat surprising that gradient evaluation mutates the gradient config, but that seems to be what happens here: https://github.com/JuliaDiff/ForwardDiff.jl/blob/4b143a199f541e7c1248dc5bc29b397be938e81d/src/apiutils.jl#L34-L46

Regarding AbstractDifferentiation, if I remember correctly the only reason it creates a config at all is to set the chunk size; otherwise it would not be needed.

At least it seems somewhat surprising that gradient evaluation mutates the gradient config, but that seems to be what happens here: https://github.com/JuliaDiff/ForwardDiff.jl/blob/4b143a199f541e7c1248dc5bc29b397be938e81d/src/apiutils.jl#L34-L46

This is the documented behavior. It seems the main purpose of the config is to allocate work buffers: https://juliadiff.org/ForwardDiff.jl/stable/user/api/#Preallocating/Configuring-Work-Buffers. See also this comment: JuliaDiff/ForwardDiff.jl#573 (comment)

I see. Then we should probably add a note to LogDensityProblemsAD as well that it is not thread-safe.

Since it is much more efficient to preallocate a config, as stated in the ForwardDiff docs:

ForwardDiff's basic API methods will allocate these types automatically by default, but you can drastically reduce memory usage if you preallocate them yourself.

I think one generally wants to create and reuse the config object, also within LogDensityProblemsAD.
The lack of thread safety seems to be a design decision (issue?) of ForwardDiff, so users should be aware of it; but making it inefficient for everyone, as in the AbstractDifferentiation code, seems undesirable to me, especially since here, unlike in AbstractDifferentiation, we know both the function and the size of the input.
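To illustrate the trade-off, here is a sketch of the preallocated-config pattern; grad_reusing_config is an illustrative helper only, but it corresponds to reusing a single GradientConfig across calls, which is fast in serial and racy when called from several threads at once:

using ForwardDiff

f = x -> -sum(abs2, x) / 2
x0 = randn(10)
cfg = ForwardDiff.GradientConfig(f, x0)  # work buffers allocated once, up front

grad_reusing_config(x) = ForwardDiff.gradient(f, x, cfg)  # every call reuses cfg's buffers

grad_reusing_config(randn(10))  # fine in serial; unsafe if called concurrently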

I can think of the following (not mutually exclusive) solutions:

  1. warn users about this and suggest that they create a separate AD'd object for each thread; that only covers simple use cases like MCMC (which, however, may be all that 99% of our users need),
  2. provide an option without preallocated buffers, slower but safe

Or perhaps allocate a buffer for each thread, and have the call write to it based on Threads.threadid(), but I have to think through the invariants.

Or perhaps allocate a buffer for each thread, and have the call write to it based on Threads.threadid(), but I have to think through the invariants.

That is no longer guaranteed to work in newer Julia versions; in general it is not safe to use Threads.threadid() for such per-thread buffers. See e.g. https://juliafolds.github.io/FLoops.jl/stable/explanation/faq/#faq-state-threadid and the discussion in https://discourse.julialang.org/t/behavior-of-threads-threads-for-loop/76042?u=devmotion.
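For concreteness, here is a sketch of the pattern the linked FAQ recommends instead: split the work into chunks and give each spawned task (rather than each thread id) its own config. All names below are illustrative and independent of this package:

using ForwardDiff

f = x -> -sum(abs2, x) / 2
xs = [randn(10) for _ in 1:10^4]
results = Vector{Vector{Float64}}(undef, length(xs))

chunks = Iterators.partition(eachindex(xs), cld(length(xs), Threads.nthreads()))
tasks = map(chunks) do idxs
    Threads.@spawn begin
        cfg = ForwardDiff.GradientConfig(f, xs[first(idxs)])  # one private config per task
        for j in idxs
            results[j] = ForwardDiff.gradient(f, xs[j], cfg)
        end
    end
end
foreach(wait, tasks)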

provide an option without preallocated buffers, slower but safe

Maybe the easiest option here would be to support passing gradientconfig = nothing to ADgradient, and handle it by constructing a new config object (so that the chunk size can still be set) when calling logdensity_and_gradient.
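A hypothetical sketch of how that could look; the struct, its fields, and logdensity_and_gradient_sketch are illustrative only and not the package's actual internals (the log density is recomputed separately from the gradient for simplicity):

using ForwardDiff, LogDensityProblems

struct ForwardDiffGradient{L,K,C}
    ℓ::L       # the underlying log density problem
    chunk::K   # a ForwardDiff.Chunk, kept so a per-call config can still honor it
    config::C  # a shared GradientConfig, or `nothing` for the safe (allocating) path
end

function logdensity_and_gradient_sketch(fg::ForwardDiffGradient, x)
    f = Base.Fix1(LogDensityProblems.logdensity, fg.ℓ)
    cfg = fg.config === nothing ?
        ForwardDiff.GradientConfig(f, x, fg.chunk) :  # per-call config: thread-safe, allocates
        fg.config                                     # shared config: fast, not thread-safe
    return f(x), ForwardDiff.gradient(f, x, cfg)
end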

I would suggest the following solution:

  1. make the default thread-safe, even if potentially inefficient, for all backends, and document this,
  2. offer the faster solution in ADgradient for the relevant backends, explicitly mentioning that it is not thread-safe,
  3. implement a method for Base.copy that allows independently usable copies where applicable (e.g. gradients created with the ForwardDiff backend), and is otherwise a no-op.

This would allow users to take care of threading as they wish: most MCMC applications would just create as many copies as they need and sample in threads.
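Assuming such a copy method existed, threaded use with the MWE above might look like the following sketch (the chunking and task handling are illustrative, not part of the package):

ℓs = [copy(ℓ) for _ in 1:Threads.nthreads()]  # ℓ, x, n, results_threaded as in the MWE above

chunks = Iterators.partition(1:n, cld(n, length(ℓs)))
tasks = map(enumerate(chunks)) do (i, idxs)
    Threads.@spawn begin
        ℓᵢ = ℓs[i]  # each task uses exactly one independent copy
        for j in idxs
            results_threaded[:, j] = LogDensityProblems.logdensity_and_gradient(ℓᵢ, x[:, j])[2]
        end
    end
end
foreach(wait, tasks)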

Incidentally, I do not understand how ReverseDiff is not affected. It uses the same kind of buffers.