JuliaStats / LogExpFunctions.jl

Julia package for various special functions based on `log` and `exp`.

`AbstractIrrational` does not play nice with CUDA

Red-Portal opened this issue

Hi, it seems that many of the functions are not compatible with CUDA.jl out of the box due to dynamic precision (?). Here's an MWE:

LogExpFunctions.log1mexp.(CuVector([-1f0, -2f0, -3f0]))
ERROR: InvalidIRError: compiling MethodInstance for (::GPUArrays.var"#broadcast_kernel#26")(::CUDA.CuKernelContext, ::CuDeviceVector{Float32, 1}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(log1mexp), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, ::Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to var"#setprecision#25"(kws::Base.Pairs{Symbol, V, Tuple{Vararg{Symbol, N}}, NamedTuple{names, T}} where {V, N, names, T<:Tuple{Vararg{Any, N}}}, ::typeof(setprecision), f::Function, ::Type{T}, prec::Integer) where T @ Base.MPFR mpfr.jl:969)
Stacktrace:
 [1] setprecision
   @ ./mpfr.jl:969
 [2] Type
   @ ./irrationals.jl:69
 [3] <
   @ ./irrationals.jl:96
 [4] log1mexp
   @ ~/.julia/packages/LogExpFunctions/jq98q/src/basicfuns.jl:234
 [5] _broadcast_getindex_evalf
   @ ./broadcast.jl:683
 [6] _broadcast_getindex
   @ ./broadcast.jl:656
 [7] getindex
   @ ./broadcast.jl:610
 [8] broadcast_kernel
   @ ~/.julia/packages/GPUArrays/5XhED/src/host/broadcast.jl:59
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erroneous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/validation.jl:149
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:415 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:414 [inlined]
  [5] emit_llvm(job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/utils.jl:89
  [6] emit_llvm
    @ ~/.julia/packages/GPUCompiler/YO8Uj/src/utils.jl:83 [inlined]
  [7] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:129
  [8] codegen
    @ ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:110 [inlined]
  [9] compile(target::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:106
 [10] compile
    @ ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:98 [inlined]
 [11] #1037
    @ ~/.julia/packages/CUDA/tVtYo/src/compiler/compilation.jl:104 [inlined]
 [12] JuliaContext(f::CUDA.var"#1037#1040"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/driver.jl:47
 [13] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/tVtYo/src/compiler/compilation.jl:103
 [14] actual_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/execution.jl:125
 [15] cached_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/YO8Uj/src/execution.jl:103
 [16] macro expansion
    @ ~/.julia/packages/CUDA/tVtYo/src/compiler/execution.jl:318 [inlined]
 [17] macro expansion
    @ ./lock.jl:267 [inlined]
 [18] cufunction(f::GPUArrays.var"#broadcast_kernel#26", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Float32, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(log1mexp), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/tVtYo/src/compiler/execution.jl:313
 [19] cufunction
    @ ~/.julia/packages/CUDA/tVtYo/src/compiler/execution.jl:310 [inlined]
 [20] macro expansion
    @ ~/.julia/packages/CUDA/tVtYo/src/compiler/execution.jl:104 [inlined]
 [21] #launch_heuristic#1080
    @ ~/.julia/packages/CUDA/tVtYo/src/gpuarrays.jl:17 [inlined]
 [22] launch_heuristic
    @ ~/.julia/packages/CUDA/tVtYo/src/gpuarrays.jl:15 [inlined]
 [23] _copyto!
    @ ~/.julia/packages/GPUArrays/5XhED/src/host/broadcast.jl:65 [inlined]
 [24] copyto!
    @ ~/.julia/packages/GPUArrays/5XhED/src/host/broadcast.jl:46 [inlined]
 [25] copy
    @ ~/.julia/packages/GPUArrays/5XhED/src/host/broadcast.jl:37 [inlined]
 [26] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(log1mexp), Tuple{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}})
    @ Base.Broadcast ./broadcast.jl:873
 [27] top-level scope
    @ REPL[22]:1
 [28] top-level scope
    @ ~/.julia/packages/CUDA/tVtYo/src/initialization.j

Simply changing the definition of log1mexp to the following fixes the issue:

log1mexp_cuda(x::T) where {T <: Real} = x < log(T(1)/2) ? log1p(-exp(x)) : log(-expm1(x))
julia> log1mexp_cuda.(CuVector([-1f0, -2f0, -3f0]))
3-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 -0.4586752
 -0.14541346
 -0.051069178

Do we really need IrrationalConstants here?

What exactly is the problem here? IrrationalConstants works in exactly the same way as the irrational constants in Base, so I wonder if the same problem can be provoked with e.g. pi instead of IrrationalConstants.loghalf. One advantage of these irrational constants is that they are precomputed for e.g. Float32 and Float64 but allow precise calculations also with other types and functions.
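
To illustrate that distinction, here is a small sketch (the printed values are only indicative):

using IrrationalConstants

# Conversion to Float32/Float64 returns a hard-coded, precomputed constant:
x32 = Float32(IrrationalConstants.loghalf)   # ≈ -0.6931472f0, no BigFloat involved
x64 = Float64(IrrationalConstants.loghalf)   # ≈ -0.6931471805599453

# Other targets are computed on demand at full precision via BigFloat/MPFR:
xbig = big(IrrationalConstants.loghalf)      # arbitrary-precision value of log(1/2)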

I'm very surprised that CUDA cares about the BigFloat methods if clearly only the Float32 constant is needed. Generally, I'm hesitant to remove IrrationalConstants since it is useful and used in Base and throughout the ecosystem, so it seems this problem should be fixed in a different way.

> it seems this problem should be fixed in a different way.

Let me try to summon the CUDA experts.

I spoke with Tim Besard; it seems there is no easy way to do this as long as BigFloat is involved. It's because some of the BigFloat conversions call the libmpfr CPU library, which CUDA can't support.

BigFloat should not be involved here - for irrationals in Base and IrrationalConstants, Float32(::MyIrrational) is explicitly defined and set to a constant precomputed value (the same for Float64).
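
Roughly speaking (a paraphrase, not the exact generated code), the @irrational machinery emits per-type constructors along these lines for each constant, with Loghalf standing in for whatever concrete type is actually defined:

struct Loghalf <: AbstractIrrational end

Base.Float64(::Loghalf) = -0.6931471805599453   # precomputed at definition time
Base.Float32(::Loghalf) = -0.6931472f0          # precomputed at definition time
Base.BigFloat(::Loghalf) = log(big(1) / 2)      # only this path needs MPFR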

I figured out what's going on: the fallback definitions of the comparison operators (https://github.com/JuliaLang/julia/blob/6e2e6d00258b930f5909d576f2b3510ffa49c4bf/base/irrationals.jl#L96 and surrounding lines) are based not on Float32(x) but on Float32(x, RoundDown), which, in contrast to Float32(x), is not defined as a constant but is computed dynamically via BigFloat (https://github.com/JuliaLang/julia/blob/6e2e6d00258b930f5909d576f2b3510ffa49c4bf/base/irrationals.jl#L68-L72).
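
For reference, the upstream fallbacks in base/irrationals.jl look roughly like this (paraphrased from the linked lines, not quoted verbatim; shown only to illustrate the code path, not something to redefine):

# Comparisons of a FloatXX with an AbstractIrrational use rounded conversions:
<(x::Float32, y::AbstractIrrational) = x <= Float32(y, RoundDown)
<(x::AbstractIrrational, y::Float32) = Float32(x, RoundUp) <= y

# ...and the generic rounded conversion is computed dynamically via BigFloat,
# which is the setprecision/MPFR call that CUDA cannot compile:
function (::Type{T})(x::AbstractIrrational, r::RoundingMode) where T<:Union{Float32,Float64}
    setprecision(BigFloat, 256) do
        T(BigFloat(x), r)
    end
end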

I wonder if we should extend the @irrational macros in Base and IrrationalConstants to define Float64(x, RoundDown/RoundUp) and Float32(x, RoundDown/RoundUp) explicitly as constants, to avoid these dynamic dispatches at least for the common case where the irrational is defined with the macro.
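
Concretely, the idea would be something along these lines (a hypothetical sketch, not an actual patch; written by hand for a single constant rather than generated by the macro):

using IrrationalConstants

# Hypothetical: precompute the rounded conversions once at definition time, so that
# comparisons never reach the BigFloat-based fallback at runtime.
const LOGHALF_F32_DOWN = Float32(big(IrrationalConstants.loghalf), RoundDown)
const LOGHALF_F32_UP   = Float32(big(IrrationalConstants.loghalf), RoundUp)

Base.Float32(::typeof(IrrationalConstants.loghalf), ::RoundingMode{:Down}) = LOGHALF_F32_DOWN
Base.Float32(::typeof(IrrationalConstants.loghalf), ::RoundingMode{:Up})   = LOGHALF_F32_UP
# (analogous definitions for Float64; the macro would generate these for every constant)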

As suspected, the error is not specific to IrrationalConstants. For instance,

julia> using CUDA, IrrationalConstants

julia> log1mexp_cuda(x::Real) = twoπ*exp(x) < π ? log1p(-exp(x)) : log(-expm1(x))
log1mexp_cuda (generic function with 1 method)

julia> log1mexp_cuda.(CuVector([-1f0, -2f0, -3f0]))
...

errors as well. I updated the title of the issue to reflect this.

Oh I see! I was scratching my head looking at Float32(x, RoundDown) and wondering what it should have been. Shouldn't this be handled upstream rather than overriding the behavior downstream? I think this issue might pop up in other places that depend on AbstractIrrational too.

Sure, it will be present in basically all code paths that involve comparisons of FloatXX with AbstractIrrationals.

The general issue still exists but should maybe be raised upstream. The case in the OP was fixed by #75.

Okay, then I'll close this for now. I'll raise this upstream some time.

One addition: I ran into the same problem starting with Julia 1.9 (1.8 works fine) and opened an issue against CUDA.jl that was later moved to the GPUCompiler project: JuliaGPU/GPUCompiler.jl#384
It seems that the underlying issue with irrationals is not easy to resolve, so thanks for the effort here!