Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9)
drewrobson opened this issue · comments
Describe the bug
Certain broadcast expressions that previously executed on the GPU (on Julia 1.9.3) and returned a CuArray are instead triggering scalar indexing warnings (on Julia 1.10.1) and returning an Array.
To reproduce
The Minimal Working Example (MWE) for this bug:
using CUDA
d_test = CUDA.ones(5)
getindex.(Ref(d_test), keys(d_test))
Expected behavior
Based on previous Julia versions, the MWE should produce a CuVector{Float32}:
5-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
1.0
1.0
1.0
1.0
1.0
Version info
Details on Julia:
Julia Version 1.10.1
Commit 7790d6f064* (2024-02-13 20:41 UTC)
Build Info:
Note: This is an unofficial build, please report bugs to the project
responsible for this build and not to the Julia project unless you can
reproduce the issue using official builds available at https://julialang.org/downloads
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 24 virtual cores)
Details on CUDA:
CUDA runtime 12.3, artifact installation
CUDA driver 12.2
NVIDIA driver 535.54.3
CUDA libraries:
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: 12.0.0+535.54.3
Julia packages:
- CUDA: 5.3.0
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.11.1+0
Toolchain:
- Julia: 1.10.1
- LLVM: 15.0.7
1 device:
0: NVIDIA GeForce RTX 4090 (sm_89, 19.250 GiB / 23.988 GiB available)
Additional context
On Julia 1.9.3, Base.broadcasted(getindex, Ref(d_test), keys(d_test))
yields a
Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(getindex), Tuple{Base.RefValue{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, LinearIndices{1, Tuple{Base.OneTo{Int64}}}}}
On Julia 1.10.1, the same expression yields a
Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(getindex), Tuple{Base.RefValue{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, LinearIndices{1, Tuple{Base.OneTo{Int64}}}}}
This change in behavior broke some more complicated broadcast expressions (the MWE was reduced from one of these). For now, I am working around the issue by specifying a CuArray destination, like this:
d_result .= getindex.(Ref(d_test), keys(d_test))
(but that means figuring out the output type and dimensions first, which adds a step during development/prototyping)
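For what it's worth, the destination in this workaround can be derived rather than spelled out by hand: `similar` copies the element type and shape from an existing CuArray, which avoids figuring them out manually (a sketch using the names from the MWE; it assumes the output matches d_test's eltype and shape):

```julia
using CUDA

d_test = CUDA.ones(5)

# Allocate a device-side destination with d_test's eltype and shape,
# then broadcast in place so the expression executes on the GPU.
d_result = similar(d_test)
d_result .= getindex.(Ref(d_test), keys(d_test))
```

This only helps when an existing array already has the right output type and dimensions, so it doesn't remove the extra step in general.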
Thanks!
This was a deliberate change; see JuliaGPU/GPUArrays.jl#510 for the rationale.
It's too bad this trips up your code, as I had hoped to sneak this in without having to tag a breaking release...
Thanks very much, that makes sense. I like the clarity of the capture approach: it's easier to see which arguments actually participate in broadcasting in a nontrivial way.
I'm updating my code, but in many cases all of the GPU-resident objects are now captures. The MWE is such a case: keys(d_test) is a host-side LinearIndices wrapping (Base.OneTo(5),), so the naive fix wouldn't work:
function test()
d_test = CUDA.ones(5)
broadcast(keys(d_test)) do idx
d_test[idx]
end
end
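One possible way to recover GPU execution in this situation (a sketch, not something suggested in the thread) is to materialize the indices on the device, so that at least one broadcast argument carries CuArrayStyle while d_test remains a capture:

```julia
using CUDA

function test_gpu()
    d_test = CUDA.ones(5)
    # Collect the host-side LinearIndices into an Array, then upload it.
    # Broadcasting over a CuArray of indices gives the whole expression
    # CuArrayStyle, and Ref(d_test) is adapted for device-side indexing.
    d_idx = CuArray(collect(keys(d_test)))
    getindex.(Ref(d_test), d_idx)
end
```

The extra allocation for the index array is the cost of this approach, which is why a more ergonomic opt-in would be nice.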
This leads to a question I've been wanting to ask anyway: certain lightweight objects like OneTo(1000000) seem equally happy broadcasting on the host or the GPU (which is, I think, why cu(OneTo(1000000)) doesn't "move" anything to the device). Is there a way to opt into GPU execution? For broadcast! we can write
d_result .= foo.(OneTo(1000000))
For broadcast, is there anything easier than manually constructing a Broadcasted{CuArrayStyle} object?
> For broadcast, is there anything easier than manually constructing a Broadcasted{CuArrayStyle} object?
I don't know of anything like that, but I agree it would be useful to override the broadcast style in a more ergonomic way. Maybe something to open an issue about upstream?
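For completeness, the manual construction the question refers to can be sketched roughly as follows. This is a hypothetical helper, not an official API; note that the concrete style type varies across CUDA.jl versions (recent versions carry a buffer type parameter, as in CuArrayStyle{N, Mem.DeviceBuffer}):

```julia
using CUDA

# Hypothetical helper: force a 1-D broadcast onto the GPU by tagging it
# with CUDA's broadcast style before materializing it with copy.
function gpu_broadcast(f, args...)
    style = CUDA.CuArrayStyle{1, CUDA.Mem.DeviceBuffer}()
    bc = Base.Broadcast.Broadcasted(style, f, args)
    copy(bc)  # dispatches to CUDA's copy method, returning a CuArray
end

# Lightweight host objects like OneTo are isbits, so they can be passed
# to the kernel directly without being moved to the device.
gpu_broadcast(x -> x + 1, Base.OneTo(1_000_000))
```

This is clearly more ceremony than `foo.(OneTo(1000000))`, which is the ergonomics gap an upstream issue could address.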