Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9)
drewrobson opened this issue · comments
Describe the bug
Certain broadcast expressions that previously executed on the GPU (on Julia 1.9.3) and returned a CuArray are instead triggering scalar indexing warnings (on Julia 1.10.1) and returning an Array.
To reproduce
The Minimal Working Example (MWE) for this bug:
using CUDA
d_test = CUDA.ones(5)
getindex.(Ref(d_test), keys(d_test))
Expected behavior
Based on previous Julia versions, the MWE should produce a CuVector{Float32}:
5-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
1.0
1.0
1.0
1.0
1.0
Version info
Details on Julia:
Julia Version 1.10.1
Commit 7790d6f064* (2024-02-13 20:41 UTC)
Build Info:
Note: This is an unofficial build, please report bugs to the project
responsible for this build and not to the Julia project unless you can
reproduce the issue using official builds available at https://julialang.org/downloads
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 24 virtual cores)
Details on CUDA:
CUDA runtime 12.3, artifact installation
CUDA driver 12.2
NVIDIA driver 535.54.3
CUDA libraries:
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: 12.0.0+535.54.3
Julia packages:
- CUDA: 5.3.0
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.11.1+0
Toolchain:
- Julia: 1.10.1
- LLVM: 15.0.7
1 device:
0: NVIDIA GeForce RTX 4090 (sm_89, 19.250 GiB / 23.988 GiB available)
Additional context
On Julia 1.9.3, Base.broadcasted(getindex, Ref(d_test), keys(d_test))
yields a
Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(getindex), Tuple{Base.RefValue{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, LinearIndices{1, Tuple{Base.OneTo{Int64}}}}}
On Julia 1.10.1, the same expression yields a
Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(getindex), Tuple{Base.RefValue{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, LinearIndices{1, Tuple{Base.OneTo{Int64}}}}}
This change in behavior broke some more complicated broadcast expressions (the MWE was reduced from one of these). For now, I am working around the issue by specifying a CuArray destination, like this:
d_result .= getindex.(Ref(d_test), keys(d_test))
(but that means figuring out the output type and dimensions first, which adds a step during development/prototyping)
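For what it's worth, the destination in this workaround can be derived rather than spelled out by hand: `similar` copies the element type and shape from an existing CuArray, which avoids figuring them out manually (a sketch using the names from the MWE; it assumes the output matches d_test's eltype and shape):

```julia
using CUDA

d_test = CUDA.ones(5)

# Allocate a device-side destination with d_test's eltype and shape,
# then broadcast in place so the expression executes on the GPU.
d_result = similar(d_test)
d_result .= getindex.(Ref(d_test), keys(d_test))
```

This only helps when an existing array already has the right output type and dimensions, so it doesn't remove the extra step in general.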
Thanks!
This was a deliberate change; see JuliaGPU/GPUArrays.jl#510 for the rationale.
It's too bad this trips up your code, as I had hoped to sneak this in without having to tag a breaking release...
Thanks very much, that makes sense. I like the clarity of the capture approach: it's easier to see which arguments actually participate in broadcasting in a nontrivial way.
I'm updating my code, but in many cases all of the GPU-resident objects are now captures. The MWE is such a case: keys(d_test) is a host-side LinearIndices wrapping (Base.OneTo(5),), so the naive fix wouldn't work:
function test()
d_test = CUDA.ones(5)
broadcast(keys(d_test)) do idx
d_test[idx]
end
end
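One possible way to recover GPU execution in this situation (a sketch, not something suggested in the thread) is to materialize the indices on the device, so that at least one broadcast argument carries CuArrayStyle while d_test remains a capture:

```julia
using CUDA

function test_gpu()
    d_test = CUDA.ones(5)
    # Collect the host-side LinearIndices into an Array, then upload it.
    # Broadcasting over a CuArray of indices gives the whole expression
    # CuArrayStyle, and Ref(d_test) is adapted for device-side indexing.
    d_idx = CuArray(collect(keys(d_test)))
    getindex.(Ref(d_test), d_idx)
end
```

The extra allocation for the index array is the cost of this approach, which is why a more ergonomic opt-in would be nice.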
This leads to a question I've been wanting to ask anyway: certain lightweight objects like OneTo(1000000) seem equally happy broadcasting on the host or the GPU (which is, I think, why cu(OneTo(1000000)) doesn't "move" anything to the device). Is there a way to opt into GPU execution? For broadcast! we can write
d_result .= foo.(OneTo(1000000))
For broadcast, is there anything easier than manually constructing a Broadcasted{CuArrayStyle} object?
> For broadcast, is there anything easier than manually constructing a Broadcasted{CuArrayStyle} object?
I don't know of anything like that, but I agree it would be useful to override the broadcast style in a more ergonomic way. Maybe something to open an issue about upstream?
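For completeness, the manual construction the question refers to can be sketched roughly as follows. This is a hypothetical helper, not an official API; note that the concrete style type varies across CUDA.jl versions (recent versions carry a buffer type parameter, as in CuArrayStyle{N, Mem.DeviceBuffer}):

```julia
using CUDA

# Hypothetical helper: force a 1-D broadcast onto the GPU by tagging it
# with CUDA's broadcast style before materializing it with copy.
function gpu_broadcast(f, args...)
    style = CUDA.CuArrayStyle{1, CUDA.Mem.DeviceBuffer}()
    bc = Base.Broadcast.Broadcasted(style, f, args)
    copy(bc)  # dispatches to CUDA's copy method, returning a CuArray
end

# Lightweight host objects like OneTo are isbits, so they can be passed
# to the kernel directly without being moved to the device.
gpu_broadcast(x -> x + 1, Base.OneTo(1_000_000))
```

This is clearly more ceremony than `foo.(OneTo(1000000))`, which is the ergonomics gap an upstream issue could address.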