JuliaGPU / CUDA.jl

CUDA programming in Julia.

Home page: https://juliagpu.org/cuda/

v5.3.0: regression in Zygote performance

AlexLewandowski opened this issue

Describe the bug

Performance degradation on CUDA#v5.3.0 when taking gradients using Flux/Zygote.

To reproduce

The Minimal Working Example (MWE) for this bug:

import BenchmarkTools: @btime
using Flux
using CUDA
import Flux.Zygote

# Small MLP and a batch of inputs, both moved to the GPU
m = Chain(Dense(10, 512), Dense(512, 512), Dense(512, 10)) |> Flux.gpu
xs = randn(Float32, (10, 256)) |> Flux.gpu

# Gradient of a scalar loss with respect to the model parameters
function get_grads(m, xs)
    gs = Zygote.gradient(m) do m_
        sum(m_(xs))
    end
end

@btime get_grads($m, $xs)

# On CUDA 5.2:
# julia> @btime get_grads($m, $xs)
#   216.330 μs (585 allocations: 26.28 KiB)

# On CUDA 5.3:
# julia> @btime get_grads($m, $xs)
#  1.270 ms (1022 allocations: 34.69 KiB)
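
Since GPU kernel launches are asynchronous, a synchronized timing may give a fuller picture of the GPU work; a minimal sketch reusing m and xs from the MWE above:

# CUDA.@sync blocks until all queued GPU work has finished, so the timing
# includes kernel execution rather than just kernel launch overhead.
@btime CUDA.@sync get_grads($m, $xs)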

Manifest file for CUDA v5.3.0: https://gist.github.com/AlexLewandowski/e1b62445fb814d2adf1a7b87ff7d6a3b

Manifest file for CUDA v5.2.0: https://gist.github.com/AlexLewandowski/91fe5e60893039c1c45e2a317d1d7714

Expected behavior

Performance should be unaffected by the CUDA.jl version upgrade.

Version info

Details on Julia:

Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
  Threads: 33 on 32 virtual cores

Details on CUDA#v5.3.0:

CUDA runtime 12.4, artifact installation
CUDA driver 12.2
NVIDIA driver 535.171.4

CUDA libraries: 
- CUBLAS: 12.4.5
- CURAND: 10.3.5
- CUFFT: 11.2.1
- CUSOLVER: 11.6.1
- CUSPARSE: 12.3.1
- CUPTI: 22.0.0
- NVML: 12.0.0+535.171.4

Julia packages: 
- CUDA: 5.3.0
- CUDA_Driver_jll: 0.8.1+0
- CUDA_Runtime_jll: 0.12.1+0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce GTX 1080 Ti (sm_61, 1.004 GiB / 11.000 GiB available)

Details on CUDA#v5.2.0:

CUDA runtime 12.3, artifact installation
CUDA driver 12.2
NVIDIA driver 535.171.4

CUDA libraries: 
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: 12.0.0+535.171.4

Julia packages: 
- CUDA: 5.2.0
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.11.1+0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce GTX 1080 Ti (sm_61, 3.067 GiB / 11.000 GiB available)

Additional context

I upgraded to v5.3.0 because I needed to take a gradient of a sorted CuArray with dims as a keyword. I'm not sure if it's the version upgrade itself or some combination of bad drivers, but I thought it was worth raising as an issue.

Thanks for the report. I can't reproduce this locally, or at least not to the extent you're seeing (only a 280->310us regression). That makes it much harder to pinpoint what exactly has slowed down. Since you see a much more pronounced slowdown, can you isolate this problem to either the CUDA.jl operation that has regressed, or the commit that did so?
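
One way to see where the extra time goes is CUDA.jl's integrated profiler; a minimal sketch reusing get_grads, m, and xs from the MWE above:

get_grads(m, xs)                 # warm up once so compilation is excluded
CUDA.@profile get_grads(m, xs)   # prints a summary of host/device time per API call and kernel

Comparing the profiler output between v5.2.0 and v5.3.0 should point at the kernel or API call that regressed.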

I took the time to bisect this because it's causing my model training to completely stall. The performance regression appears to have been introduced by #2290, but it also looks like #2327 (merged but not yet released) fixes it.

I have the same issue after the upgrade. Please let me know if you need any other information; I have attached a Pluto file.
[screenshot]

https://gist.github.com/pawbz/36a915406266df540187049c1e0720b4

@AlexLewandowski @pawbz Can you try the CUDA.jl master branch?

I have tried; no change, unfortunately. Thanks for the quick reply.
[screenshot]

Hey @pawbz, looking at your screenshot, I suspect your CUDA version did not update. Can you show the output of Pkg.status() in your notebook? Also restart the Pluto instance so that the correct version of CUDA gets loaded.

You might also want to do this in a temporary environment by adding Pkg.activate(temp=true) right after you import Pkg to avoid cluttering up your default environment.
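
For example, a sketch of that setup (the package list is assumed from the MWE above):

import Pkg
Pkg.activate(temp=true)               # temporary environment; keeps the default one clean
Pkg.add(name="CUDA", rev="master")    # CUDA.jl master branch
Pkg.add(["Flux", "BenchmarkTools"])   # the other packages the MWE needs
Pkg.status()                          # confirm which CUDA.jl revision ended up in the environment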

I just compared the original benchmark between v5.2.0 and current master:

@btime get_grads($m, $xs);
# v5.2.0:  230.077 μs (585 allocations: 26.28 KiB)
# master: 254.714 μs (889 allocations: 33.66 KiB)

The bulk of the regression is now gone. There remains a ~10% slowdown, consistent with @maleadt's result, along with increased allocations. Is that an expected impact of v5.3.0, or is it worth keeping the issue open?
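
For reference, a sketch of how the allocation counts can be inspected in more detail with BenchmarkTools, reusing the MWE's definitions:

import BenchmarkTools: @benchmark

# Full benchmark statistics; CUDA.@sync so GPU execution is included in the timing.
# The reported allocations are host-side only.
b = @benchmark CUDA.@sync get_grads($m, $xs)
display(b)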

Pkg.activate(temp=true)

Here are updated screenshots, restarting Pluto each time.
So basically, we get around 530 μs for both master and v5.2.0, and 1.2 ms for v5.3.0.
Thanks for the input earlier.

[screenshots]

Thanks for confirming. So this was fixed by #2327.

There remains a ~10% slowdown, consistent with @maleadt's result, along with increased allocations. Is that an expected impact of v5.3.0, or is it worth keeping the issue open?

Unexpected, but probably not worth keeping the issue open over. If you can isolate this to the operation that has regressed, please open a new issue.
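
A rough sketch of how the pieces could be benchmarked separately, reusing m, xs, and get_grads from the MWE above:

# Forward pass only
@btime CUDA.@sync $m($xs)

# Forward plus backward pass (the full gradient)
@btime CUDA.@sync get_grads($m, $xs)

# A single layer's matrix multiply, in case the difference is in the CUBLAS path
W = m[1].weight
@btime CUDA.@sync $W * $xs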