JuliaGPU / CUDA.jl

CUDA programming in Julia.

Home page: https://juliagpu.org/cuda/

Inverse Complex-to-Real FFT allocates GPU memory

navdeeprana opened this issue · comments

Describe the bug

Inverse Complex-to-Real FFT allocates GPU memory, whereas inverse Complex-to-Complex FFT does not.

To reproduce

A minimal working example (MWE) for this bug:

using AbstractFFTs, CUDA, LinearAlgebra
CUDA.allowscalar(false)

u = CuArray(rand(512,512))
uk = rfft(u)
pfor = plan_rfft(u)
pinv = plan_irfft(uk, 512)
mul!(u, pinv, uk)
println("Complex-to-Real")
CUDA.@time mul!(u, pinv, uk);

u = CuArray(rand(ComplexF64,512,512))
uk = fft(u)
pfor = plan_fft(u)
pinv = plan_ifft(uk)
mul!(u, pinv, uk)
println("Complex-to-Complex")
CUDA.@time mul!(u, pinv, uk);

Output:

Complex-to-Real
  0.000091 seconds (20 CPU allocations: 800 bytes) (1 GPU allocation: 2.008 MiB, 13.43% memmgmt time)
Complex-to-Complex
  0.000168 seconds (132 CPU allocations: 11.141 KiB)
Manifest.toml

CUDA v5.1.2
GPUCompiler v0.25.0
LLVM v6.4.2

Expected behavior

No GPU allocations for the inverse complex-to-real transform, matching the complex-to-complex case.

Version info

Details on Julia:

Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 48 × Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
  Threads: 2 on 48 virtual cores
Environment:
  JULIA_DEPOT_PATH = /data.lmp/nrana/.julia
  JULIA_NUM_THREADS = 1

Details on CUDA:

CUDA runtime 12.3, artifact installation
CUDA driver 12.3
NVIDIA driver 510.108.3, originally for CUDA 11.6

CUDA libraries: 
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: 11.0.0+510.108.3

Julia packages: 
- CUDA: 5.1.2
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.10.1+0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

4 devices:
  0: NVIDIA A100-PCIE-40GB (sm_80, 37.391 GiB / 40.000 GiB available)
  1: NVIDIA A100-PCIE-40GB (sm_80, 39.406 GiB / 40.000 GiB available)
  2: NVIDIA A100-PCIE-40GB (sm_80, 39.406 GiB / 40.000 GiB available)
  3: NVIDIA A100-PCIE-40GB (sm_80, 38.363 GiB / 40.000 GiB available)


Known and expected: this is a bug in cuFFT, and NVIDIA has since updated the documentation to indicate that complex-to-real transforms are expected to mutate their inputs. To keep `mul!` non-destructive on the input array, CUDA.jl has to take a copy of it before executing the plan; that copy is the GPU allocation seen above.
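
The copy-before-execute workaround can be sketched as follows. This is illustrative only, not CUDA.jl's actual internals: `exec_c2r!` is a hypothetical stand-in for the low-level cuFFT execution call, and the signature is an assumption.

```julia
using CUDA

# Sketch only: why applying an inverse C2R plan allocates on the GPU.
# cuFFT's documentation states that complex-to-real transforms may
# overwrite their input buffer, so the input must be copied first to
# keep the operation non-destructive on `x`.
function c2r_mul!(y::CuArray{<:Real}, plan, x::CuArray{<:Complex})
    x_tmp = copy(x)            # the extra GPU allocation reported above
    exec_c2r!(plan, x_tmp, y)  # hypothetical: run the C2R transform on the copy
    return y
end
```

The complex-to-complex path needs no such copy, which is why `CUDA.@time` reports zero GPU allocations for it.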