cupy / cupy

NumPy & SciPy for GPU

Home Page: https://cupy.dev

Shared Work Areas for FFT Plans

sievers opened this issue

Description

We're writing code that repeatedly generates hundreds of large (~1 GB) arrays on the GPU, which we then FFT, process, and free. In general, the arrays have different sizes. If we run naively with the cupy fft routines, the run time is completely dominated by the work area creation when making the plans. If we try to create the plans in advance with cupy, then we rapidly run out of memory. As a workaround, we've written our own wrapper to cufft that lets us use a common work area for all our plans (via cufftSetAutoAllocation and cufftSetWorkArea). Is it possible to do the same thing using the cupy plan cache? I haven't been able to find anything in the documentation, or any suggestive-looking function calls.
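Roughly, the naive version looks like the sketch below (the sizes and loop are made up for illustration, but the access pattern is the same):

import cupy as cp

# Made-up sketch of the access pattern: many large arrays of *different*
# lengths, each FFT'd once and then freed. With varying sizes, every call
# pays the plan (and work-area) creation cost again.
lengths = [2**24 + 1000 * i for i in range(20)]   # hypothetical, distinct FFT lengths
for n in lengths:
    x = cp.random.standard_normal(n, dtype=cp.float32)
    xf = cp.fft.rfft(x)          # r2c transform
    x2 = cp.fft.irfft(xf, n)     # c2r transform
    del x, xf, x2                # arrays go back to the memory pool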

It would sure be nice to be able to use cupy and not keep a second almost-redundant library around. Using our own wrapper led to a factor of several speed-up in run time, so I think this would benefit anyone doing lots of large FFTs of varying sizes. I'd be happy to help out if it would be useful.

Additional Information

No response

If we run naively with the cupy fft routines, the run time is completely dominated by the work area creation when making the plans.

This is surprising because work area creation should be cheap, especially after you "warm up" the memory pool with a few runs. Could you share with us:

  • Do you use multiple CUDA streams?
  • Can you show a minimal reproducer for this slowness, or your intended usage (such as your own wrapper), if possible?

Anyway, a few words pending further context. CuPy's FFT plans are designed in a way that is reuse-friendly. The work area is always kept with the underlying cuFFT plan, so once created/retained, the plan object can be reused for the same problem size over and over. The plan cache was built around this assumption, so frankly your use case is orthogonal (and arguably unconventional) to the past reports that led to that assumption. However, it's a legitimate one that is not unheard of, e.g. I am aware of a signal processing use case, and with the recent expansion of cupyx.{scipy.}signal it might be worth revisiting.
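For instance, a minimal sketch with a single repeated problem size:

import cupy as cp

# Repeated FFTs of the *same* length reuse the cached plan: only the first
# call pays the plan-creation cost; later calls are cache hits.
x = cp.random.standard_normal(2**20, dtype=cp.float32)
for _ in range(10):
    cp.fft.rfft(x)

cache = cp.fft.config.get_plan_cache()
print(cache)   # the printout includes the hit/miss counters and the cached plans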

Thanks very much for your response. I've got an example nearly ready to go, but it's not quite a fair comparison yet, since I'm using pre-allocated buffers for the fft output areas in my code but not with cupy. So I don't know how much of the speedup to attribute to the plans vs. not allocating the output buffers. Is there a recommended way of specifying the output area with cp.fft.rfft/irfft?

Any memory allocation should be quick in CuPy, due to the internal use of a memory pool; we minimize the calls to cudaMalloc/cudaFree. The cuFFT plan creation is the most time-consuming step, IIRC.
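As a rough sketch of the pool behavior:

import cupy as cp

pool = cp.get_default_memory_pool()

x = cp.empty(2**28, dtype=cp.float32)   # ~1 GiB; the first allocation calls cudaMalloc
del x                                   # the block is returned to the pool, not to the driver
y = cp.empty(2**28, dtype=cp.float32)   # served from the pool; no cudaMalloc

print(pool.used_bytes(), pool.total_bytes())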

I've put up a repo with my wrapper to the cufft planning stuff, and a python script that compares that to naive cupy calls. The wrapped, pre-planned version is nearly twice as fast as the cupy version. Repo is at:
https://github.com/sievers/cupy_fft_tests

Sorry, forgot to answer this:

Do you use multiple CUDA streams?
In my tests, I've been using a single stream.

Thanks, @sievers, I can more or less reproduce the observation. Your 3rd for loop isn't really necessary -- CuPy's plan cache is on by default, so the 2nd loop (what you called "naive cupy") is already using it.

With some quick profiling, I see that nearly all of the time (>97%) is spent in plan creation (Plan1d()). After running the 2nd loop, you can do

cache = cp.fft.config.get_plan_cache()
print(cache)

and see that there's no cache hit. Every time you call CuPy's FFT functions, a plan is recreated due to your problem setup. The cause is:

  • ndet is your list of batch sizes
  • lens is your list of FFT lengths
  • you do an r2c and then a c2r transform

so the number of distinct combinations is len(ndet) x len(lens) x 2 = 360, which can be confirmed via your custom MultiPlan cache

plans=pycufft.MultiPlan(sizes)
print(f"{len(plans.plans)=}")

This large number of plans does not fit in CuPy's LRU cache, so old plans keep getting evicted to accommodate new ones, leading to no reuse.

Naively increasing CuPy's plan cache size to 360 would not work, because each of your transforms needs about 800 MiB of workspace; that would take far too much memory for 360 plans. I think your suggestion of sharing a common work area makes some sense. However, it would make it harder for CuPy to maintain stream order, as it is impossible for the plan cache to track which stream is being used for the FFT.
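For concreteness, the knobs involved are the cache's size/memsize limits (a sketch; the ~800 MiB figure is taken from your reproducer):

import cupy as cp

cache = cp.fft.config.get_plan_cache()

# The LRU cache can in principle be enlarged, but keeping 360 plans resident
# would mean keeping 360 * ~800 MiB ≈ 280 GiB of work areas alive:
cache.set_size(360)                      # cap on the number of cached plans
cache.set_memsize(360 * 800 * 1024**2)   # cap on their total workspace size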

(btw, before I forget: the current plan cache is already unsafe with multiple streams, due to the potential contention of using the same workspace + same plan on different streams. And we're now talking about using the same workspace + potentially different plans on potentially different streams, which adds one extra level of complexity.)

Before we jump into discussing a solution, could you please provide more info?

  • Are you doing online (live) or offline data processing?
  • Is it possible to pick the largest ndet and pad/retain/reuse input buffers of shape (ndet_max, len) with len from lens? This would cut the number of plans by a factor of 10 in your reproducer. The idea is that you reuse the same plan with the largest batch size, pad the input, and truncate the output in the padded (batch) dimension (see the sketch after this list).
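A hypothetical sketch of the padding idea (ndet_max and the sizes are made up):

import cupy as cp

ndet_max = 512            # made-up largest batch size
n = 2**20                 # one of the FFT lengths

buf = cp.zeros((ndet_max, n), dtype=cp.float32)   # reused, padded input buffer

def rfft_padded(data):
    """FFT a (ndet, n) block with ndet <= ndet_max, always reusing the same plan."""
    ndet = data.shape[0]
    buf[:ndet] = data
    buf[ndet:] = 0
    out = cp.fft.rfft(buf, axis=1)   # batch is always ndet_max -> one plan per length
    return out[:ndet]                # truncate the padded batch dimension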

Thanks very much for having a look.

Are you doing online (live) or offline data processing?

This is all offline (we process ~1 year of data at a time).

Is it possible to pick the largest ndet and pad/retain/reuse input buffers...

That's an interesting suggestion that we might be able to adopt. We have freedom in how we group our data onto nodes (10K chunks in a run is common, usually split across several compute nodes), but we have other pressures on how we organize things, so it wouldn't be trivial.

I've got another thought which I'm going to test out. Except for the work areas, the plans themselves are tiny, so there's usually no need to evict any of them. Instead of keeping the plans ready to go with their work areas, you could do something like have cupy malloc the work area when you call the fft, and/or have an fft version where you pass in the work area as an extra argument. I think that should be fast (I'll let you know when I have a working version), and it avoids a bunch of the thread/stream safety issues.

I tried out assigning the work area at run time, and couldn't tell a difference in speed. Since that worked, I tried one step further, which was to create the plans on the fly, and set the work area with cp.empty. For my size transforms, that only ran 0.5% slower than the pre-allocated setup. Honestly, if we had a flag for the plan cache that said "for arrays larger than X MB, compute the plan and use cupy to malloc the work area", we'd have close enough to ideal performance that we'd be happy.

One important change (which NVIDIA told me I should have made from the beginning, but I hadn't read the cufft documentation closely enough) was that I switched the plan allocation. I had been calling cufftPlan1d, then setting auto_allocate to zero and nuking the workspace, but that means you allocate and then free the work area. Instead, I switched to cufftCreate(plan), then called cufftSetAutoAllocation(plan,0) before calling cufftMakePlan1d (not cufftPlan1d). That avoids cuda ever allocating a work area, and it sped up plan creation by a factor of 50.
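In rough form, the call sequence now looks like this (a minimal ctypes sketch rather than my actual wrapper; error checking is omitted, the R2C enum value is hard-coded, and it assumes libcufft is loadable by name):

import ctypes
import cupy as cp

libcufft = ctypes.CDLL("libcufft.so")   # assumes the library is on the loader path
CUFFT_R2C = 0x2a                        # cuFFT transform-type enum value

plan = ctypes.c_int()
libcufft.cufftCreate(ctypes.byref(plan))
# Turn off cuFFT's own work-area allocation *before* making the plan, so no
# work area is ever allocated and then thrown away:
libcufft.cufftSetAutoAllocation(plan, 0)

n, batch = 2**24, 8                     # made-up problem size
work_size = ctypes.c_size_t()
libcufft.cufftMakePlan1d(plan, n, CUFFT_R2C, batch, ctypes.byref(work_size))

# Let cupy's memory pool provide the work area, then hand it to cuFFT:
work = cp.empty(work_size.value, dtype=cp.uint8)
libcufft.cufftSetWorkArea(plan, ctypes.c_void_p(work.data.ptr))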

Anyways, I've pushed the updated code, including the fully on-the-fly plan and work area allocation test. I didn't bother to destroy the plans after I used them, and with 1800 memory-leaked plans (without work areas) accumulated, the GPU was still running merrily along. Switching to run-time work area assignment for non-tiny arrays means you could probably afford a per-stream/thread plan cache and avoid pretty much all race conditions.

Let me know what you think, and happy to discuss more, but I think this is promising!

Since that worked, I tried one step further, which was to create the plans on the fly, and set the work area with cp.empty. For my size transforms, that only ran 0.5% slower than the pre-allocated setup.

I am not sure I follow this. Could you elaborate? This is precisely what CuPy currently does with your example (full of cache misses and no cache hit).

Instead, I switched to cufftCreate(plan), then called cufftSetAutoAllocation(plan,0) before calling cufftMakePlan1d (not cufftPlan1d).

We already do this in our Plan1d and other plan objects too.

I am not sure I follow this. Could you elaborate? This is precisely what CuPy currently does with your example (full of cache misses and no cache hit).

The penalty from a cache miss is not from generating the plan, it's from cuda allocating the work area for the plan when cuda creates it. If instead you have cuda create a plan without a work area, and use a cupy-allocated array for the work area, the penalty for a cache miss becomes tiny (shrinks by two orders of magnitude for me). For large arrays, it's so small that there's no benefit to even having a pre-computed plan, as long as cupy allocates the work area instead of cuda (thanks, BTW, for pointing out how fast the cupy malloc is. I hadn't appreciated that until you told me).

We already do this in our Plan1d and other plan objects too.

In practice, calling Plan1d does seem to allocate the work area, at least by default (see the new script test_cache_sizes.py in the same repo). If there is a way to call Plan1d so it doesn't create a work area, I couldn't find it in the documentation, and my guess of adding "work_area=None" as an argument didn't work. If there is a way to do this and you could point me to documentation and/or an example script, I'd be very grateful. That would solve everything for us.

The penalty from a cache miss is not from generating the plan, it's from cuda allocating the work area for the plan when cuda creates it. If instead you have cuda create a plan without a work area, and use a cupy-allocated array for the work area, the penalty for a cache miss becomes tiny (shrinks by two orders of magnitude for me).

This is not true. As I said, CuPy already turns off cuFFT's auto-allocation of the work area, and instead draws memory from CuPy's mempool. The time is all spent constructing the cuFFT plan via cufftMakePlan1d inside Plan1d(), which is known to be very costly.
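One way to see the split, if you're curious, is to time the plan construction separately from the transform using the public get_fft_plan helper (a sketch, not your script):

import time
import cupy as cp
from cupyx.scipy.fftpack import get_fft_plan

x = cp.random.standard_normal((8, 2**24), dtype=cp.float32)   # made-up problem size

t0 = time.perf_counter()
plan = get_fft_plan(x, axes=-1, value_type='R2C')   # cufftMakePlan* runs in here
t1 = time.perf_counter()

with plan:                                          # reuse the plan for the transform
    xf = cp.fft.rfft(x, axis=-1)
cp.cuda.Device().synchronize()
t2 = time.perf_counter()

print(f"plan creation: {t1 - t0:.3f} s, transform: {t2 - t1:.3f} s")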

For large arrays, it's so small that there's no benefit to even having a pre-computed plan,

This is again not true for the aforementioned reason.

In practice, calling Plan1d does seem to allocate the work area, at least by default

Yes, but the work area is allocated from CuPy's mempool, not by cuFFT or an explicit cudaMalloc, so it's fast, as you observed yourself.

If there is a way to call Plan1d so it doesn't create a work area, I couldn't find it in the documentation, and my guess of adding "work_area=None" as an argument didn't work. If there is a way to do this and you could point me to documentation and/or an example script, I'd be very grateful. That would solve everything for us.

Let me think of a proper way to get this (and other) issues addressed. I think what you said earlier:

Switching to run-time work area assignment for non-tiny arrays means you could probably afford a per-stream/thread plan cache and avoid pretty much all race conditions.

is probably a good idea to pursue. But I might be slow due to my day job.

The time is all spent constructing the cuFFT plan via cufftMakePlan1d inside Plan1d(), which is known to be very costly.

So, I'm very confused about what's going on now, at least with rfft/irfft. I know you said earlier, when I looped with cp.fft.rfft/irfft, that the time was all in the plan generation, but if I manually call Plan1d, then I get decent performance (it's still maybe 10% or so slower than when I call cuda directly, but it's at least close). What I don't understand is why rfft/irfft is so much slower than just calling Plan1d. I've got purely cupy code now that does the fft by creating a plan via Plan1d, creating an output array, calling the fft, then deleting the plan. rfft/irfft are 60% slower than that, so they must be doing something else.

When you said that rfft/irfft were spending all their time in Plan1d, the only way that made sense to me was if Plan1d was using cudaMalloc. Having tried Plan1d myself now, I see what you mean, but that still leaves a mystery (to me at least) about what's going on inside rfft/irfft. Anyway, for our purposes, we can go the Plan1d route.

is probably a good idea to pursue. But I might be slow due to my day job.

Thanks for all your time and feedback, I really appreciate it!