JuliaMath / Interpolations.jl

Fast, continuous interpolation of discrete datasets in Julia

Home Page: http://juliamath.github.io/Interpolations.jl/

Documentation on GPU support

Sbozzolo opened this issue

I tried to use Interpolations.jl with CUDA and found myself deeply lost in the documentation, not knowing what to expect from the package. I found out about GPU support from GitHub issues (and PR #504), but that's pretty much all the information available. All the documentation I could find about GPU support is a short section in the "Developer documentation".

As a user, I would like to use interpolants from Interpolations.jl in my CUDA kernels. The naive attempt of not doing anything special leads to functions that do not compile:

        .T is of type Interpolations.ScaledInterpolation{Float64, 1, Interpolations.BSplineInterpolation{Float64, 1, Vector{Float64}, Interpolations.BSpline{Interpolations.Linear{Interpolations.Throw{Interpolations.OnGrid}}}, Tuple{Base.OneTo{Int64}}}, Interpolations.BSpline{Interpolations.Linear{Interpolations.Throw{Interpolations.OnGrid}}}, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}} which is not isbits.
          .itp is of type Interpolations.BSplineInterpolation{Float64, 1, Vector{Float64}, Interpolations.BSpline{Interpolations.Linear{Interpolations.Throw{Interpolations.OnGrid}}}, Tuple{Base.OneTo{Int64}}} which is not isbits.
            .coefs is of type Vector{Float64} which is not isbits.
        .u is of type Interpolations.ScaledInterpolation{Float64, 1, Interpolations.BSplineInterpolation{Float64, 1, Vector{Float64}, Interpolations.BSpline{Interpolations.Linear{Interpolations.Throw{Interpolations.OnGrid}}}, Tuple{Base.OneTo{Int64}}}, Interpolations.BSpline{Interpolations.Linear{Interpolations.Throw{Interpolations.OnGrid}}}, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}} which is not isbits.
          .itp is of type Interpolations.BSplineInterpolation{Float64, 1, Vector{Float64}, Interpolations.BSpline{Interpolations.Linear{Interpolations.Throw{Interpolations.OnGrid}}}, Tuple{Base.OneTo{Int64}}} which is not isbits.
            .coefs is of type Vector{Float64} which is not isbits.
        .q is of type Interpolations.ScaledInterpolation{Float64, 1, Interpolations.BSplineInterpolation{Float64, 1, Vector{Float64}, Interpolations.BSpline{Interpolations.Linear{Interpolations.Throw{Interpolations.OnGrid}}}, Tuple{Base.OneTo{Int64}}}, Interpolations.BSpline{Interpolations.Linear{Interpolations.Throw{Interpolations.OnGrid}}}, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}} which is not isbits.
          .itp is of type Interpolations.BSplineInterpolation{Float64, 1, Vector{Float64}, Interpolations.BSpline{Interpolations.Linear{Interpolations.Throw{Interpolations.OnGrid}}}, Tuple{Base.OneTo{Int64}}} which is not isbits.
            .coefs is of type Vector{Float64} which is not isbits.
        .P is of type Interpolations.ScaledInterpolation{Float64, 1, Interpolations.BSplineInterpolation{Float64, 1, Vector{Float64}, Interpolations.BSpline{Interpolations.Linear{Interpolations.Throw{Interpolations.OnGrid}}}, Tuple{Base.OneTo{Int64}}}, Interpolations.BSpline{Interpolations.Linear{Interpolations.Throw{Interpolations.OnGrid}}}, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}} which is not isbits.
          .itp is of type Interpolations.BSplineInterpolation{Float64, 1, Vector{Float64}, Interpolations.BSpline{Interpolations.Linear{Interpolations.Throw{Interpolations.OnGrid}}}, Tuple{Base.OneTo{Int64}}} which is not isbits.
            .coefs is of type Vector{Float64} which is not isbits.
        .c_co2 is of type Interpolations.ScaledInterpolation{Float64, 1, Interpolations.BSplineInterpolation{Float64, 1, Vector{Float64}, Interpolations.BSpline{Interpolations.Linear{Interpolations.Throw{Interpolations.OnGrid}}}, Tuple{Base.OneTo{Int64}}}, Interpolations.BSpline{Interpolations.Linear{Interpolations.Throw{Interpolations.OnGrid}}}, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}} which is not isbits.
          .itp is of type Interpolations.BSplineInterpolation{Float64, 1, Vector{Float64}, Interpolations.BSpline{Interpolations.Linear{Interpolations.Throw{Interpolations.OnGrid}}}, Tuple{Base.OneTo{Int64}}} which is not isbits.
            .coefs is of type Vector{Float64} which is not isbits.

I tried a bunch of things that didn't work, like changing constructors, or passing CuArrays to them.

Following the developer documentation, I managed to get a working function using adapt (which I found a little surprising, since I was expecting adapt to be needed only on the Interpolations.jl side).

Some of the functions (e.g., adapt(CuArray{Float64}, itp)) error out on printing (or, more specifically, they resort to scalar indexing on GPUs).

cuitp doesn't work on Vectors, or on scalars:

cuitp.(1:0.5:2 |> collect) #       .x is of type Vector{Float64} which is not isbits.
cuitp.(1:0.5:2 |> collect |> CuArray)  # This is fine
cuitp(1)  # Scalar indexing
cuitp.(Ref(1)) # This is fine

Along the way, I also found it unclear whether the higher-level constructors support GPUs or not.

It would be very useful to clearly specify what it means for Interpolations.jl to support GPUs.

I can try, but perhaps @N5N3 would like to contribute.

commented

I tried a bunch of things that didn't work, like changing constructors, or passing CuArrays to them.

A quick reply is that the first stage of interpolating, called prefiltering, is not GPU compatible.
So you must construct the interpolant object on the CPU side, then move the object to the GPU.
(I think the MWE clearly shows that? Perhaps we need to emphasize it.)
Once you finish that adapt, cuitp supports kernel programming.
Just imagine you are coding a kernel for CuArray spreading, i.e. B[idx] = A[I[idx]] where A, B, and I are all CuArrays; you can replace A[...] with cuitp(...) there.
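
For the record, a minimal sketch of that workflow as I understand it, assuming CUDA.jl and Adapt.jl are available; the sample data, the Linear() B-spline, and the names itp, cuitp, and eval_kernel! are illustrative, not part of the package API:

using Interpolations, CUDA, Adapt

# Prefiltering runs on the CPU, so construct the interpolant there first.
xs = 0.0:0.1:1.0
ys = sin.(xs)                          # ordinary Vector{Float64}
itp = scale(interpolate(ys, BSpline(Linear())), xs)

# Move the coefficient arrays to the GPU; the adapted object can be used in kernels.
cuitp = adapt(CuArray{Float64}, itp)

# Broadcasting over a CuArray, the GPU analogue of B[idx] = A[I[idx]]:
ts = CuArray(collect(0.0:0.05:1.0))
vals = cuitp.(ts)

# Or inside a hand-written kernel:
function eval_kernel!(B, f, ts)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(ts)
        @inbounds B[i] = f(ts[i])
    end
    return nothing
end

B = similar(ts)
@cuda threads=256 blocks=cld(length(ts), 256) eval_kernel!(B, cuitp, ts)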

cuitp doesn't work on Vectors

IIRC, there's no automatic Array/GPUArray mixed broadcasting in Julia, so cuitp.(1:0.5:2 |> collect) is expected to fail.
The solution might be adding a transforming layer before copyto!(::CuArray, ::Broadcasted), but that would have to be added in GPUArrays.jl, not here.

Some of the functions (e.g., adapt(CuArray{Float64}, itp)) error out on printing (or, more specifically, they resort to scalar indexing on GPUs).

The scalar-indexing warning happens mainly during display, so it should be OK.
I think it's better to let the user judge whether @allowscalar makes sense.
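
For completeness, the opt-in on the user side would be something like the following (a sketch; CUDA.@allowscalar is CUDA.jl's usual escape hatch, and cuitp is the adapted interpolant from above):

using CUDA
# Explicitly allow a one-off scalar evaluation on the host side.
val = CUDA.@allowscalar cuitp(1.5)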

Thank you!

I just wish for all of this to be clearly presented and documented.

(I think the MWE clearly shows that? perhaps we need to emphasize it.)

The MWE is in a section that claims to be targeted at developers (and I am assuming developers of Interpolations.jl), and offers no explanation. It allowed me to get a working interpolant, but shed no light on what was going on, what is supported, or how I am supposed to use the package.

Also, given that all one needs to do to obtain a cuitp at the end of the day is apply adapt, would it make sense to provide a package extension for CUDA that does that automatically in the constructors for the interpolators?

The constructor might just dispatch on the input array and construct something GPU-compatible when given a CuArray:

using Adapt, CUDA

# Hypothetical GPU-aware method of the constructor.
function interpolator(y::CuArray)
    y_cpu = Array(y)  # copy to host memory, since prefiltering runs on the CPU
    return Adapt.adapt(CuArray{eltype(y)}, interpolator(y_cpu))
end

This will also allow downstream packages to use Interpolations.jl without directly depending on Adapt.
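
If that route were taken, a rough sketch with Julia's package-extension mechanism might look like the module below, with CUDA declared as a weak dependency in Project.toml; the extension name and the gpu_interpolate helper are purely hypothetical, not existing Interpolations.jl code:

# ext/InterpolationsCUDAExt.jl (hypothetical)
module InterpolationsCUDAExt

using Interpolations, CUDA, Adapt

# Build the interpolant on the CPU (prefiltering is CPU-only), then return a GPU-adapted copy.
function gpu_interpolate(A::CuArray, args...)
    itp = interpolate(Array(A), args...)
    return adapt(CuArray{eltype(A)}, itp)
end

end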

commented

I think the docs could be improved by moving this material to a new section and adding more usage information.

As for adapt, I think it should be left as a soft blocker for users who want to create an itp from a CuArray.
It's definitely inefficient, as we would have to transfer the data twice, which should be avoided whenever possible.

Is there a way to make cuitp(t) work instead of cuitp.(Ref(t))?

My use case is this: I have time series that I use as boundary conditions for evolving a system forward in time. More specifically, I have several functions that evaluate 1D splines at a given t, and I would like to do this on a GPU.
I have several calls like bc_var = spline_var(t), but this doesn't work on a GPU because of scalar indexing. I would prefer not to change the code to use "fake" broadcasted expressions just to compute a collection of scalars at each time step.

commented

Theoretically, the Ref could be removed. cuitp.(1) should work as expected.

But this kind of usage would be inefficient anyway. Even if we supported cuitp(1) outside a GPU kernel, the scalar result would still need to be transferred back to the CPU through a GPUArray wrapper, so there's no difference compared with the broadcast solution. More importantly, the calculation cannot be parallelized, so it will always be slower than itp(1).

If you have many 1D splines to interpolate at each time step, a possible solution is to combine them into a 2D interpolation and mark the 2nd dimension as NoInterp; you can then get the result with cuitp.(1, 1:10).
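
A sketch of that suggestion, assuming 10 splines sampled on a shared time grid; the data and names here are illustrative:

using Interpolations, CUDA, Adapt

ts = 0.0:0.1:10.0                      # shared time grid
coefs = rand(length(ts), 10)           # column j holds the samples of spline j

# Interpolate along time only; the spline index is a plain, non-interpolated axis.
itp = interpolate(coefs, (BSpline(Linear()), NoInterp()))
sitp = scale(itp, ts, 1:10)
cuitp = adapt(CuArray{Float64}, sitp)

# Evaluate all 10 splines at time t in a single GPU broadcast.
t = 3.7
vals = cuitp.(t, CuArray(1:10))        # one value per spline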