dfdx / Espresso.jl

Expression transformation package

Code generation: what types of operations do we support?

dfdx opened this issue

Here's what comes to mind (a short snippet follows the list):

  1. Element-wise, e.g. x[i] + y[i].
  2. Broadcasting, e.g. exp(x[i]) or x[i] + y or x[i,j] + y[i].
  3. Matrix-by-matrix multiplication.
  4. Convolution.
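
For concreteness, here is roughly what those classes look like as Einstein-style index expressions; the exact notation this package expects may differ, so treat this as illustration only:

```julia
# Hypothetical index-notation forms of the four operation classes above
ew    = :(z[i] = x[i] + y[i])        # 1. element-wise
bcast = :(z[i,j] = x[i,j] + y[i])    # 2. broadcasting along a dimension
mm    = :(z[i,j] = x[i,k] * y[k,j])  # 3. matrix-by-matrix multiplication
# 4. convolution has no simple index-notation form; it needs a dedicated primitive
```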

Currently, the plan for CPU is to use (a sketch follows the list):

  1. Julia's broadcasting with operation fusion, e.g. exp.(x .+ y).
  2. The same.
  3. A_mul_B and friends.
  4. ???
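
A rough sketch of (1)-(3) on the CPU, assuming the pre-0.7 API mentioned above (A_mul_B! corresponds to LinearAlgebra.mul! on later Julia versions):

```julia
x = rand(1000); y = rand(1000)
out = similar(x)

# (1)-(2): one fused broadcast loop; no temporary is allocated for x .+ y
out .= exp.(x .+ y)

# (3): in-place matrix product via the old A_mul_B! API (mul! on Julia >= 0.7)
A = rand(100, 200); B = rand(200, 50)
C = zeros(100, 50)
A_mul_B!(C, A, B)   # C = A * B without allocating a new result array
```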

For GPU:

  1. Fuse: generate a single kernel for all sequential element-wise operations (a toy sketch follows this list).
  2. The same.
  3. CUBLAS?
  4. CUDNN?
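
A toy illustration of item (1): two sequential element-wise assignments fused into a single expression before any kernel is generated from it. This uses only Base; Espresso.jl's own rewriting utilities would handle this more generally, and subst here is just a made-up helper:

```julia
# Replace every occurrence of `from` in `ex` with `to`
subst(ex, from, to) = ex == from ? to :
    ex isa Expr ? Expr(ex.head, map(a -> subst(a, from, to), ex.args)...) : ex

step1 = :(t[i] = x[i] + y[i])
step2 = :(z[i] = exp(t[i]))

# Substitute step1's right-hand side for t[i] in step2
fused = subst(step2, :(t[i]), step1.args[2])
# => :(z[i] = exp(x[i] + y[i]))  -- one kernel instead of two
```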

It's also important to reuse memory by keeping buffers in a pool that persists between iterations.
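
A minimal sketch of such a pool, keyed by element type and size; the names below (BUFFER_POOL, acquire_buffer, release_buffer) are hypothetical, not Espresso.jl API:

```julia
# Buffers keyed by (eltype, size) survive across iterations, so repeated calls
# of the generated code don't reallocate
const BUFFER_POOL = Dict{Tuple{DataType, Tuple}, Vector{Any}}()

function acquire_buffer(T::DataType, sz::Tuple)
    bucket = get!(BUFFER_POOL, (T, sz), Any[])
    isempty(bucket) ? zeros(T, sz...) : pop!(bucket)
end

release_buffer(buf) = push!(get!(BUFFER_POOL, (eltype(buf), size(buf)), Any[]), buf)

# One "iteration" of the generated code: take a buffer, use it, give it back
buf = acquire_buffer(Float64, (1000,))
buf .= exp.(rand(1000))
release_buffer(buf)   # reused on the next iteration instead of reallocating
```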

GPUArrays uses CUDAnative for (1) and doesn't yet support (2); (3) would be cuBLAS (which is also already in GPUArrays), not cuDNN; (4) is definitely cuDNN.

Thanks for the comment. Yes, (3) is about cuBLAS, fixed in the text. As for (4), as far as I know, there's currently no actively developed wrapper for cuDNN.

All in all, after looking at the current state of CUDAnative and GPUArrays, I believe they will cover most of the code generation needed here, so I'm currently working on other optimizations (operation fusion, reusing buffers, etc.).

They'll cover codegen for elementwise and simple reduction functions on compact arrays, but elementwise and reduction functions on strided arrays/subtensors, as well as fusion between elementwise and reduction functions, are significantly more difficult to get right.

For instance, if you pass a struct with strides/sizes into a GPU kernel in order to do index calculations, you just made it dozens of times slower, because it's probably going to be copied into every thread's local memory; instead, you need to do as much indexing math as possible ahead of time with run-time compilation (i.e. generated functions).
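
A CPU-side illustration of that idea with a generated function (not actual CUDAnative code): the strides are carried in a type parameter, so the indexing arithmetic is emitted with literal constants rather than read from a struct at run time:

```julia
@generated function linear_index(i::Int, j::Int, ::Val{S}) where {S}
    s1, s2 = S                              # strides, known at compile time
    :( (i - 1) * $s1 + (j - 1) * $s2 + 1 )  # constants are baked into the code
end

A = rand(4, 5)
v = Val{strides(A)}()                # Val{(1, 4)} for a 4x5 column-major array
A[linear_index(2, 3, v)] == A[2, 3]  # true
```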

> For instance, if you pass a struct with strides/sizes into a GPU kernel in order to do index calculations ...

This is not an issue for this package, at least under the current vision. The idea is to make a restricted version of the language where all operations are differentiable (see XDiff.jl for details) and code-generation friendly. Indexing and subtensors (e.g. x[42] or x[1:5]) aren't on the list anyway, but I'd like to clarify the situation with broadcasting. Am I correct that none of the following will be supported anytime soon:

  1. exp.(x)
  2. x .+ c (where c is a scalar)
  3. X .+ y (where X = rand(3,4) and y = rand(3), for example)

I understand why (3) might be tricky, but (1) and (2) look easy, at least for custom GPU code. If that's not possible with CUDAnative, I can preprocess the input (e.g. fill(c, size(x)), as sketched below).
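
For reference, the three forms in question plus the scalar-expansion workaround, shown here with plain CPU arrays:

```julia
x = rand(5); c = 2.0
X = rand(3, 4); y = rand(3)

exp.(x)    # (1) unary elementwise broadcast
x .+ c     # (2) array-scalar broadcast
X .+ y     # (3) y is broadcast along the second dimension of X

# Workaround for (2) if scalar broadcast isn't available on the GPU:
# expand the scalar up front so the kernel only sees equal-sized arrays
x .+ fill(c, size(x))
```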

(1) is already supported; (2) is either supported or will be soon; and (3) is definitely on their roadmap. But the caveat is that, at least for now, all of them will only work if x, y, etc. are compact rather than strided. The only reason this potentially matters to you is that everything this package does is agnostic to the underlying storage modality of the tensor, but any generated code that actually touches memory wouldn't be.
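
To make the compact-vs-strided distinction concrete (plain CPU arrays; the same distinction applies on the GPU):

```julia
A = rand(4, 6)            # compact: one contiguous block, strides (1, 4)
B = view(A, 1:2:3, 2:4)   # strided view into the same memory, not contiguous

strides(A)   # (1, 4)
strides(B)   # (2, 4) -- generated code that assumes contiguous memory breaks here
```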

Thanks for the clarification!