dfdx / Espresso.jl

Expression transformation package

Code generation: what types of operations do we support?

dfdx opened this issue

Here's what comes to mind (a short snippet follows the list):

  1. Element-wise, e.g. x[i] + y[i].
  2. Broadcasting, e.g. exp(x[i]) or x[i] + y or x[i,j] + y[i].
  3. Matrix-by-matrix multiplication.
  4. Convolution.
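
For concreteness, here is roughly what those classes look like as Einstein-style index expressions; the exact notation this package expects may differ, so treat this as illustration only:

```julia
# Hypothetical index-notation forms of the four operation classes above
ew    = :(z[i] = x[i] + y[i])        # 1. element-wise
bcast = :(z[i,j] = x[i,j] + y[i])    # 2. broadcasting along a dimension
mm    = :(z[i,j] = x[i,k] * y[k,j])  # 3. matrix-by-matrix multiplication
# 4. convolution has no simple index-notation form; it needs a dedicated primitive
```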

Currently, the plan for CPU is to use (a sketch follows the list):

  1. Julia's broadcasting with operation fusion, e.g. exp.(x .+ y).
  2. The same.
  3. A_mul_B and friends.
  4. ???
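
A rough sketch of (1)-(3) on the CPU, assuming the pre-0.7 API mentioned above (A_mul_B! corresponds to LinearAlgebra.mul! on later Julia versions):

```julia
x = rand(1000); y = rand(1000)
out = similar(x)

# (1)-(2): one fused broadcast loop; no temporary is allocated for x .+ y
out .= exp.(x .+ y)

# (3): in-place matrix product via the old A_mul_B! API (mul! on Julia >= 0.7)
A = rand(100, 200); B = rand(200, 50)
C = zeros(100, 50)
A_mul_B!(C, A, B)   # C = A * B without allocating a new result array
```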

For GPU:

  1. Fuse: generate a single kernel for all sequential element-wise operations (a toy sketch follows this list).
  2. The same.
  3. CUBLAS?
  4. CUDNN?
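
A toy illustration of item (1): two sequential element-wise assignments fused into a single expression before any kernel is generated from it. This uses only Base; Espresso.jl's own rewriting utilities would handle this more generally, and subst here is just a made-up helper:

```julia
# Replace every occurrence of `from` in `ex` with `to`
subst(ex, from, to) = ex == from ? to :
    ex isa Expr ? Expr(ex.head, map(a -> subst(a, from, to), ex.args)...) : ex

step1 = :(t[i] = x[i] + y[i])
step2 = :(z[i] = exp(t[i]))

# Substitute step1's right-hand side for t[i] in step2
fused = subst(step2, :(t[i]), step1.args[2])
# => :(z[i] = exp(x[i] + y[i]))  -- one kernel instead of two
```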

It's also important to reuse memory by keeping buffers in a pool that persists between iterations.
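
A minimal sketch of such a pool, keyed by element type and size; the names below (BUFFER_POOL, acquire_buffer, release_buffer) are hypothetical, not Espresso.jl API:

```julia
# Buffers keyed by (eltype, size) survive across iterations, so repeated calls
# of the generated code don't reallocate
const BUFFER_POOL = Dict{Tuple{DataType, Tuple}, Vector{Any}}()

function acquire_buffer(T::DataType, sz::Tuple)
    bucket = get!(BUFFER_POOL, (T, sz), Any[])
    isempty(bucket) ? zeros(T, sz...) : pop!(bucket)
end

release_buffer(buf) = push!(get!(BUFFER_POOL, (eltype(buf), size(buf)), Any[]), buf)

# One "iteration" of the generated code: take a buffer, use it, give it back
buf = acquire_buffer(Float64, (1000,))
buf .= exp.(rand(1000))
release_buffer(buf)   # reused on the next iteration instead of reallocating
```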

GPUArrays uses CUDAnative for (1) and doesn't yet support (2); (3) would be cuBLAS (which is also already in GPUArrays), not cuDNN; (4) is definitely cuDNN.

Thanks for the comment. Yes, (3) is about cuBLAS, fixed in the text. As for (4), as far as I know, there's currently no actively developed wrapper for cuDNN.

All in all, after looking at the current state of CUDAnative and GPUArrays, I believe they will cover most of the code generation needed here, so I'm currently working on other optimizations (operation fusion, reusing buffers, etc.).

They'll cover codegen for elementwise and simple reduction functions on compact arrays, but elementwise and reduction functions on strided arrays/subtensors, as well as fusion between elementwise and reduction functions, are significantly more difficult to get right.

For instance, if you pass a struct with strides/sizes into a GPU kernel in order to do index calculations, you just made it dozens of times slower, because it's probably going to be copied into every thread's local memory; instead, you need to do as much indexing math as possible ahead of time with run-time compilation (i.e. generated functions).
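
A CPU-side illustration of that idea with a generated function (not actual CUDAnative code): the strides are carried in a type parameter, so the indexing arithmetic is emitted with literal constants rather than read from a struct at run time:

```julia
@generated function linear_index(i::Int, j::Int, ::Val{S}) where {S}
    s1, s2 = S                              # strides, known at compile time
    :( (i - 1) * $s1 + (j - 1) * $s2 + 1 )  # constants are baked into the code
end

A = rand(4, 5)
v = Val{strides(A)}()                # Val{(1, 4)} for a 4x5 column-major array
A[linear_index(2, 3, v)] == A[2, 3]  # true
```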

> For instance, if you pass a struct with strides/sizes into a GPU kernel in order to do index calculations ...

This is not an issue for this package, at least under the current vision. The idea is to make a restricted version of the language where all operations are differentiable (see XDiff.jl for details) and code-generation friendly. Indexing and subtensors (e.g. x[42] or x[1:5]) aren't on the list anyway, but I'd like to clarify the situation with broadcasting. Am I correct that none of the following will be supported anytime soon:

  1. exp.(x)
  2. x .+ c (where c is a scalar)
  3. X .+ y (where X = rand(3,4) and y = rand(3), for example)

I understand why (3) might be tricky, but (1) and (2) look easy, at least for custom GPU code. If that's not possible with CUDAnative, I can preprocess the input (e.g. fill(c, size(x)), as sketched below).
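
For reference, the three forms in question plus the scalar-expansion workaround, shown here with plain CPU arrays:

```julia
x = rand(5); c = 2.0
X = rand(3, 4); y = rand(3)

exp.(x)    # (1) unary elementwise broadcast
x .+ c     # (2) array-scalar broadcast
X .+ y     # (3) y is broadcast along the second dimension of X

# Workaround for (2) if scalar broadcast isn't available on the GPU:
# expand the scalar up front so the kernel only sees equal-sized arrays
x .+ fill(c, size(x))
```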

(1) is already supported; (2) is either supported or will be soon; and (3) is definitely on their roadmap. But the caveat is that, at least for now, all of them will only work if x, y, etc. are compact rather than strided. The only reason this potentially matters to you is that everything this package does is agnostic to the underlying storage modality of the tensor, but any generated code that actually touches memory wouldn't be.
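
To make the compact-vs-strided distinction concrete (plain CPU arrays; the same distinction applies on the GPU):

```julia
A = rand(4, 6)            # compact: one contiguous block, strides (1, 4)
B = view(A, 1:2:3, 2:4)   # strided view into the same memory, not contiguous

strides(A)   # (1, 4)
strides(B)   # (2, 4) -- generated code that assumes contiguous memory breaks here
```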

Thanks for the clarification!