coreylowman / cudarc

Safe rust wrapper around CUDA toolkit


Are there any plans to streamline compiling kernel source code?

l3utterfly opened this issue

I've been using CUDA currently by writing a small C library exposing functions/structs/methods etc. that I need and linking statically to my Rust project. It is cumbersome, but has the advantage that I can use tools like "bindgen" to generate struct/function representations automatically, so I don't need to write struct definitions in two places.

Are there any plans to streamline compiling the kernel code so perhaps structs are automatically generated on the rust side, or the CUDA side?

just to clarify, do you mean the kernel structs?

For example, in the device_repr example, you have structs that the kernel needs as input. Those are defined twice: once in the kernel code and once in Rust. It would be nice if we could somehow define the struct in one place (either on the C or the Rust side) and generate the other side.
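
To make the duplication concrete, here is roughly what that looks like today. This is only a sketch: the struct name and fields are made up, and it assumes the DeviceRepr trait from cudarc::driver, as used in the device_repr example (exact paths may differ between versions).

```rust
use cudarc::driver::DeviceRepr;

// CUDA side: the same layout also has to exist in the .cu source, e.g.
//
//   struct Params { float scale; int n; };
//   extern "C" __global__ void my_kernel(Params p, float *out) { /* ... */ }

// Rust side: #[repr(C)] keeps the field layout identical to the C definition.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
pub struct Params {
    pub scale: f32,
    pub n: i32,
}

// Safety: Params is #[repr(C)], Copy, and contains only plain scalar fields,
// so it can be passed by value as a kernel argument.
unsafe impl DeviceRepr for Params {}
```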

Additionally, if I'm not mistaken, there are currently two ways to read kernel code:

  1. read it at runtime (either by reading pre-generated PTX, or by reading the .cu code directly and compiling it)
  2. write the kernel code directly as a string in Rust, then compile it

(1) means the kernel code will have to be distributed along with the executable, which might not be desirable
(2) makes it hard to maintain: there's no IntelliSense when the code is a Rust const &str. Perhaps I could use the include_str! macro to include code written in .cu files, but that still feels really wonky.
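
For concreteness, a rough sketch of both routes, assuming cudarc's nvrtc and driver modules (compile_ptx, Ptx, CudaDevice::new, load_ptx; exact names may differ between versions, and the file, module, and kernel names below are placeholders):

```rust
use cudarc::driver::CudaDevice;
use cudarc::nvrtc::{compile_ptx, Ptx};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let dev = CudaDevice::new(0)?;

    // (1) Read at runtime: either load a pre-generated .ptx file...
    dev.load_ptx(Ptx::from_file("kernels/my_kernel.ptx"), "from_ptx", &["my_kernel"])?;

    // ...or read the .cu source and compile it with NVRTC at runtime.
    let cu_src = std::fs::read_to_string("kernels/my_kernel.cu")?;
    dev.load_ptx(compile_ptx(cu_src)?, "from_cu", &["my_kernel"])?;

    // (2) Embed the source in the binary as a string and compile it with NVRTC.
    const KERNEL_SRC: &str = include_str!("../kernels/my_kernel.cu");
    dev.load_ptx(compile_ptx(KERNEL_SRC)?, "embedded", &["my_kernel"])?;

    Ok(())
}
```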

Hmm, @coreylowman has a better overview of this crate than me lol. But this crate is something like a dfdx-sys crate, which just adds bindings to CUDA and cuDNN. Also, using include_str! is fine for smaller use cases, and you can have a look at how dfdx handles this.

Playing around with the library a bit more, I've settled on the following workflow for writing kernels, generating bindings, compiling CUDA, and last but not least, using this library to call the kernels.

  1. Write CUDA kernels in separate folder/project. They can be as complicated as you need them to be: multiple files, include headers, call functions etc.
  2. Create a wrapper.h including all the struct definitions used in your kernels
  3. In the Rust project, write a build script that does the following (a sketch of such a build.rs follows this list):
    a. Build the kernels (.cu files) by invoking the nvcc command with the -ptx flag. Output the resulting .ptx files to the build directory
    b. Use bindgen to generate the structs on the Rust side
    c. Use regex to replace any pointers (e.g. *mut f32) with CudaSlice<f32>
    d. Save the edits to the generated bindings.rs and output it to the build directory
  4. Include the bindings in your Rust project
  5. Compile PTX using the built-in functions in this crate (cudarc)
  6. After all this is set up, calling your kernels and allocating memory for your structs is all covered by this crate's examples
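
Here is a hedged sketch of what that build.rs can look like. The kernel path, wrapper.h location, and output names are placeholders, and it assumes bindgen is listed as a build-dependency; adjust paths and nvcc flags to your project layout.

```rust
// build.rs (sketch): compile kernels to PTX, generate bindings, post-process them.
use std::{env, fs, path::PathBuf, process::Command};

fn main() {
    let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());

    // (a) Build each .cu file to PTX with `nvcc -ptx` and put it in OUT_DIR.
    for kernel in ["kernels/scale.cu"] {
        println!("cargo:rerun-if-changed={kernel}");
        let stem = PathBuf::from(kernel).file_stem().unwrap().to_owned();
        let ptx_out = out_dir.join(stem).with_extension("ptx");
        let status = Command::new("nvcc")
            .args(["-ptx", kernel, "-o"])
            .arg(&ptx_out)
            .status()
            .expect("failed to run nvcc");
        assert!(status.success(), "nvcc failed for {kernel}");
    }

    // (b) Generate Rust structs from wrapper.h with bindgen.
    println!("cargo:rerun-if-changed=kernels/wrapper.h");
    let raw = out_dir.join("bindings_raw.rs");
    bindgen::Builder::default()
        .header("kernels/wrapper.h")
        .generate()
        .expect("bindgen failed")
        .write_to_file(&raw)
        .expect("failed to write raw bindings");

    // (c)/(d) Post-process pointer types (see the regex sketch further down)
    // and save the final bindings next to the PTX output.
    let bindings = fs::read_to_string(&raw).expect("failed to read raw bindings");
    fs::write(out_dir.join("bindings.rs"), bindings).expect("failed to write bindings.rs");
}
```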

Thinking about this a bit more, I can see most of this falls outside of the scope of this crate. Perhaps this can be included as an example for new users? This workflow took me quite a bit of time to figure out (since I'm new to CUDA).

Ngl, I was almost turned off of this crate because there was no straightforward way of compiling kernel code apart from pasting it into the Rust code as a string, which is a terrible way to include code in general.

I'm happy to write this up as an example if you feel this can help your crate users.

3. Use regex to replace any pointers (e.g. *mut f32) with CudaSlice<f32>

be careful if you use other pointer types

What others are there?

I was thinking just replacing *mut f32 -> CudaSlice<f32>, *mut i32 -> CudaSlice<i32> etc.

hmm, not sure, but you are definitely missing *const _ -> &CudaSlice<_>.
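
For what it's worth, here is a rough sketch of that replacement step using the regex crate, covering both cases and keeping the &mut/& distinction. The function name is made up, and whether a plain textual rewrite like this is enough depends on what bindgen actually emits for your headers.

```rust
use regex::Regex;

/// Rewrite the raw pointer types that bindgen emits into CudaSlice-based types:
///   *mut T   -> &mut CudaSlice<T>
///   *const T -> &CudaSlice<T>
fn rewrite_pointer_types(bindings: &str) -> String {
    let mut_ptr = Regex::new(r"\*mut\s+([A-Za-z_][A-Za-z0-9_]*)").unwrap();
    let const_ptr = Regex::new(r"\*const\s+([A-Za-z_][A-Za-z0-9_]*)").unwrap();

    let step1 = mut_ptr.replace_all(bindings, "&mut CudaSlice<$1>");
    const_ptr.replace_all(&step1, "&CudaSlice<$1>").into_owned()
}
```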

Hey, thanks for sharing this! Yeah, all the steps you listed are accurate as far as compiling kernels goes. The dfdx build script looks fairly similar to the steps you listed (although without the extra bindgen stuff).

I think including a complex build script would be a good addition; any more complex use case will need to figure out build scripts eventually.

Another way to do all this is to use nvcc in a build script, as you've done, and then include the compiled PTX source in the Rust code with const PTX_SRC: &str = include_str!(concat!(env!("OUT_DIR"), "/<name of file>.ptx"));. I detail this a bit more in this blog post: https://coreylowman.github.io/2023/04/07/cudarc-stack.html#at-compile-time-with-nvcc

Have you experimented with that at all?

I totally agree that streamlining this would be a huge win. It's very complicated at the moment.
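
For anyone following along, here is a minimal end-to-end sketch of that compile-time route. It assumes the build script above wrote scale.ptx to OUT_DIR and a CUDA kernel like extern "C" __global__ void scale_kernel(float *data, float scale, int n); the driver calls (CudaDevice::new, load_ptx, get_func, launch, htod_copy, dtoh_sync_copy) follow the crate's examples, but exact names may differ between versions.

```rust
use cudarc::driver::{CudaDevice, LaunchAsync, LaunchConfig};
use cudarc::nvrtc::Ptx;

// PTX produced by build.rs and embedded into the binary at compile time.
const PTX_SRC: &str = include_str!(concat!(env!("OUT_DIR"), "/scale.ptx"));

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let dev = CudaDevice::new(0)?;
    dev.load_ptx(Ptx::from_src(PTX_SRC), "scale", &["scale_kernel"])?;
    let kernel = dev.get_func("scale", "scale_kernel").unwrap();

    // Copy input to the device, launch the kernel, and copy the result back.
    let mut data = dev.htod_copy(vec![1.0f32; 1024])?;
    let cfg = LaunchConfig::for_num_elems(1024);
    unsafe { kernel.launch(cfg, (&mut data, 2.0f32, 1024i32)) }?;
    let host = dev.dtoh_sync_copy(&data)?;
    println!("first element after scaling: {}", host[0]);
    Ok(())
}
```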

Another way to do all this is to use nvcc in a build script, as you've done, and then include the compiled PTX source in the Rust code with const PTX_SRC: &str = include_str!(concat!(env!("OUT_DIR"), "/<name of file>.ptx"));. I detail this a bit more in this blog post: https://coreylowman.github.io/2023/04/07/cudarc-stack.html#at-compile-time-with-nvcc

Yep! That's exactly what I did. Glad we arrived at the same solution.

Awesome, yeah if you have any other ways to improve this definitely let me know 😀 An example in this repo would be great

Sure, will write something up. I'll create a pull request so you can review it when ready.

Closing as the PR was merged, thanks for opening that!