AMReX-Astro / mini-Castro

A mini-app version of Castro

Write code generation tool for creation of CUDA wrappers to device functions

maxpkatz opened this issue · comments

This issue describes a recommended approach for launching Fortran functions as CUDA kernels. It uses as an example the Castro function

ca_compute_temp(const int* lo, const int* hi, const Real* state, const int* state_lo, const int* state_hi)

To run this function on the device, it must be called as a device function from inside a CUDA kernel. This is accomplished by wrapping the function declaration in DEVICE_LAUNCHABLE(), as:

DEVICE_LAUNCHABLE(ca_compute_temp(const int* lo, const int* hi, const Real* state, const int* state_lo, const int* state_hi));

(When we're not compiling for the device, DEVICE_LAUNCHABLE will be a simple C++ preprocessor function macro that expands to the bare declaration, unchanged.)

When compiling for the device, the macro should expand to:

__device__ void ca_compute_temp
(const int* lo, const int* hi, const Real* state, const int* state_lo, const int* state_hi);

__global__ void cuda_ca_compute_temp
(const int* lo, const int* hi, const Real* state, const int* state_lo, const int* state_hi);

That is, the macro should prepend __device__ to the declaration of the target Fortran function (whose Fortran implementation must be manually marked with attributes(device)). It should also create a second declaration with the same arguments, prefixed with cuda_.
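As a sketch of how a generator could perform this expansion, here is illustrative Python (not the eventual device_script.py, and assuming one declaration per line):

```python
import re

def expand_device_launchable(line):
    """Expand a DEVICE_LAUNCHABLE(...) declaration into the pair of
    declarations described above: a __device__ declaration for the
    Fortran function and a __global__ declaration for its cuda_ wrapper.
    Hypothetical sketch; not the actual device_script.py."""
    m = re.match(r"\s*DEVICE_LAUNCHABLE\((\w+)\((.*)\)\);\s*$", line)
    if m is None:
        return line  # not a wrapped declaration; pass through untouched
    name, args = m.groups()
    device_decl = "__device__ void {}({});".format(name, args)
    global_decl = "__global__ void cuda_{}({});".format(name, args)
    return device_decl + "\n\n" + global_decl
```

For the ca_compute_temp example above, this produces exactly the two declarations shown below.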

The new cuda_ function should look like:

__global__ void cuda_ca_compute_temp
(const int* lo, const int* hi, const amrex::Real* state, const int* state_lo, const int* state_hi)
{
int blo[3];
int bhi[3];
get_loop_bounds(blo, bhi, lo, hi);
ca_compute_temp(blo, bhi, state, state_lo, state_hi);
}

and should be defined in a separate compilation unit, not in the header file (a reasonable choice would be a single .cpp file that contains all of the newly generated CUDA wrappers).

Note that get_loop_bounds is a function that is found in AMReX_Device.H.
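The real get_loop_bounds lives in AMReX_Device.H; as a rough illustration of the index math it performs, here is a pure-Python stand-in that maps a flattened global thread index to a single zone (assuming one zone per thread, which is one possible mapping, not necessarily the one AMReX uses):

```python
def get_loop_bounds(global_idx, lo, hi):
    # Map a flattened global thread index to a single zone (blo == bhi),
    # assuming one zone per thread. Returns None for out-of-range threads,
    # which in the kernel would simply do no work.
    nx = hi[0] - lo[0] + 1
    ny = hi[1] - lo[1] + 1
    nz = hi[2] - lo[2] + 1
    if global_idx >= nx * ny * nz:
        return None
    i = lo[0] + global_idx % nx
    j = lo[1] + (global_idx // nx) % ny
    k = lo[2] + global_idx // (nx * ny)
    return ([i, j, k], [i, j, k])
```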

The corresponding call to this function should be:

DEVICE_LAUNCH(ca_compute_temp(lo, hi, state, state.loVect(), state.hiVect()));

This should be replaced by:

 dim3 numThreads, numBlocks;
 amrex::Device::c_threads_and_blocks(lo, hi, numBlocks, numThreads);
 cuda_ca_compute_temp<<<numBlocks, numThreads, 0, amrex::Device::cudaStream()>>>(lo, hi, state, state.loVect(), state.hiVect());

Note that this imposes the requirement that lo and hi be the first two arguments of the function, so that they can be replaced by the zone indices assigned to each CUDA thread.
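The launch-configuration arithmetic behind c_threads_and_blocks amounts to a ceiling division per dimension: one thread per zone, with enough blocks to cover [lo, hi]. A sketch, where the (8, 8, 8) block shape is an arbitrary assumption rather than the value AMReX actually uses:

```python
def threads_and_blocks(lo, hi, threads_per_dim=(8, 8, 8)):
    # Hypothetical sketch of the launch-configuration math behind
    # amrex::Device::c_threads_and_blocks: ceiling division in each
    # dimension so numBlocks * numThreads covers every zone in [lo, hi].
    num_blocks = tuple(
        (hi[d] - lo[d] + 1 + threads_per_dim[d] - 1) // threads_per_dim[d]
        for d in range(3))
    return num_blocks, threads_per_dim
```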

One issue is that we often use C++ macros in these calls, e.g.

void ca_compute_temp(const int* lo, const int* hi, const BL_FORT_FAB_ARG_3D(state));

This can make it tricky to know what the actual arguments are, so we need to make sure we run these files through cpp before generating the wrappers.
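Alternatively, the generator could expand the macro itself rather than invoking cpp. The expansion assumed below (a data pointer plus _lo/_hi index arrays) approximates what BL_FORT_FAB_ARG_3D produces but is not copied from AMReX:

```python
import re

def expand_fab_arg_3d(signature):
    # Rough stand-in for running the header through cpp: expand
    # BL_FORT_FAB_ARG_3D(x) into a data pointer plus lo/hi index arrays.
    # The exact expansion is an assumption; the real macro lives in AMReX.
    return re.sub(
        r"BL_FORT_FAB_ARG_3D\((\w+)\)",
        r"amrex::Real* \1, const int* \1_lo, const int* \1_hi",
        signature)
```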

I can take a pass at implementing this, but I need some code that uses it to test it out.

A nicer, though more technically challenging, implementation:

If the function is wrapped with DEVICE_LAUNCH() in the C++ code, then the script will search for the corresponding function declaration in the header, and automatically modify it with __device__ and add the corresponding __global__ CUDA wrapper. This removes the need to mark up both the function call and the header.

Carrying that thought further, using similar machinery, we could also search for the corresponding Fortran function implementation and mark it up with attributes(device) automatically.

In yet more speculative territory, there is the possibility that we could use this to create two versions of the function, one with attributes(device) and one without. This would allow the possibility of also calling it from the host. This can be a "version 2.0" feature but is something that is strongly on my radar.

Update: we are not prepending __device__.

First pass at this is done, in StarLord/Source/device_script.py.

It was pointed out to me last week that a much cleaner approach than using macros is to do something like

#pragma gpu

on the line above a call that we want to execute on the GPU. We could put a corresponding

#pragma gpu_target

on the line above the header declarations.
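A minimal sketch of how the generator might locate the calls flagged by this pragma (the function name find_gpu_pragma_calls is hypothetical, and multi-line calls are ignored for simplicity):

```python
def find_gpu_pragma_calls(lines):
    # Collect each statement that directly follows a "#pragma gpu" line,
    # so the generator knows which calls to replace with kernel launches.
    targets = []
    for i, line in enumerate(lines):
        if line.strip() == "#pragma gpu" and i + 1 < len(lines):
            targets.append(lines[i + 1].strip())
    return targets
```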

but we'd still need the script to write the additional interfaces, right?

Yes. This doesn't change any of the requirements, only the syntax we're using.

Status update: I've cleaned up the script a bit so that we now write

void DEVICE_LAUNCHABLE(ca_compute_temp(...));

around function signatures, and

DEVICE_LAUNCH(ca_compute_temp)(...);

around kernel launches. This generates the kernel and calls the device function ca_compute_temp in a grid-stride loop kernel.
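In a grid-stride loop, each thread starts at its own global index and advances by the total thread count until the work is exhausted, so any number of zones is covered regardless of the launch size. A pure-Python simulation of that index arithmetic:

```python
def grid_stride_indices(thread_id, num_threads, total_work):
    # Grid-stride loop pattern: thread thread_id handles work items
    # thread_id, thread_id + num_threads, thread_id + 2*num_threads, ...
    # Together, the threads cover every item exactly once.
    return list(range(thread_id, total_work, num_threads))
```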

Currently we are still required to manually insert an #include of the generated file into the target header, e.g., Castro_F.H must include cuda_Castro_F.H. The next step there is to automate this in the build system.

Next target is to clean up the AMReX Box class by handling the copying of the box indices through this interface generation.

hasn't this been done?

Yes.