AMReX-Astro / mini-Castro

A mini-app version of Castro

Write code generation tool for creation of CUDA wrappers to device functions

maxpkatz opened this issue · comments

This issue describes a recommended approach for launching Fortran functions as CUDA kernels. It uses as an example the Castro function

ca_compute_temp(const int* lo, const int* hi, const Real* state, const int* state_lo, const int* state_hi)

To run this function on the device, it must be called as a device function from inside a CUDA kernel. This is accomplished by wrapping the function declaration in DEVICE_LAUNCHABLE(), as:

DEVICE_LAUNCHABLE(ca_compute_temp(const int* lo, const int* hi, const Real* state, const int* state_lo, const int* state_hi));

(When we're not compiling for the device, DEVICE_LAUNCHABLE will be a simple C++ preprocessor function macro that expands to the bare declaration, unchanged.)

When compiling for the device, the macro should expand to:

__device__ void ca_compute_temp
(const int* lo, const int* hi, const Real* state, const int* state_lo, const int* state_hi);

__global__ void cuda_ca_compute_temp
(const int* lo, const int* hi, const Real* state, const int* state_lo, const int* state_hi);

That is, the macro should prepend __device__ to the declaration of the target Fortran function (whose Fortran implementation must be manually marked with attributes(device)). It should also create a second declaration with the same arguments, prefixed with cuda_.
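As a sketch of how a generator could perform this expansion, here is illustrative Python (not the eventual device_script.py, and assuming one declaration per line):

```python
import re

def expand_device_launchable(line):
    """Expand a DEVICE_LAUNCHABLE(...) declaration into the pair of
    declarations described above: a __device__ declaration for the
    Fortran function and a __global__ declaration for its cuda_ wrapper.
    Hypothetical sketch; not the actual device_script.py."""
    m = re.match(r"\s*DEVICE_LAUNCHABLE\((\w+)\((.*)\)\);\s*$", line)
    if m is None:
        return line  # not a wrapped declaration; pass through untouched
    name, args = m.groups()
    device_decl = "__device__ void {}({});".format(name, args)
    global_decl = "__global__ void cuda_{}({});".format(name, args)
    return device_decl + "\n\n" + global_decl
```

For the ca_compute_temp example above, this produces exactly the two declarations shown below.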

The new cuda_ function should look like:

__global__ void cuda_ca_compute_temp
(const int* lo, const int* hi, const amrex::Real* state, const int* state_lo, const int* state_hi)
{
int blo[3];
int bhi[3];
get_loop_bounds(blo, bhi, lo, hi);
ca_compute_temp(blo, bhi, state, state_lo, state_hi);
}

and should be defined in a separate compilation unit, not in the header file (a reasonable choice would be a single .cpp file that contains all of the newly generated CUDA wrappers).

Note that get_loop_bounds is a function that is found in AMReX_Device.H.
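The real get_loop_bounds lives in AMReX_Device.H; as a rough illustration of the index math it performs, here is a pure-Python stand-in that maps a flattened global thread index to a single zone (assuming one zone per thread, which is one possible mapping, not necessarily the one AMReX uses):

```python
def get_loop_bounds(global_idx, lo, hi):
    # Map a flattened global thread index to a single zone (blo == bhi),
    # assuming one zone per thread. Returns None for out-of-range threads,
    # which in the kernel would simply do no work.
    nx = hi[0] - lo[0] + 1
    ny = hi[1] - lo[1] + 1
    nz = hi[2] - lo[2] + 1
    if global_idx >= nx * ny * nz:
        return None
    i = lo[0] + global_idx % nx
    j = lo[1] + (global_idx // nx) % ny
    k = lo[2] + global_idx // (nx * ny)
    return ([i, j, k], [i, j, k])
```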

The corresponding call to this function should be:

DEVICE_LAUNCH(ca_compute_temp(lo, hi, state, state.loVect(), state.hiVect()));

This should be replaced by:

 dim3 numThreads, numBlocks;
 amrex::Device::c_threads_and_blocks(lo, hi, numBlocks, numThreads);
 cuda_ca_compute_temp<<<numBlocks, numThreads, 0, amrex::Device::cudaStream()>>>(lo, hi, state, state.loVect(), state.hiVect());

Note that this imposes the requirement that lo and hi be the first two arguments of the function, so that they can be replaced by the zone indices assigned to each CUDA thread.
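The launch-configuration arithmetic behind c_threads_and_blocks amounts to a ceiling division per dimension: one thread per zone, with enough blocks to cover [lo, hi]. A sketch, where the (8, 8, 8) block shape is an arbitrary assumption rather than the value AMReX actually uses:

```python
def threads_and_blocks(lo, hi, threads_per_dim=(8, 8, 8)):
    # Hypothetical sketch of the launch-configuration math behind
    # amrex::Device::c_threads_and_blocks: ceiling division in each
    # dimension so numBlocks * numThreads covers every zone in [lo, hi].
    num_blocks = tuple(
        (hi[d] - lo[d] + 1 + threads_per_dim[d] - 1) // threads_per_dim[d]
        for d in range(3))
    return num_blocks, threads_per_dim
```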

One issue is that we often use C++ macros in these calls, e.g.

void ca_compute_temp(const int* lo, const int* hi, const BL_FORT_FAB_ARG_3D(state));

This can make it tricky to know what the actual arguments are, so we need to make sure we run these files through cpp before generating the wrappers.
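Alternatively, the generator could expand the macro itself rather than invoking cpp. The expansion assumed below (a data pointer plus _lo/_hi index arrays) approximates what BL_FORT_FAB_ARG_3D produces but is not copied from AMReX:

```python
import re

def expand_fab_arg_3d(signature):
    # Rough stand-in for running the header through cpp: expand
    # BL_FORT_FAB_ARG_3D(x) into a data pointer plus lo/hi index arrays.
    # The exact expansion is an assumption; the real macro lives in AMReX.
    return re.sub(
        r"BL_FORT_FAB_ARG_3D\((\w+)\)",
        r"amrex::Real* \1, const int* \1_lo, const int* \1_hi",
        signature)
```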

I can take a pass at implementing this, but I need some code that uses it to test it out.

A nicer, though more technically challenging, implementation:

If the function is wrapped with DEVICE_LAUNCH() in the C++ code, then the script will search for the corresponding function declaration in the header, and automatically modify it with __device__ and add the corresponding __global__ CUDA wrapper. This removes the need to mark up both the function call and the header.

Carrying that thought further, using similar machinery, we could also search for the corresponding Fortran function implementation and mark it up with attributes(device) automatically.

In yet more speculative territory, there is the possibility that we could use this to create two versions of the function, one with attributes(device) and one without. This would allow the possibility of also calling it from the host. This can be a "version 2.0" feature but is something that is strongly on my radar.

Update: we are not prepending __device__.

First pass at this is done, in StarLord/Source/device_script.py.

It was pointed out to me last week that a much cleaner approach than using macros is to do something like

#pragma gpu

on the line above a call that we want to execute on the GPU. We could put a corresponding

#pragma gpu_target

on the line above the header declarations.
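A minimal sketch of how the generator might locate the calls flagged by this pragma (the function name find_gpu_pragma_calls is hypothetical, and multi-line calls are ignored for simplicity):

```python
def find_gpu_pragma_calls(lines):
    # Collect each statement that directly follows a "#pragma gpu" line,
    # so the generator knows which calls to replace with kernel launches.
    targets = []
    for i, line in enumerate(lines):
        if line.strip() == "#pragma gpu" and i + 1 < len(lines):
            targets.append(lines[i + 1].strip())
    return targets
```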

but we'd still need the script to write the additional interfaces, right?

Yes. This doesn't change any of the requirements, only the syntax we're using.

Status update: I've cleaned up the script a bit so that we now write

void DEVICE_LAUNCHABLE(ca_compute_temp(...));

around function signatures, and

DEVICE_LAUNCH(ca_compute_temp)(...);

around kernel launches. This generates the kernel and calls the device function ca_compute_temp in a grid-stride loop kernel.
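In a grid-stride loop, each thread starts at its own global index and advances by the total thread count until the work is exhausted, so any number of zones is covered regardless of the launch size. A pure-Python simulation of that index arithmetic:

```python
def grid_stride_indices(thread_id, num_threads, total_work):
    # Grid-stride loop pattern: thread thread_id handles work items
    # thread_id, thread_id + num_threads, thread_id + 2*num_threads, ...
    # Together, the threads cover every item exactly once.
    return list(range(thread_id, total_work, num_threads))
```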

Currently we are still required to manually insert an #include of the generated file into the target header, e.g., Castro_F.H must include cuda_Castro_F.H. The next step there is to automate this in the build system.

Next target is to clean up the AMReX Box class by handling the copying of the box indices through this interface generation.

hasn't this been done?

Yes.