coreylowman / cudarc

Safe rust wrapper around CUDA toolkit

Different behavior on Debug vs Release

jafioti opened this issue

I'm using cudarc to do tensor operations, and I'm observing that sometimes my output buffers contain all zeros after the computation. Notably, this only happens in release mode; when I switch back to debug, the output is correct.

Are there any known behavior changes between the two modes? Could it be that the host code runs too fast in release and reads the data back before the GPU has finished, whereas in debug the host is slow enough?

I can produce a proper reproduction, but it will take some time since the examples I have are currently pretty intertwined with my project. Just wondering if there are known situations where this behavior can happen.

Also, is there a chance that when I define two kernels with the same function and module name, get_func won't default to the most recently loaded function, but could select the other function with the same name?

Can you try explicitly synchronizing before you try to read data?

Right now I'm using the dtoh_sync_copy function. Is there another way to synchronize?

I've also inserted synchronize() calls after each kernel launch, but it doesn't seem to help.

Now I'm also getting a CUDA_ERROR_ILLEGAL_ADDRESS error when I go to run a kernel. Again, this doesn't happen in debug mode.

A while ago I noticed weird behavior (e.g. unit tests failing in release but passing in debug) with DeviceRepr::as_kernel_param, which I guessed was due to the unsoundness of turning a &self reference into a pointer address:

pub unsafe trait DeviceRepr {
    #[inline(always)]
    fn as_kernel_param(&self) -> *mut std::ffi::c_void {
        self as *const Self as *mut _
    }
}

Did you manually implement DeviceRepr for any of the types you're using?
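To make the concern concrete, here is a small sketch using a stand-in trait (`KernelParam` is a hypothetical name, not cudarc's trait) that mirrors the default method above. It only illustrates that the returned pointer is the address of the value the method was called on, so that value presumably has to stay alive and in place until the launch has consumed the pointer.

```rust
use std::ffi::c_void;

/// Hypothetical stand-in that mirrors the default method quoted above.
trait KernelParam {
    fn as_kernel_param(&self) -> *mut c_void {
        self as *const Self as *mut c_void
    }
}

impl KernelParam for i32 {}

/// Returns true when the parameter pointer is exactly the address of `x`.
fn points_at_value(x: &i32) -> bool {
    x.as_kernel_param() as usize == x as *const i32 as usize
}

fn main() {
    let a = 7i32;
    // The "kernel param" is nothing more than the address of `a`.
    assert!(points_at_value(&a));
}
```

If the value goes out of scope or is moved before the pointer is used, the pointer no longer refers to it, and nothing in the type system flags that.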

A minimal example would be really helpful

I just found the issue (I believe). It turns out to be a problem with converting the inputs to pointers. I am filling in a dynamic number of inputs, and converting them to pointers works fine in debug mode. In release mode, some optimization gets rid of the local variables the pointers refer to (or perhaps inlines them somehow). The best part is that it's a heisenbug: sometimes it works, sometimes it doesn't. The fix is to fill a stack-allocated, fixed-size array ([i32; 10]) instead, and then pass each element as a pointer offset from the base address of the array.

Like so:

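for i in 0..num_inputs {
    // Raw pointers don't support `+`, so offset in bytes from the base of the array.
    let base = inps[0].as_kernel_param() as *mut u8;
    input_params.push(base.wrapping_add(i * size_of::<i32>()) as *mut std::ffi::c_void);
}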

and that works fine in release mode.
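For anyone hitting the same thing, the idea can be sketched without cudarc at all. The snippet below is only an illustration under assumptions: `param_ptrs` and `read_params` are hypothetical helpers, and `read_params` merely stands in for a kernel launch by reading each value back through its pointer. The point it shows is that every pointer refers to a slot in backing storage that outlives the "launch", rather than to a temporary created inside the loop.

```rust
use std::ffi::c_void;

/// Hypothetical helper: one parameter pointer per element of `values`.
/// The pointers refer into `values`, so the caller must keep that storage
/// alive and unmoved until the pointers have been consumed.
fn param_ptrs(values: &mut [i32]) -> Vec<*mut c_void> {
    values
        .iter_mut()
        .map(|v| v as *mut i32 as *mut c_void)
        .collect()
}

/// Hypothetical stand-in for a kernel launch: reads each parameter back.
///
/// # Safety
/// Every pointer in `params` must point to a live, initialized `i32`.
unsafe fn read_params(params: &[*mut c_void]) -> Vec<i32> {
    params
        .iter()
        .map(|&p| unsafe { *(p as *const i32) })
        .collect()
}

fn main() {
    // Fixed-size backing storage; only the first `num_inputs` slots are used.
    let mut storage = [0i32; 10];
    let num_inputs = 3;
    for (i, slot) in storage.iter_mut().take(num_inputs).enumerate() {
        *slot = (i as i32 + 1) * 10;
    }

    let params = param_ptrs(&mut storage[..num_inputs]);

    // `storage` is still in scope here, so every pointer is still valid.
    let seen = unsafe { read_params(&params) };
    assert_eq!(seen, vec![10, 20, 30]);
}
```

The failing pattern would be the opposite: computing a value into a loop-local variable, pushing a pointer to it, and letting the variable go out of scope before the launch. That is undefined behavior, which is consistent with it appearing to work in debug and breaking in release.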

That was like 5 days of debugging lol.

Ok yes I've validated that was the problem, and that is a valid solution. Hope I save someone else a lot of time in the future.