Different behavior on Debug vs Release

Question

jafioti opened this issue 10 months ago

I'm using cudarc to do tensor operations, and I'm observing that my output buffers sometimes contain all zeros after the computation. Notably, this only happens in release mode; when I switch back to debug, the output is correct.

Are there any known behavior changes between the two modes? Could it be that the host code runs too fast in release and reads the data back before synchronizing with the GPU, while in debug the host is slow enough to hide the problem?

I can produce a proper reproduction, but it will take some time since the examples I have are currently pretty intertwined with my project. Just wondering if there are known situations where this behavior can happen.

Joe Fioti · Answer 1 · Fri Sep 22 2023 12:25:08 GMT+0800 (China Standard Time)

Also, if I define two kernels with the same function and module name, is there a chance that get_func won't default to the most recently loaded function and could select the other function with the same name instead?

Viliam Vadocz · Answer 2 · Fri Sep 22 2023 15:36:00 GMT+0800 (China Standard Time)

Can you try explicitly synchronizing before you try to read data?

Joe Fioti · Answer 3 · Fri Sep 22 2023 21:34:18 GMT+0800 (China Standard Time)

Right now I'm using the dtoh sync copy function. Is there another way to synchronize?

Joe Fioti · Answer 4 · Fri Sep 22 2023 23:54:58 GMT+0800 (China Standard Time)

I've also inserted synchronize() calls after each kernel launch, but it doesn't seem to help.

Joe Fioti · Answer 5 · Sat Sep 23 2023 01:45:23 GMT+0800 (China Standard Time)

Now I'm also getting a CUDA_ERROR_ILLEGAL_ADDRESS error when I go to run a kernel. Again, this doesn't happen in debug mode.

Corey Lowman · Answer 6 · Mon Sep 25 2023 21:06:33 GMT+0800 (China Standard Time)

I remember a while ago I had noticed weird behavior (e.g. unit tests were failing in release but passing in debug) with the DeviceRepr::as_kernel_param, which I had guessed was due to unsoundness of turning a &self reference into a pointer address:

pub unsafe trait DeviceRepr {
    #[inline(always)]
    fn as_kernel_param(&self) -> *mut std::ffi::c_void {
        self as *const Self as *mut _
    }
}

Did you manually implement DeviceRepr for any of the types you're using?

A minimal example would be really helpful

Joe Fioti · Answer 7 · Mon Sep 25 2023 23:46:27 GMT+0800 (China Standard Time)

I just found the issue (I believe). It turns out to be a problem with converting the inputs to pointers. I am filling in a dynamic number of inputs, and converting them to pointers works fine in debug mode. In release mode, some optimization gets rid of the local pointers (or perhaps they are inlined somehow), and the best part is that it's a heisenbug: sometimes it works, sometimes it doesn't. The key is to instead fill out a stack-allocated fixed-size array ([i32; 10]) and then pass in each element as a pointer offset from the base address of the array.

Like so:

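for i in 0..num_inputs {
    // Raw pointers have no `+` operator, so offset in bytes through a u8 cast.
    let base = inps[0].as_kernel_param() as *mut u8;
    input_params.push(base.wrapping_add(i * size_of::<i32>()) as *mut std::ffi::c_void);
}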

and that works fine in release mode.
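For anyone who wants to see the pattern without a GPU, here is a minimal sketch in plain std Rust. It assumes (the thread does not confirm this) that the original bug came from taking pointers to short-lived locals. The free function as_kernel_param below is a hypothetical stand-in that mirrors the trait method quoted earlier, and build_params and read_params are illustrative helpers, not cudarc APIs.

```rust
use std::ffi::c_void;

/// Hypothetical stand-in for `DeviceRepr::as_kernel_param`:
/// turns a reference into an untyped pointer to the referent.
fn as_kernel_param<T>(value: &T) -> *mut c_void {
    value as *const T as *mut c_void
}

/// Builds one parameter pointer per element of `storage`.
/// The pointers are valid only while `storage` is alive and not moved.
fn build_params(storage: &[i32]) -> Vec<*mut c_void> {
    storage.iter().map(|v| as_kernel_param(v)).collect()
}

/// Reads each parameter back, roughly as a launch would dereference it.
///
/// # Safety
/// Every pointer in `params` must point at a live `i32`.
unsafe fn read_params(params: &[*mut c_void]) -> Vec<i32> {
    params
        .iter()
        .map(|&p| unsafe { *(p as *const i32) })
        .collect()
}

fn main() {
    // Fixed-size stack storage, declared before the pointer list so that
    // it outlives every pointer taken into it.
    let inps: [i32; 10] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
    let num_inputs = 4;

    // Pattern to avoid (left as a comment because it is undefined behavior):
    //
    //     for i in 0..num_inputs {
    //         let tmp = i as i32;                 // dies when the iteration ends
    //         params.push(as_kernel_param(&tmp)); // dangling afterwards
    //     }

    let params = build_params(&inps[..num_inputs]);
    let seen = unsafe { read_params(&params) };
    assert_eq!(seen, vec![0, 1, 2, 3]);
    println!("{:?}", seen);
}
```

The design point is only that the storage must outlive every use of the pointers. A dangling pointer can appear to work in debug builds because the stack slot has not been reused yet, which would match the heisenbug behavior described above.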

That was like 5 days of debugging lol.

Joe Fioti · Answer 8 · Tue Sep 26 2023 02:13:34 GMT+0800 (China Standard Time)

Ok, yes, I've validated that this was the problem and that the fix above is a valid solution. Hope this saves someone else a lot of time in the future.