Zero length allocation failure
agerasev opened this issue · comments
Hi!
I'm facing an issue with zero length memory allocation (while trying to run candle
on GTX 970). Here is the minimal reproducer:
let dev = cudarc::driver::CudaDevice::new(0).unwrap();
dev.null::<f32>().unwrap();
On my machine it fails with DriverError(CUDA_ERROR_INVALID_VALUE, "invalid argument"). With this workaround it works fine.
I didn't find documentation for cuMemAlloc_v2, but for cuMemAlloc it says:

If bytesize is 0, cuMemAlloc() returns CUDA_ERROR_INVALID_VALUE

Maybe cuMemAlloc_v2 shouldn't be called at all if num_bytes is zero?
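One possible shape for such a guard (a sketch only, not cudarc's actual API: `raw_mem_alloc` and `alloc_guarded` are hypothetical names, and the driver call is mocked so the example runs without a GPU): check for zero bytes before ever reaching the driver, and hand back a null device pointer for empty buffers.

```rust
// Hypothetical sketch: skip the driver call entirely for zero-byte requests.
// `raw_mem_alloc` stands in for cuMemAlloc_v2; on some setups (e.g. the GTX 970
// in this issue) the real call returns CUDA_ERROR_INVALID_VALUE for 0 bytes.

#[derive(Debug, PartialEq)]
enum DriverError {
    InvalidValue,
}

// Mock of the driver call: rejects zero-byte requests, as the cuMemAlloc
// documentation permits.
fn raw_mem_alloc(num_bytes: usize) -> Result<u64, DriverError> {
    if num_bytes == 0 {
        return Err(DriverError::InvalidValue);
    }
    Ok(0xdead_beef) // stand-in device pointer
}

// Guarded wrapper: a zero-length request never touches the driver and
// yields a null device pointer, which an empty buffer never dereferences.
fn alloc_guarded(num_bytes: usize) -> Result<u64, DriverError> {
    if num_bytes == 0 {
        return Ok(0);
    }
    raw_mem_alloc(num_bytes)
}

fn main() {
    assert_eq!(raw_mem_alloc(0), Err(DriverError::InvalidValue));
    assert_eq!(alloc_guarded(0), Ok(0));
    assert_eq!(alloc_guarded(16), Ok(0xdead_beef));
    println!("ok");
}
```

Whether a null device pointer is safe to thread through the rest of the driver API is exactly the open question discussed below; this only illustrates where the check would sit.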
My system:
$ uname -a
Linux 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
$ nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Mon Dec 4 13:40:18 2023
Driver Version : 525.125.06
CUDA Version : 12.0
Attached GPUs : 1
GPU 00000000:03:00.0
Product Name : NVIDIA GeForce GTX 970
Product Brand : GeForce
Product Architecture : Maxwell
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Enabled
...
I think this function is behaving as it should - it's returning a result (and the unwrap turns it into a panic). I think this should probably be raised as an issue on candle's repo. Do you know where in candle it's coming from?
Do you know where in candle it's coming from?
It can occur in many places in candle_core::cuda_backend where alloc or htod_copy is called. There are no checks for zero length there; the calls are assumed to succeed.
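A caller-side guard is the other option. This is a hypothetical sketch of what code like candle's cuda backend could do before a host-to-device copy; `htod_copy_raw`, `htod_copy_guarded`, and `DeviceBuffer` are made-up names, and the driver path is mocked (rejecting zero-length input, as observed on the GTX 970) so the example runs anywhere.

```rust
// Hypothetical caller-side guard for htod_copy-style paths.

#[derive(Debug, PartialEq)]
struct DriverError(&'static str);

#[derive(Debug, PartialEq)]
struct DeviceBuffer {
    ptr: u64,
    len: usize,
}

// Mock driver path: allocate + copy, failing on zero-length input the way
// the real driver call does on the affected device.
fn htod_copy_raw(host: &[f32]) -> Result<DeviceBuffer, DriverError> {
    if host.is_empty() {
        return Err(DriverError("CUDA_ERROR_INVALID_VALUE"));
    }
    Ok(DeviceBuffer { ptr: 0x1000, len: host.len() })
}

// Guarded version: an empty host slice becomes an empty device buffer
// without touching the driver at all.
fn htod_copy_guarded(host: &[f32]) -> Result<DeviceBuffer, DriverError> {
    if host.is_empty() {
        return Ok(DeviceBuffer { ptr: 0, len: 0 });
    }
    htod_copy_raw(host)
}

fn main() {
    assert!(htod_copy_raw(&[]).is_err());
    assert_eq!(htod_copy_guarded(&[]), Ok(DeviceBuffer { ptr: 0, len: 0 }));
    assert_eq!(htod_copy_guarded(&[1.0, 2.0]).unwrap().len, 2);
    println!("ok");
}
```

The check could live either in each caller or once inside the library; either way the zero-length case never reaches the driver.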
I think this function is behaving as it should - it's returning a result (and the unwrap turns it into a panic).
The problem is that this behavior is inconsistent: on most devices a zero-length allocation seems to succeed (and candle relies on this), but on the GTX 970 it fails.
I'm not really sure what we can do in this case - it seems like a driver-level issue. We don't have any device-specific code in cudarc, so I'm not sure what the outcome should be. I'm hesitant to use a null pointer (i.e. not actually call cuMemAlloc) because I don't know what the downstream effects of that would be, or how the cuda driver would interact with such a pointer.
Can you print out the CudaDevice in your example? I want to see if is_async is false:
let dev = cudarc::driver::CudaDevice::new(0).unwrap();
println!("{:?}", dev);
Can you print out the CudaDevice in your example?
CudaDevice {
cu_device: 0,
cu_primary_ctx: 0x000055759b945ec0,
stream: 0x0000000000000000,
event: 0x000055759bc8d4f0,
modules: RwLock {
data: {},
poisoned: false,
..
},
ordinal: 0,
is_async: false,
}