coreylowman / cudarc

Safe rust wrapper around CUDA toolkit


Context management does not work with multiple devices on a single thread

coreylowman opened this issue

This is the bug causing #160, #161, and #108.

Basically, CUDA requires a context to be bound to the current thread. Currently this binding happens in CudaDevice::new, when result::ctx::set_current is called.

However, when holding multiple device references on a single thread, each driver call needs to ensure the current context is its own before actually doing the work.

This can be achieved by calling device.bind_to_thread()?; before making other API calls. I'm not sure how this interacts with other APIs such as cuBLAS/cuDNN/cuRAND.
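
For illustration, here's a minimal sketch of what multi-device code on one thread has to do with this approach (the exact CudaDevice::new / alloc_zeros signatures here are assumptions for the example):

let dev0 = CudaDevice::new(0)?;
let dev1 = CudaDevice::new(1)?;
// dev1's context is now the current one on this thread, since CudaDevice::new
// calls result::ctx::set_current last.

dev0.bind_to_thread()?; // re-bind dev0's context before using dev0
let a = dev0.alloc_zeros::<f32>(100)?;

dev1.bind_to_thread()?; // switch back to dev1's context before using dev1
let b = dev1.alloc_zeros::<f32>(100)?;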

Relevant snippet from https://docs.nvidia.com/cuda/cuda-driver-api/driver-vs-runtime-api.html#driver-vs-runtime-api:

Context management can be done through the driver API, but is not exposed in the runtime API. Instead, the runtime API decides itself which context to use for a thread: if a context has been made current to the calling thread through the driver API, the runtime will use that, but if there is no such context, it uses a "primary context." Primary contexts are created as needed, one per device per process, are reference-counted, and are then destroyed when there are no more references to them. Within one process, all users of the runtime API will share the primary context, unless a context has been made current to each thread. The context that the runtime uses, i.e, either the current context or primary context, can be synchronized with cudaDeviceSynchronize(), and destroyed with cudaDeviceReset().

Using the runtime API with primary contexts has its tradeoffs, however. It can cause trouble for users writing plug-ins for larger software packages, for example, because if all plug-ins run in the same process, they will all share a context but will likely have no way to communicate with each other. So, if one of them calls cudaDeviceReset() after finishing all its CUDA work, the other plug-ins will fail because the context they were using was destroyed without their knowledge. To avoid this issue, CUDA clients can use the driver API to create and set the current context, and then use the runtime API to work with it. However, contexts may consume significant resources, such as device memory, extra host threads, and performance costs of context switching on the device. This runtime-driver context sharing is important when using the driver API in conjunction with libraries built on the runtime API, such as cuBLAS or cuFFT.

The proposed idea in #108 to have a 2nd struct that enforces calls to bind_to_thread is good; however, it doesn't prevent other devices from also calling it. E.g. you could have:

let dev0_bound: BoundDevice = dev0.bind_to_thread()?;
let dev1_bound: BoundDevice = dev1.bind_to_thread()?;
dev0_bound.alloc_zeros(...); // would fail since it's using the dev1 context.

I think instead we may want to have an explicit context variable that binding borrows mutably:

let mut ctx = CudaContext::new(); // can just be an empty struct I think.
let dev0_bound: BoundDevice = ctx.bind(&dev0);
let dev1_bound: BoundDevice = ctx.bind(&dev1); // would force drop of dev0_bound because BoundDevice holds &mut CudaContext
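
Roughly, the borrow checker would then enforce the exclusivity for us (a sketch under the same assumptions; CudaContext::new, bind, and alloc_zeros are hypothetical here):

let mut ctx = CudaContext::new();
{
    let dev0_bound = ctx.bind(&dev0);
    let a = dev0_bound.alloc_zeros::<f32>(100)?; // ok: dev0's context is current
} // dev0_bound dropped here, releasing the &mut borrow of ctx
let dev1_bound = ctx.bind(&dev1); // ok; with dev0_bound still alive this would not compile
let b = dev1_bound.alloc_zeros::<f32>(100)?;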

Will need to think about how this changes the API, or if there's another way to do this.

I thought about this as a solution too. I think your example should take &mut dev0 instead, to get exclusive access.

I'm not sure there's any rush for this, but this was very trippy for sure.

Especially when trying to write this: https://github.com/coreylowman/cudarc/pull/164/files#diff-5ab8191eb12484f8b1591894776e24e40974a068e7b2d8aa5333dc46f947c35bR279-R294

(The error here triggered in NCCL, but it was actually my device arrays that were wrong, because I instantiated every device separately and then created each array.)

The thing with &mut context is that all the slices would end up relying on that handle somehow, meaning that code like the NCCL group calls would become invalid.

I'm not sure how relevant that is in real life, since I feel like 1 thread = 1 GPU might be much easier to reason about.

The thing with &mut context is that all the slices would end up relying on that handle somehow, meaning that code like the NCCL group calls would become invalid.

I think we could actually avoid this. Notably, the &mut context would only be needed to actually allocate/copy; you don't need it in the CudaSlice struct.

struct CudaDevice { ... } // same as now
struct CudaSlice<T> { ..., device: Arc<CudaDevice> } // same as now
struct Context;
struct BoundDevice<'a> {
    dev: Arc<CudaDevice>,
    ctx: &'a mut Context,
}
impl<'a> BoundDevice<'a> {
    pub fn alloc<T>(&self) -> CudaSlice<T> { ... }
}

For creating a Context, we'd have to do something like thread_context() to make sure people can't just instantiate two separate context objects.
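
A hypothetical usage sketch of that shape (thread_context(), bind, and alloc are all illustrative names, not existing APIs):

let mut ctx = thread_context(); // at most one Context handle per thread
{
    let bound0 = ctx.bind(&dev0);
    let a: CudaSlice<f32> = bound0.alloc(100); // dev0's context made current here
} // bound0 dropped, releasing the &mut borrow of ctx
let bound1 = ctx.bind(&dev1); // fine now that bound0 is gone
let b: CudaSlice<f32> = bound1.alloc(100); // dev1's context made current here

Note that the CudaSlices outlive the BoundDevice that created them, since they only hold an Arc<CudaDevice>, which is what would let things like the NCCL group pattern keep working.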

I'm not sure how relevant that is in real life, since I feel like 1 thread = 1 GPU might be much easier to reason about.

I'm not sure either, but it seems important to do something differently 🤔

The easiest non-breaking change would just be to call bind_to_thread before doing anything. And it should support 1 thread/1 GPU and 1 thread/N GPUs out of the box. The context approach seems like a closer model to the actual underlying logic, but there seem to be a lot of subtleties in getting it right.
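
As a sketch of the non-breaking variant, each device method would re-bind its own context internally before touching the driver (the signature and the DriverError name here are illustrative, not necessarily the current internals):

impl CudaDevice {
    pub fn alloc_zeros<T>(self: &Arc<Self>, len: usize) -> Result<CudaSlice<T>, DriverError> {
        // Re-bind this device's context on every call, so holding multiple
        // devices on one thread stays correct.
        self.bind_to_thread()?;
        // ... existing allocation logic ...
    }
}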

I kinda wish all the CUDA driver API calls just accepted a context as an argument 🤷