Use a second stream for de-allocating memory
coreylowman opened this issue · comments
Currently all operations, including de-allocations, happen on the default stream. In dfdx, after a long forward pass with many operations (e.g. 100 operations, each producing 1+ gradients), all gradients are captured in a Gradients object. After the forward pass is done, the Gradients object is dropped, which means ALL temporary gradients are de-allocated at once.
At the moment this blocks the default stream: every de-allocation has to finish before any other work can proceed.
Instead, we should put de-allocations on a second stream that is synchronized with the default stream with events:
- call cuEventCreate
- call cuEventRecord with the default stream
- call cuStreamWaitEvent with the event and the deallocation stream
- call free_async with the deallocation stream
This should free up the default stream to continue working.
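To make the intended ordering concrete, here is a toy model of the four steps in plain Rust. It makes no real CUDA calls: `Device`, `Stream`, `Event`, `Op` and their methods are hypothetical names invented for this sketch, and the scheduler is a simplification of what the driver does. It only illustrates the property we want: the free waits for the work recorded before the event, while later work on the default stream does not wait for the free.

```rust
// Toy model only: nothing here talks to CUDA. All names are hypothetical.

#[derive(Debug, Clone, PartialEq)]
enum Op {
    Kernel(&'static str),
    Free(&'static str),
    // Block this stream until `stream` has completed `count` operations.
    WaitFor { stream: usize, count: usize },
}

#[derive(Default)]
struct Stream {
    queue: Vec<Op>,
    done: usize, // number of queued operations already executed
}

#[derive(Default)]
struct Event {
    // (stream index, number of ops enqueued on it when the event was recorded)
    captured: Option<(usize, usize)>,
}

struct Device {
    streams: Vec<Stream>,
}

impl Device {
    fn new(n: usize) -> Self {
        Device { streams: (0..n).map(|_| Stream::default()).collect() }
    }

    fn enqueue(&mut self, s: usize, op: Op) {
        self.streams[s].queue.push(op);
    }

    // Stands in for cuEventRecord: capture the current tail of stream `s`.
    fn record(&self, ev: &mut Event, s: usize) {
        ev.captured = Some((s, self.streams[s].queue.len()));
    }

    // Stands in for cuStreamWaitEvent: stream `s` waits for the captured state.
    fn wait_event(&mut self, s: usize, ev: &Event) {
        if let Some((stream, count)) = ev.captured {
            self.enqueue(s, Op::WaitFor { stream, count });
        }
    }

    // Try to execute the next op on stream `s`; None means it cannot progress.
    fn step(&mut self, s: usize) -> Option<Op> {
        let op = self.streams[s].queue.get(self.streams[s].done)?.clone();
        if let Op::WaitFor { stream, count } = &op {
            if self.streams[*stream].done < *count {
                return None;
            }
        }
        self.streams[s].done += 1;
        Some(op)
    }
}

fn run_demo() -> Vec<(usize, Option<Op>)> {
    const DEFAULT: usize = 0;
    const DEALLOC: usize = 1;
    let mut dev = Device::new(2);
    let mut ev = Event::default(); // step 1: create the event

    dev.enqueue(DEFAULT, Op::Kernel("forward"));
    dev.record(&mut ev, DEFAULT); // step 2: record on the default stream
    dev.wait_event(DEALLOC, &ev); // step 3: dealloc stream waits on the event
    dev.enqueue(DEALLOC, Op::Free("grads")); // step 4: async free on dealloc stream
    dev.enqueue(DEFAULT, Op::Kernel("next_forward"));

    // Poll the streams in a fixed order and log what each one manages to do.
    let mut trace = Vec::new();
    for s in [DEALLOC, DEFAULT, DEALLOC, DEFAULT, DEALLOC] {
        trace.push((s, dev.step(s)));
    }
    trace
}

fn main() {
    for (stream, result) in run_demo() {
        match result {
            Some(op) => println!("stream {stream}: ran {op:?}"),
            None => println!("stream {stream}: cannot progress yet"),
        }
    }
}
```

In this model the de-allocation stream is stuck until `forward` has run, and `next_forward` runs on the default stream before the free does, which is the behavior the steps above are aiming for.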
Questions:
- Do we create a new event for each new de-allocation that happens? Or is it possible to have 1 event that the device holds that we use to synchronize?
- How do we free events with cuEventDestroy? And how does cuEventDestroy interact with cuStreamWaitEvent?
The CUDA docs say this about cuEventDestroy:
An event may be destroyed before it is complete (i.e., while cuEventQuery() would return CUDA_ERROR_NOT_READY). In this case, the call does not block on completion of the event, and any associated resources will automatically be released asynchronously at completion.
Does this mean we can just call cuEventDestroy right after we create it, and the stream will still synchronize?
Okay, from the CUDA docs:
Other APIs such as cuStreamWaitEvent() use the most recently captured state at the time of the API call, and are not affected by later calls to cuEventRecord().
This implies to me we can allocate a single event to use!
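A small toy model of why a single event could be enough, again in plain Rust with no CUDA calls. `Event`, `Stream`, `record` and `wait_event` are hypothetical stand-ins for this sketch, and it assumes the quoted sentence is the whole story: a wait copies whatever state the event holds at the time of the call, so re-recording the same event later does not change waits that were already issued.

```rust
// Toy model only: nothing here talks to CUDA. All names are hypothetical.

#[derive(Default)]
struct Event {
    // Position in the recording stream captured by the latest record().
    recorded_at: Option<u64>,
}

#[derive(Default)]
struct Stream {
    enqueued: u64,   // number of operations submitted so far
    waits: Vec<u64>, // positions this stream has been told to wait for
}

// Stands in for cuEventRecord: overwrite the event's captured state.
fn record(ev: &mut Event, s: &Stream) {
    ev.recorded_at = Some(s.enqueued);
}

// Stands in for cuStreamWaitEvent: copy the state captured right now.
fn wait_event(s: &mut Stream, ev: &Event) {
    if let Some(pos) = ev.recorded_at {
        s.waits.push(pos);
    }
}

fn reuse_single_event() -> Vec<u64> {
    let mut default_stream = Stream::default();
    let mut dealloc_stream = Stream::default();
    let mut ev = Event::default(); // created once and kept around

    // First round: 3 ops on the default stream, then sync the dealloc stream.
    default_stream.enqueued += 3;
    record(&mut ev, &default_stream);
    wait_event(&mut dealloc_stream, &ev);

    // Second round: 4 more ops, re-record the SAME event, sync again.
    default_stream.enqueued += 4;
    record(&mut ev, &default_stream);
    wait_event(&mut dealloc_stream, &ev);

    // The first wait still refers to position 3, not 7.
    dealloc_stream.waits
}

fn main() {
    println!("{:?}", reuse_single_event()); // prints [3, 7]
}
```

The first wait keeps pointing at position 3 even after the event is re-recorded at position 7, which is the property that would let the device hold one event and reuse it for every de-allocation.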