cupy / cupy

NumPy & SciPy for GPU

Home Page: https://cupy.dev


Better stream support in CuPy's default memory pool

kmaehashi opened this issue

Currently, CuPy's default memory pool arena is managed per stream. This leads to several difficulties when working with multiple CUDA streams:

  1. It is not possible for an ndarray (memory chunk) to declare a dependency on a specific stream. For example, if memory is allocated on stream A and then passed to a kernel executed on stream B, the memory chunk must not be reused until the kernel finishes on stream B. However, users have no way to declare such a dependency, so a race condition can occur if the user destroys the ndarray object while the kernel is still running (see the sketch after this list).
  2. The memory allocated in the arena for a specific stream is not freed automatically after the stream is destroyed. Currently, users are required to call mempool.free_all_blocks() after destroying a stream.
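A minimal sketch of the race in (1), assuming the per-stream arena behavior described above. The sizes and the kernel are illustrative; whether the freed chunk is actually handed back depends on the pool's state:

```python
import cupy

stream_a = cupy.cuda.Stream()
stream_b = cupy.cuda.Stream()

with stream_a:
    # Chunk is taken from (and will return to) stream_a's arena.
    x = cupy.arange(1 << 20, dtype=cupy.float32)

with stream_b:
    # Kernel is launched asynchronously on stream_b and reads x.
    y = cupy.sqrt(x)

del x  # x's chunk returns to stream_a's free list immediately...

with stream_a:
    # ...and an allocation of the same size on stream_a may be handed the
    # same chunk while the sqrt kernel on stream_b is still reading it.
    z = cupy.empty(1 << 20, dtype=cupy.float32)
```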

PyTorch provides the record_stream feature to address (1): an event is recorded on the dependent stream, and its completion is checked on every allocation. I think it would be good to have a similar API in CuPy as well. (2) should also be resolvable with a similar mechanism.
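For reference, this is roughly how PyTorch's existing Tensor.record_stream API is used; the surrounding setup here is illustrative:

```python
import torch

side_stream = torch.cuda.Stream()
x = torch.empty(1 << 20, device='cuda')  # allocated on the current stream

with torch.cuda.stream(side_stream):
    y = x * 2  # kernel consuming x runs on side_stream

# Declare that x is in use by side_stream: the caching allocator will not
# reuse x's memory until the work queued on side_stream at this point is done.
x.record_stream(side_stream)
del x  # safe: reuse is deferred until side_stream catches up
```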

> However, users have no way to declare such a dependency, so a race condition can occur if the user destroys the ndarray object while the kernel is still running.

I thought our position was to ask users to declare stream order themselves, using Stream.wait_event? cuQuantum Python does this extensively to support multiple streams. Has anything changed?
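A sketch of that manual approach using CuPy's existing Event and Stream.wait_event APIs; this is one way to make the deallocation in the earlier example safe by ordering the two streams explicitly:

```python
import cupy

stream_a = cupy.cuda.Stream()
stream_b = cupy.cuda.Stream()
event = cupy.cuda.Event()

with stream_a:
    x = cupy.arange(1 << 20, dtype=cupy.float32)

with stream_b:
    y = cupy.sqrt(x)  # consume x on stream_b
    event.record()    # mark the completion point on stream_b

# stream_a will not run work submitted after this call until the event fires,
# so reusing x's chunk from stream_a's arena is ordered after the kernel.
stream_a.wait_event(event)
del x
```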

> The memory allocated in the arena for a specific stream is not freed automatically after the stream is destroyed.

Yes. This can actually lead to suspicious performance jitter (free_all_blocks takes time to run) or even premature OOM (we have seen arenas that are not reclaimable, but it is a bit hard to reproduce; let me dig out a reproducer).
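For completeness, a minimal sketch of the current workaround for (2) using the public memory pool API; this is the manual step the issue proposes to eliminate:

```python
import cupy

stream = cupy.cuda.Stream()
with stream:
    x = cupy.zeros(1 << 20, dtype=cupy.float32)

del x       # chunk returns to this stream's arena
del stream  # but the arena itself is not reclaimed automatically

# Without this call, the destroyed stream's arena lingers in the pool.
mempool = cupy.get_default_memory_pool()
mempool.free_all_blocks()
```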