ofiwg / libfabric

Open Fabric Interfaces

Home Page:http://libfabric.org/


core: CUDA memory monitor left freed memory region in cache

wzamazon opened this issue · comments

Describe the bug
The CUDA memory monitor does not monitor the event of CUDA memory being released. Instead, it compares the buffer ID of the input memory region against the buffer IDs of the memory regions in the cache.

Therefore, when a registered memory region is freed (via cudaFree), the registration remains in the cache.

Such a dead region is only removed from the cache when the application tries to register a new region that overlaps with it.
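
For context, the buffer-ID comparison described above works roughly like the sketch below. This is an illustration, not libfabric's actual code; the `cache_entry` struct and its fields are hypothetical. It relies on the CUDA driver API's `CU_POINTER_ATTRIBUTE_BUFFER_ID`, which changes whenever an address range is backed by a different allocation, and it only runs when a cache lookup happens, which is why a freed region lingers until an overlapping registration comes along.

```c
/*
 * Illustrative sketch (not libfabric's actual code) of buffer-ID based
 * validation, assuming the cache stores the CUDA buffer ID recorded at
 * registration time.  cache_entry and its fields are hypothetical.
 */
#include <cuda.h>
#include <stdbool.h>
#include <stddef.h>

struct cache_entry {
	CUdeviceptr		addr;
	size_t			len;
	unsigned long long	buffer_id;	/* recorded when the region was cached */
};

/* Returns true if the cached entry still refers to the same allocation. */
static bool cache_entry_valid(struct cache_entry *entry)
{
	unsigned long long current_id;

	/*
	 * CU_POINTER_ATTRIBUTE_BUFFER_ID differs whenever the address range
	 * is backed by a different allocation.  If the memory was cudaFree'd
	 * and not reallocated, the query will typically fail -- but nothing
	 * calls this check until a lookup for an overlapping region happens,
	 * so the dead entry stays in the cache until then.
	 */
	if (cuPointerGetAttribute(&current_id, CU_POINTER_ATTRIBUTE_BUFFER_ID,
				  entry->addr) != CUDA_SUCCESS)
		return false;

	return current_id == entry->buffer_id;
}
```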

To Reproduce
Steps to reproduce the behavior:

  1. enable the MR cache for CUDA memory
  2. register a CUDA memory region via fi_mr_reg
  3. close the registration
  4. call cudaFree on the memory region
  5. check that the region is still in the cache (a minimal repro sketch follows)
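
A minimal repro sketch for the steps above, assuming a domain already opened with FI_HMEM support and the provider's MR cache enabled; error handling is omitted and the buffer size is arbitrary.

```c
#include <sys/uio.h>
#include <cuda_runtime.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

void repro(struct fid_domain *domain)
{
	void *buf;
	struct fid_mr *mr;
	struct iovec iov;
	struct fi_mr_attr attr = {0};

	cudaMalloc(&buf, 1 << 20);		/* 1 MiB device buffer */

	iov.iov_base = buf;
	iov.iov_len = 1 << 20;
	attr.mr_iov = &iov;
	attr.iov_count = 1;
	attr.access = FI_SEND | FI_RECV;
	attr.iface = FI_HMEM_CUDA;
	attr.device.cuda = 0;

	fi_mr_regattr(domain, &attr, 0, &mr);	/* populates the MR cache */
	fi_close(&mr->fid);			/* app is done with the MR */
	cudaFree(buf);				/* frees the device memory */

	/*
	 * At this point the cached registration for buf is still present;
	 * it is only evicted when a later registration overlaps it or the
	 * cache is flushed.
	 */
}
```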

Expected behavior
The CUDA memory monitor should monitor calls to cudaFree and remove a freed region from the cache.

The memory monitor doesn't allocate the memory, so it can't call free. I agree there's a problem here. Ze has the same issue. AFAICT, only RoCR has the support needed to release the registration by intercepting the free call. Otherwise, the registered memory will remain in the cache until it is flushed.


I think we need a new type of CUDA memory monitor that intercepts calls to cudaMalloc/cudaFree, like a CUDA version of memhooks.

We just need to extend the existing monitor. How do you intercept the malloc/free calls?


I think we can make symbols like cudaFree and cudaMalloc point to a new function.

Like what we did for the old memhooks monitor (not the new one based on patcher).
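
Something along the lines of the old memhooks-style symbol interposition, sketched below: provide our own cudaFree that notifies the cache before calling the real implementation. `notify_cuda_free()` is a hypothetical hook into the memory monitor, not an existing libfabric function, and as noted in the next comment this approach misses calls that nvcc resolved statically.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>

/*
 * Hypothetical hook: in a real monitor this would evict any cached
 * registration covering the freed allocation.  Stubbed out here.
 */
static void notify_cuda_free(void *ptr)
{
	(void) ptr;
}

/* Interposed cudaFree: notify the cache, then forward to the real symbol. */
cudaError_t cudaFree(void *devPtr)
{
	static cudaError_t (*real_cudaFree)(void *);

	if (!real_cudaFree)
		real_cudaFree = (cudaError_t (*)(void *))
				dlsym(RTLD_NEXT, "cudaFree");

	if (devPtr)
		notify_cuda_free(devPtr);

	return real_cudaFree(devPtr);
}
```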

There was a discussion a few years ago about intercepting instead of relying on buffer IDs. I think the issue with applying the old memhooks interception approach to cudaFree/cudaMalloc was that it could not handle applications compiled by nvcc, where the cudaFree/cudaMalloc calls were statically resolved.

Previous issue where intercepting was discussed: #5789

I think adding some context would help here. As far as I understand, with the mr_map code, the MR keys are used for lookup. This can cause an issue when looking up a stale key (a new registration gets a new key), but for the wrong region.

There isn't a problem accessing a stale region.

The problem is that the region, even once freed by the app, remains registered. This results in the memory not actually being released, which can result in future allocations failing. A stale region is not released until the region either falls off the LRU list or we register a new region that overlaps with its memory address.

RoCR has a call to handle this. Both CUDA and ZE leave a gap here.

Sorry if I missed this point. As long as we leave the buffer ID checks in place, intercepting cudaFree to also do evictions could work.
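
To make the combined idea concrete, here is a sketch of the eviction side, assuming the intercepted cudaFree can recover the allocation's base and length (e.g. via cuMemGetAddressRange) before the real free runs. The flat array standing in for the MR cache and the helper names are hypothetical; the buffer-ID check stays in place as the safety net for frees the interceptor cannot see.

```c
#include <stddef.h>
#include <stdint.h>

struct region {
	uintptr_t	addr;
	size_t		len;
	int		in_use;
};

#define CACHE_SIZE 128
static struct region cache[CACHE_SIZE];	/* stand-in for the MR cache */

static void remove_entry(struct region *r)
{
	/* A real implementation would deregister the MR here. */
	r->in_use = 0;
}

/* Called from the intercepted cudaFree with the allocation's base/length. */
static void evict_on_cuda_free(uintptr_t base, size_t len)
{
	for (int i = 0; i < CACHE_SIZE; i++) {
		struct region *r = &cache[i];

		if (r->in_use && r->addr < base + len &&
		    base < r->addr + r->len)
			remove_entry(r);
	}
}
```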

This issue is stale because it has been open 360 days with no activity. Remove stale label or comment, otherwise it will be closed in 7 days.