ofiwg / libfabric

Open Fabric Interfaces

Home Page:http://libfabric.org/


core: CUDA memory monitor left freed memory region in cache

wzamazon opened this issue · comments

Describe the bug
The CUDA memory monitor does not monitor the event of CUDA memory being released. Instead, it compares the buffer ID of the input memory region against the buffer IDs of the memory regions in the cache.

Therefore, when a registered memory region is freed (via cudaFree), the registration remains in the cache.

Such a dead region is only removed from the cache when the application tries to register a new region that overlaps with it.
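
For context, the buffer-ID comparison described above works roughly like the sketch below. This is an illustration, not libfabric's actual code; the `cache_entry` struct and its fields are hypothetical. It relies on the CUDA driver API's `CU_POINTER_ATTRIBUTE_BUFFER_ID`, which changes whenever an address range is backed by a different allocation, and it only runs when a cache lookup happens, which is why a freed region lingers until an overlapping registration comes along.

```c
/*
 * Illustrative sketch (not libfabric's actual code) of buffer-ID based
 * validation, assuming the cache stores the CUDA buffer ID recorded at
 * registration time.  cache_entry and its fields are hypothetical.
 */
#include <cuda.h>
#include <stdbool.h>
#include <stddef.h>

struct cache_entry {
	CUdeviceptr		addr;
	size_t			len;
	unsigned long long	buffer_id;	/* recorded when the region was cached */
};

/* Returns true if the cached entry still refers to the same allocation. */
static bool cache_entry_valid(struct cache_entry *entry)
{
	unsigned long long current_id;

	/*
	 * CU_POINTER_ATTRIBUTE_BUFFER_ID differs whenever the address range
	 * is backed by a different allocation.  If the memory was cudaFree'd
	 * and not reallocated, the query will typically fail -- but nothing
	 * calls this check until a lookup for an overlapping region happens,
	 * so the dead entry stays in the cache until then.
	 */
	if (cuPointerGetAttribute(&current_id, CU_POINTER_ATTRIBUTE_BUFFER_ID,
				  entry->addr) != CUDA_SUCCESS)
		return false;

	return current_id == entry->buffer_id;
}
```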

To Reproduce
Steps to reproduce the behavior:

  1. enable the MR cache for CUDA memory
  2. register a CUDA memory region via fi_mr_reg
  3. close the registration
  4. call cudaFree on the memory region
  5. check that the region is still in the cache (a minimal repro sketch follows)
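
A minimal repro sketch for the steps above, assuming a domain already opened with FI_HMEM support and the provider's MR cache enabled; error handling is omitted and the buffer size is arbitrary.

```c
#include <sys/uio.h>
#include <cuda_runtime.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

void repro(struct fid_domain *domain)
{
	void *buf;
	struct fid_mr *mr;
	struct iovec iov;
	struct fi_mr_attr attr = {0};

	cudaMalloc(&buf, 1 << 20);		/* 1 MiB device buffer */

	iov.iov_base = buf;
	iov.iov_len = 1 << 20;
	attr.mr_iov = &iov;
	attr.iov_count = 1;
	attr.access = FI_SEND | FI_RECV;
	attr.iface = FI_HMEM_CUDA;
	attr.device.cuda = 0;

	fi_mr_regattr(domain, &attr, 0, &mr);	/* populates the MR cache */
	fi_close(&mr->fid);			/* app is done with the MR */
	cudaFree(buf);				/* frees the device memory */

	/*
	 * At this point the cached registration for buf is still present;
	 * it is only evicted when a later registration overlaps it or the
	 * cache is flushed.
	 */
}
```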

Expected behavior
The CUDA memory monitor should monitor calls to cudaFree and remove a freed region from the cache.

The memory monitor doesn't allocate the memory, so it can't call free. I agree there's a problem here. Ze has the same issue. AFAICT, only RoCR has the support needed to release the registration by intercepting the free call. Otherwise, the registered memory will remain in the cache until it is flushed.


I think we need a new type of CUDA memory monitor that intercepts calls to cudaMalloc/cudaFree, like a CUDA version of memhooks.

We just need to extend the existing monitor. How do you intercept the malloc/free calls?


I think we can make symbols like cudaFree and cudaMalloc point to a new function.

Like what we did for the old memhooks monitor (not the new one based on patcher).
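
Something along the lines of the old memhooks-style symbol interposition, sketched below: provide our own cudaFree that notifies the cache before calling the real implementation. `notify_cuda_free()` is a hypothetical hook into the memory monitor, not an existing libfabric function, and as noted in the next comment this approach misses calls that nvcc resolved statically.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>

/*
 * Hypothetical hook: in a real monitor this would evict any cached
 * registration covering the freed allocation.  Stubbed out here.
 */
static void notify_cuda_free(void *ptr)
{
	(void) ptr;
}

/* Interposed cudaFree: notify the cache, then forward to the real symbol. */
cudaError_t cudaFree(void *devPtr)
{
	static cudaError_t (*real_cudaFree)(void *);

	if (!real_cudaFree)
		real_cudaFree = (cudaError_t (*)(void *))
				dlsym(RTLD_NEXT, "cudaFree");

	if (devPtr)
		notify_cuda_free(devPtr);

	return real_cudaFree(devPtr);
}
```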

There was a discussion a few years ago about intercepting instead of relying on buffer IDs. I think the issue with applying the old memhooks interception approach to cudaFree/cudaMalloc was that it could not handle applications compiled by nvcc, where the cudaFree/cudaMalloc calls were statically resolved.

Previous issue where intercepting was discussed: #5789

I think adding some context would help here. As far as I understand, with the mr_map code, the MR keys are used for lookup. This can cause an issue when looking up a stale key (a new registration gets a new key), but for the wrong region.

There isn't a problem accessing a stale region.

The problem is that the region, even once freed by the app, remains registered. This results in the memory not actually being released, which can result in future allocations failing. A stale region is not released until the region either falls off the LRU list or we register a new region that overlaps with its memory address.

RoCR has a call to handle this. Both CUDA and ZE leave a gap here.

Sorry if I missed this point. As long as we leave the buffer ID checks in place, intercepting cudaFree to also do evictions could work.
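
To make the combined idea concrete, here is a sketch of the eviction side, assuming the intercepted cudaFree can recover the allocation's base and length (e.g. via cuMemGetAddressRange) before the real free runs. The flat array standing in for the MR cache and the helper names are hypothetical; the buffer-ID check stays in place as the safety net for frees the interceptor cannot see.

```c
#include <stddef.h>
#include <stdint.h>

struct region {
	uintptr_t	addr;
	size_t		len;
	int		in_use;
};

#define CACHE_SIZE 128
static struct region cache[CACHE_SIZE];	/* stand-in for the MR cache */

static void remove_entry(struct region *r)
{
	/* A real implementation would deregister the MR here. */
	r->in_use = 0;
}

/* Called from the intercepted cudaFree with the allocation's base/length. */
static void evict_on_cuda_free(uintptr_t base, size_t len)
{
	for (int i = 0; i < CACHE_SIZE; i++) {
		struct region *r = &cache[i];

		if (r->in_use && r->addr < base + len &&
		    base < r->addr + r->len)
			remove_entry(r);
	}
}
```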

This issue is stale because it has been open 360 days with no activity. Remove stale label or comment, otherwise it will be closed in 7 days.