[CUDA] Optimize GPU memory allocation/deallocation in CUDA Operators such as neighbor sampling.
baoleai opened this issue
🚀 The feature, motivation and pitch
GLT's CUDA operators, such as GPU neighbor sampling, contain numerous cudaMalloc/cudaFree calls, which can hurt performance. One potential solution is to implement a GPU memory pool that manages allocation and deallocation, instead of calling cudaMalloc(Async)/cudaFree(Async) directly.
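For illustration, this is the kind of per-call pattern the issue is describing (the function name and parameters are hypothetical, not GLT's actual code):

```cpp
#include <cuda_runtime.h>

// Hypothetical sampling entry point: every call pays for a device
// allocation and a free, and cudaFree implicitly synchronizes the device.
void sample_once(const int64_t* seeds, int64_t num_seeds, int64_t fanout) {
  void* out = nullptr;
  size_t nbytes = num_seeds * fanout * sizeof(int64_t);
  cudaMalloc(&out, nbytes);  // driver call on every invocation
  // ... launch the neighbor-sampling kernel writing into `out` ...
  cudaFree(out);             // waits for outstanding work before freeing
}
```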
Alternatives
No response
Additional context
No response
We can use PyTorch's memory management interface directly; see https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.h
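A minimal sketch of what that could look like, assuming the operators can depend on c10: `raw_alloc`/`raw_delete` route the request through PyTorch's caching pool, so freed blocks are reused rather than returned to the driver. The function name and parameters below are hypothetical.

```cpp
#include <c10/cuda/CUDACachingAllocator.h>

// Hypothetical sampling entry point: scratch memory now comes from
// PyTorch's caching allocator instead of cudaMalloc/cudaFree.
void sample_neighbors(const int64_t* seeds, int64_t num_seeds, int64_t fanout) {
  size_t nbytes = num_seeds * fanout * sizeof(int64_t);

  // Served from the cache when a suitable block exists;
  // cudaMalloc is only hit on a cache miss.
  void* out = c10::cuda::CUDACachingAllocator::raw_alloc(nbytes);

  // ... launch the neighbor-sampling kernel writing into `out` ...

  // Returns the block to the cache for reuse; no cudaFree,
  // and therefore no implicit device synchronization.
  c10::cuda::CUDACachingAllocator::raw_delete(out);
}
```

Another option with the same effect is to allocate the scratch buffer as a CUDA tensor (e.g. via `torch::empty`), which goes through the same caching allocator and handles the lifetime automatically.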