deepseek-ai / DeepSeek-Coder-V2

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence


OutOfMemoryError: CUDA out of memory on RunPod

loyal812 opened this issue

Description:
While running DeepSeek-Coder-V2 on RunPod, I hit a CUDA out-of-memory error. PyTorch attempted to allocate only 20.00 MiB, but GPU 0 had just 2.25 MiB free out of a total capacity of 23.67 GiB. The process was using 23.66 GiB, of which 19.78 GiB was allocated by PyTorch and 3.69 GiB was reserved by PyTorch but unallocated. The fact that a tiny 20.00 MiB request failed while nearly 3.7 GiB sat reserved-but-unallocated points to allocator fragmentation, which is exactly the case the error message's max_split_size_mb hint is meant to address.

Error Message:

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 23.67 GiB of which 2.25 MiB is free. Process 4063579 has 23.66 GiB memory in use. Of the allocated memory 19.78 GiB is allocated by PyTorch, and 3.69 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
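As the error message itself suggests, a first mitigation to try is tuning the CUDA caching allocator through the PYTORCH_CUDA_ALLOC_CONF environment variable. Below is a minimal, hedged sketch of setting max_split_size_mb from Python before PyTorch initializes CUDA; the value 128 is an arbitrary starting point to experiment with, not a setting recommended by the DeepSeek team, and the variable can equally be exported in the shell that launches the script (PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python run.py).

    import os

    # Must be set before the caching allocator is initialized, i.e. before
    # the first CUDA allocation (safest: before importing torch at all).
    # It caps the block size the allocator will split, which reduces
    # fragmentation at some cost in allocation flexibility.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch

    # Sanity check: confirm the setting is visible to this process
    # and that the GPU is reachable.
    print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
    print(torch.cuda.is_available())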

Steps to Reproduce:

  1. Deploy DeepSeek-Coder-V2 on RunPod.
  2. Begin model training or inference.
  3. Monitor GPU memory allocation (see the monitoring snippet below).
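For step 3, a small helper like the following can log allocator statistics around the suspected allocation site. This is a generic sketch built on standard torch.cuda calls, not code from the DeepSeek repo, and the function name log_gpu_memory is made up for illustration.

    import torch

    def log_gpu_memory(tag: str, device: int = 0) -> None:
        # Bytes currently held by live tensors.
        allocated = torch.cuda.memory_allocated(device)
        # Bytes reserved by the caching allocator (allocated + cached).
        reserved = torch.cuda.memory_reserved(device)
        total = torch.cuda.get_device_properties(device).total_memory
        gib = 1024 ** 3
        print(f"[{tag}] allocated={allocated / gib:.2f} GiB "
              f"reserved={reserved / gib:.2f} GiB "
              f"total={total / gib:.2f} GiB")

    # Example: bracket the forward pass or generation call.
    log_gpu_memory("before forward")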

Expected Behavior:
The model should run without exhausting GPU memory; ideally, the loading code would either manage memory more efficiently or document a supported way to cap memory usage (one possible workaround is sketched below).
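On a 24 GiB card, one common community workaround (not an official DeepSeek recommendation) is to load a smaller variant of the model in 4-bit quantization via Hugging Face transformers and bitsandbytes. The sketch below assumes the DeepSeek-Coder-V2-Lite-Instruct checkpoint and requires the accelerate and bitsandbytes packages; the model id, dtype, and prompt are assumptions to adapt, not settings verified against this issue.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"  # assumed checkpoint

    # 4-bit weights with bf16 compute: roughly quarters weight memory vs fp16.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",       # let accelerate place layers across devices
        trust_remote_code=True,  # DeepSeek-V2 uses a custom architecture
    )

    inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))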

Environment:

  • RunPod Platform
  • DeepSeek Coder Version: v2
  • CUDA Version: 11.8.0
  • PyTorch Version: 2.1.0
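For completeness, the reported versions can be double-checked from inside the pod with a few standard PyTorch calls; this is a generic environment check, nothing DeepSeek-specific.

    import torch

    print(torch.__version__)    # expected: 2.1.0
    print(torch.version.cuda)   # expected: 11.8
    props = torch.cuda.get_device_properties(0)
    print(props.name, props.total_memory / 1024 ** 3, "GiB")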