'RuntimeError: CUDA error: an illegal memory access was encountered' with large batch size of GPT2-example
Gy-Lu opened this issue · comments
🐛 Describe the bug
When I ran gpt2-vanilla with a batch size of 64, I got a CUDA error: RuntimeError: CUDA error: an illegal memory access was encountered.
Then I printed the GPU memory usage. At the second iteration, the max allocated memory was 74GB (from torch.cuda.max_memory_allocated), and then the error happened, even though the currently allocated memory was no more than 50GB (from torch.cuda.memory_allocated).
The same error also occurs with gpt2-zero3.
I think the peak memory usage exceeded the GPU's capacity, even though the total memory allocated at any one time did not.
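To reproduce the measurement described above, here is a minimal sketch of how the two numbers can be printed side by side; `report_cuda_memory` is a hypothetical helper, not part of the example code:

```python
import torch

def report_cuda_memory(tag=""):
    """Print current vs. peak allocated CUDA memory in GiB."""
    if not torch.cuda.is_available():
        print(f"[{tag}] CUDA not available")
        return None, None
    # memory_allocated: bytes held by live tensors right now.
    allocated = torch.cuda.memory_allocated() / 2**30
    # max_memory_allocated: high-water mark since the last reset,
    # which can be far above the current value during backward.
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"[{tag}] allocated: {allocated:.2f} GiB, peak: {peak:.2f} GiB")
    return allocated, peak
```

Calling this once per training step, followed by `torch.cuda.reset_peak_memory_stats()`, shows each iteration's own peak and makes a transient spike like the 74GB one visible.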
This bug may be fixed by a PyTorch update :)
Environment
CUDA/11.3.1
NCCL/2.9.6
Python/3.8.12
PyTorch/1.10.1+cu113
I have also tried setting PYTORCH_NO_CUDA_MEMORY_CACHING=1, but it still fails.
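For reference, the variable just needs to be set in the environment of the training process; this sketch only demonstrates that the setting is picked up (substitute the actual gpt2-vanilla launch command for the `python -c` line):

```shell
# Disable PyTorch's caching allocator so every allocation goes straight
# to cudaMalloc, which helps rule out allocator caching as the cause.
export PYTORCH_NO_CUDA_MEMORY_CACHING=1
python -c "import os; print(os.environ.get('PYTORCH_NO_CUDA_MEMORY_CACHING'))"
```

Note that disabling the cache makes every allocation synchronous and slow, so it is only useful for debugging, not for regular training runs.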