tensorflow / tpu

Reference models and tools for Cloud TPUs.

Home Page: https://cloud.google.com/tpu/

Running into CUDA OutOfMemory error

M0E313 opened this issue

I always get a CUDA OutOfMemory error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.50 GiB. GPU 0 has a total capacity of 21.99 GiB of which 6.62 GiB is free. Including non-PyTorch memory, this process has 15.35 GiB memory in use. Of the allocated memory 6.15 GiB is allocated by PyTorch, and 8.71 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
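To make the numbers in that message easier to compare, here is a minimal sketch that pulls the figures out of the error text and does the arithmetic. The helper name `parse_oom` and the regular expressions are assumptions written for this particular message wording; other PyTorch versions may phrase it differently.

```python
import re

# The message text from the traceback above (without the trailing hint).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 7.50 GiB. "
    "GPU 0 has a total capacity of 21.99 GiB of which 6.62 GiB is free. "
    "Including non-PyTorch memory, this process has 15.35 GiB memory in use. "
    "Of the allocated memory 6.15 GiB is allocated by PyTorch, "
    "and 8.71 GiB is reserved by PyTorch but unallocated."
)


def parse_oom(message: str) -> dict:
    """Extract the GiB figures from an OOM message (hypothetical helper)."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"total capacity of ([\d.]+) GiB",
        "free": r"of which ([\d.]+) GiB is free",
        "in_use": r"this process has ([\d.]+) GiB memory in use",
        "allocated": r"([\d.]+) GiB is allocated by PyTorch",
        "reserved_unallocated": r"([\d.]+) GiB is reserved by PyTorch",
    }
    return {
        name: float(re.search(pattern, message).group(1))
        for name, pattern in patterns.items()
    }


stats = parse_oom(OOM_MESSAGE)

# How much the request exceeds the memory the driver reports as free.
shortfall = stats["requested"] - stats["free"]

# Memory used by the process outside PyTorch's allocator (CUDA context etc.).
non_pytorch = stats["in_use"] - stats["allocated"] - stats["reserved_unallocated"]

# Free memory plus PyTorch's cached-but-unused memory.
potentially_available = stats["free"] + stats["reserved_unallocated"]

print(f"shortfall vs. free memory:   {shortfall:.2f} GiB")
print(f"non-PyTorch memory in use:   {non_pytorch:.2f} GiB")
print(f"free + reserved-unallocated: {potentially_available:.2f} GiB")
```

The request is larger than the free memory but smaller than free plus reserved-but-unallocated memory, which is the situation in which the message suggests fragmentation may be involved. This is only a reading of the reported numbers, not a confirmed diagnosis.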

I have already reduced the batch size and placed torch.cuda.empty_cache() calls throughout my script, but that is still not enough.
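For reference, the allocator setting named in the error message has to be in place before the first CUDA allocation, so it is usually set before `torch` is imported. The sketch below shows one way to do that and to print the allocator's own counters. The helper name `report_cuda_memory` is hypothetical, and whether `expandable_segments` actually helps in this case is an assumption that would need testing. Note also that `torch.cuda.empty_cache()` only releases cached blocks that are not in use; it does not free memory held by tensors that are still referenced.

```python
import os

# Set before importing torch so the caching allocator picks it up.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


def report_cuda_memory(device: int = 0) -> dict:
    """Return PyTorch's allocated/reserved memory in GiB.

    Hypothetical helper: returns an empty dict when torch is not
    installed or no CUDA device is available.
    """
    try:
        import torch
    except ImportError:
        return {}
    if not torch.cuda.is_available():
        return {}
    gib = 1024 ** 3
    return {
        # Memory occupied by live tensors.
        "allocated_gib": torch.cuda.memory_allocated(device) / gib,
        # Memory held by the caching allocator (live tensors plus cache).
        "reserved_gib": torch.cuda.memory_reserved(device) / gib,
    }


print(report_cuda_memory())
```

Calling `report_cuda_memory()` before and after a training step can show whether allocated memory keeps growing (tensors still referenced) or whether only the reserved figure is large (cache and possible fragmentation).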

### I'm using:

pip list | grep cuda

nvidia-cuda-cupti-cu11 11.8.87
nvidia-cuda-nvrtc-cu11 11.8.89
nvidia-cuda-runtime-cu11 11.8.89

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

pip list | grep torch

pytorch-lightning 2.1.2
pytorch-triton 3.0.0+989adb9a29
torch 2.2.1+cu118
torchaudio 2.2.1+cu118
torchmetrics 1.3.2
torchvision 0.17.1+cu118