microsoft / i-Code

CoDi: CUDA runs out of memory during inference tasks

PHOENIXFURY007 opened this issue

I was trying to run the demo notebook on an NVIDIA A100 80 GB. While loading the model from the checkpoints, I get this error:
#######################
Running in eps mode
#######################

making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Load pretrained weight from ['CoDi_encoders.pth', 'CoDi_text_diffuser.pth', 'CoDi_audio_diffuser_m.pth', 'CoDi_video_diffuser_8frames.pth']

RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 23.70 GiB total capacity; 17.10 GiB already allocated; 3.56 MiB free; 17.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Can you let me know how to solve this issue?

I checked with nvidia-smi for other running processes, but found none:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.76       Driver Version: 515.76       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:9E:00.0 Off |                    0 |
| N/A   33C    P0    46W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
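
One detail in the output above: the error message reports 23.70 GiB total capacity for GPU 0, while nvidia-smi reports 81920MiB for the A100. The sketch below only puts those two figures side by side and, if PyTorch is installed, lists the devices that PyTorch itself sees from inside the notebook process. The helper name `parse_oom` is hypothetical, and this does not by itself explain the mismatch; it is just one way to check whether the notebook kernel is running on the card shown by nvidia-smi.

```python
import re


def parse_oom(message: str) -> dict:
    """Extract the GiB figures from a PyTorch CUDA out-of-memory message."""
    patterns = {
        "total_gib": r"([\d.]+) GiB total capacity",
        "allocated_gib": r"([\d.]+) GiB already allocated",
        "reserved_gib": r"([\d.]+) GiB reserved in total",
    }
    stats = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, message)
        stats[key] = float(match.group(1)) if match else None
    return stats


# First error message from this issue (shortened to the part with the numbers).
oom = (
    "CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 23.70 GiB total "
    "capacity; 17.10 GiB already allocated; 3.56 MiB free; 17.49 GiB reserved "
    "in total by PyTorch)"
)
stats = parse_oom(oom)

# Total memory from the nvidia-smi table above, converted from MiB to GiB.
smi_total_gib = 81920 / 1024

# Flag a difference of more than 1 GiB between the two reports.
mismatch = abs(smi_total_gib - stats["total_gib"]) > 1.0
print(f"PyTorch error: {stats['total_gib']} GiB, nvidia-smi: {smi_total_gib} GiB, mismatch: {mismatch}")

# Optional: ask PyTorch directly which devices the current process can see.
try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        print(index, props.name, f"{props.total_memory / 1024**3:.2f} GiB")
```

Running the last part inside the same notebook kernel that raises the error is what makes the comparison meaningful, since nvidia-smi may have been run in a different shell or on a different node.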

I was able to load the checkpoints, but when I tried Text to Video + Audio, I got the same error as before:
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 23.70 GiB total capacity; 21.25 GiB already allocated; 416.56 MiB free; 21.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
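
The error text suggests `max_split_size_mb` when reserved memory is much larger than allocated memory. A minimal sketch of how that setting could be applied through `PYTORCH_CUDA_ALLOC_CONF` follows. The value `128` is an arbitrary illustration, not something recommended in this issue or by the CoDi authors. In the message above the gap between reserved and allocated memory is small, so fragmentation may not be the main cause here.

```python
import os

# Figures copied from the second error message above, in GiB.
allocated_gib = 21.25
reserved_gib = 21.66

# Memory held by PyTorch's caching allocator but not currently backing tensors.
cached_unused_gib = round(reserved_gib - allocated_gib, 2)
print(f"reserved - allocated = {cached_unused_gib} GiB")

# Assumption: 128 is an illustrative value only.
# Set this before the first CUDA allocation; the simplest way is to set it
# before importing torch in the notebook.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```

In a notebook this cell would need to run first, before the cell that imports torch and loads the checkpoints, and the kernel would need a restart if CUDA was already initialized.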

Is there any way to run all of the inference tasks on a single A100 80 GB?