Training on a model other than SD 1.5
dho799 opened this issue · comments
Hi, I'm trying to run the Jupyter notebook on RunPod, but instead of downloading the 1.5 model from Hugging Face, I'm downloading dreamlike-diffusion-1.0 as the model.ckpt file. It downloads fine, but when I start training, I get this error:
RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 23.70 GiB total capacity; 21.10 GiB already allocated; 211.69 MiB free; 21.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Is this because training on models other than SD 1.5 is not supported in the notebook? If so, how can I make adjustments so that I can train on a model other than SD 1.5?
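For anyone reading the error text: the hint about max_split_size_mb only applies when reserved memory is much larger than allocated memory. A minimal sketch for checking that gap from a pasted message follows; the regex and the helper name `reserved_minus_allocated_gib` are illustrative assumptions, not part of the notebook or of PyTorch, and the pattern only handles messages that report these figures in GiB.

```python
import re

# Hypothetical helper: pulls the GiB figures out of a PyTorch OOM message.
# Assumes the "X GiB total capacity; Y GiB already allocated; ...; Z GiB reserved"
# wording quoted in this thread.
OOM_PATTERN = re.compile(
    r"(?P<total>[\d.]+) GiB total capacity; "
    r"(?P<allocated>[\d.]+) GiB already allocated; "
    r".*?(?P<reserved>[\d.]+) GiB reserved",
    re.DOTALL,
)


def reserved_minus_allocated_gib(message):
    """Return reserved - allocated in GiB, or None if the message does not match."""
    match = OOM_PATTERN.search(message)
    if match is None:
        return None
    return round(float(match["reserved"]) - float(match["allocated"]), 2)


message = (
    "RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB "
    "(GPU 0; 23.70 GiB total capacity; 21.10 GiB already allocated; "
    "211.69 MiB free; 21.49 GiB reserved in total by PyTorch)"
)
gap = reserved_minus_allocated_gib(message)
print(f"reserved - allocated = {gap} GiB")
```

For the message above the gap is well under 1 GiB on a card with 23.70 GiB, so fragmentation is probably not the main cause here; the GPU looks genuinely full.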
This should work fine on most 1.5 models. Can you try with the latest updates?
I am having the same issue, but with the recommended model.
When I start training, I get the following error:
```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.70 GiB total capacity; 22.15 GiB already allocated; 16.38 MiB free; 22.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
I tried setting the environment variable, but it still doesn't work:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
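One thing worth ruling out with the snippet above: PYTORCH_CUDA_ALLOC_CONF is read when PyTorch sets up its CUDA caching allocator, so it needs to be in the environment before the first CUDA allocation in the process that actually trains. A minimal sketch, assuming the training runs in the same process or in a child process started afterwards; the helper name `configure_cuda_allocator` is hypothetical, and the `torch` import is optional so the cell also runs where PyTorch is missing.

```python
import os


def configure_cuda_allocator(max_split_size_mb=512):
    """Set PYTORCH_CUDA_ALLOC_CONF; call this before any CUDA work starts."""
    value = f"max_split_size_mb:{max_split_size_mb}"
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = value
    return value


# Set the variable first, then import torch and touch the GPU.
configure_cuda_allocator(512)

try:
    import torch
except ImportError:
    torch = None  # PyTorch not installed in this environment

if torch is not None and torch.cuda.is_available():
    # mem_get_info() returns (free_bytes, total_bytes) for the current device.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"free: {free_bytes / 2**30:.2f} GiB of {total_bytes / 2**30:.2f} GiB")
else:
    print("CUDA not available here; only the environment variable was set.")
```

If the free figure is already low before training starts, another process or a previous notebook kernel may still be holding GPU memory, and the allocator setting will not help.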
I tested the JoePenna repo on RunPod and Vast.ai templates.
Vast.ai definitely seems more robust.
RUNPOD

| Docker image | Training works? |
| --- | --- |
| runpod/pytorch:3.10-2.0.0-117 | No (out of memory error) |
| runpod/pytorch-3.10-1.13.1-116 | Yes |
| runpod/pytorch-3.9-1.13.1-116 | No (ModuleNotFoundError: No module named 'taming') |
| runpod/pytorch-latest (python=3.7, torch=1.12.0) | No (AttributeError: 'str' object has no attribute 'name' in cell: Dreambooth Training Environment Setup) |

VAST.AI

| Docker image | Training works? |
| --- | --- |
| pytorch:latest (python=3.10.8, torch=1.13.1) | Yes |
| pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime (python3.10.9) | Yes |
| pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime (python3.10.8) | Yes |
| pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime (python3.9.2) | Yes |
| pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime (python3.7.13) | Yes |
| pytorch/pytorch:1.12.0-cuda11.3-cudnn8-runtime (python3.7.13) | Yes |
| pytorch/pytorch:1.11.0-cuda11.3-cudnn8-runtime (python3.8.12) | Yes |
| pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime (python3.7.11) | Yes |
Training seems to work when the Docker image is set to runpod/pytorch, as recommended in the README.md.
runpod/pytorch produces the same env as runpod/pytorch:latest (torch 1.12.0, python 3.7.13) and produces the same error "AttributeError: 'str' object has no attribute 'name'" in the Training Setup cell.
runpod/pytorch-3.10-1.13.1-116, however, does seem to work.
This applies to the latest updated notebook. If you are running a different or older version, results may differ.
runpod/pytorch-3.10-1.13.1-116 works for me!
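Since the results in this thread depend on which Python and torch versions a template actually ships, it may help to print them in the first notebook cell before training. A minimal sketch; the helper name `describe_environment` is hypothetical and is not part of the notebook.

```python
import sys


def describe_environment():
    """Collect the Python and (if installed) torch versions of this kernel."""
    info = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    try:
        import torch
    except ImportError:
        info["torch"] = None  # PyTorch not installed in this environment
    else:
        info["torch"] = torch.__version__
        info["cuda_build"] = torch.version.cuda  # CUDA version torch was built with
    return info


for name, value in describe_environment().items():
    print(f"{name}: {value}")
```

Comparing the output with the combinations reported above (for example python 3.10 with torch 1.13.1 on RunPod) shows quickly whether a pod matches a setup that others reported as working.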