Training on a model other than SD 1.5

Question

dho799 opened this issue a year ago

Hi, I'm trying to run the Jupyter notebook on RunPod, but instead of downloading the 1.5 model from Hugging Face, I'm downloading dreamlike-diffusion-1.0 as the model.ckpt file. It downloads fine, but when I start training, I get this error:

RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 23.70 GiB total capacity; 21.10 GiB already allocated; 211.69 MiB free; 21.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Is this because training on models other than SD 1.5 is not supported in the notebook? If so, how can I make adjustments so that I can train on a model other than SD 1.5?
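
A side note on reading the error: the hint about max_split_size_mb only applies when reserved memory is much larger than allocated memory. The sketch below pulls the figures out of the message and compares them. `parse_oom` is a hypothetical helper written for this thread, not something the notebook or PyTorch provides.

```python
import re

# Hypothetical helper (not part of the notebook or of PyTorch): extracts the
# memory figures from a PyTorch "CUDA out of memory" message.
_FIELDS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "total": r"([\d.]+) (MiB|GiB) total capacity",
    "allocated": r"([\d.]+) (MiB|GiB) already allocated",
    "free": r"([\d.]+) (MiB|GiB) free",
    "reserved": r"([\d.]+) (MiB|GiB) reserved in total",
}


def parse_oom(message: str) -> dict:
    """Return the figures found in an OOM message, all converted to MiB."""
    figures = {}
    for name, pattern in _FIELDS.items():
        match = re.search(pattern, message)
        if match:
            value, unit = float(match.group(1)), match.group(2)
            figures[name] = value * 1024 if unit == "GiB" else value
    return figures


message = (
    "RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB "
    "(GPU 0; 23.70 GiB total capacity; 21.10 GiB already allocated; "
    "211.69 MiB free; 21.49 GiB reserved in total by PyTorch)"
)
figures = parse_oom(message)

# Memory the caching allocator holds but has not handed out to tensors.
slack_mib = figures["reserved"] - figures["allocated"]
print(f"reserved - allocated = {slack_mib:.0f} MiB")
```

Here reserved minus allocated comes to roughly 0.39 GiB on a 23.70 GiB card, which suggests the GPU is genuinely close to full rather than badly fragmented.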

David B. · Answer 1 · Fri Apr 07 2023 06:44:27 GMT+0800 (China Standard Time)

This should work fine on most 1.5 models. Can you try with the latest updates?

capihacendado · Answer 2 · Sat Apr 08 2023 07:30:35 GMT+0800 (China Standard Time)

I am having the same issue, but with the recommended model.

When I start training, I get the following error:
```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.70 GiB total capacity; 22.15 GiB already allocated; 16.38 MiB free; 22.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

I tried setting the environment variable, but it still does not work:

```python
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
```
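
One thing worth checking with the snippet above is when it runs. To my understanding, PyTorch reads PYTORCH_CUDA_ALLOC_CONF when its CUDA caching allocator starts up, so the variable has to be in place before the process first uses the GPU. In a notebook that usually means running it in a fresh kernel, before anything imports torch. The sketch below uses "torch is already imported" as a conservative stand-in for "CUDA may already be initialised"; `set_alloc_conf` is a hypothetical helper, not part of the notebook.

```python
import os
import sys


def set_alloc_conf(max_split_size_mb: int = 512) -> bool:
    """Set PYTORCH_CUDA_ALLOC_CONF for this process and any child processes.

    Returns True if torch had not been imported yet, i.e. the setting was
    put in place before PyTorch could have initialised CUDA.
    """
    if max_split_size_mb <= 0:
        raise ValueError("max_split_size_mb must be positive")
    torch_already_imported = "torch" in sys.modules
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = f"max_split_size_mb:{max_split_size_mb}"
    return not torch_already_imported


if not set_alloc_conf(512):
    # Assumption: once torch is loaded the allocator may already be running,
    # so the safest course is to restart the kernel and set the variable first.
    print("torch is already imported; restart the kernel and run this cell first")
```

Even when it is applied in time, this setting only helps with fragmentation. In the error above, reserved (22.38 GiB) and allocated (22.15 GiB) are nearly equal, so it may not make a difference here.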

David B. · Answer 3 · Sat Apr 08 2023 07:39:08 GMT+0800 (China Standard Time)

I tested the JoePenna repo on RunPod and Vast.ai templates. Each image below is followed by whether training worked.
Vast.ai definitely seems more robust.

RUNPOD

runpod/pytorch:3.10-2.0.0-117
No (out of memory error)

runpod/pytorch-3.10-1.13.1-116
Yes

runpod/pytorch-3.9-1.13.1-116
No (ModuleNotFoundError: No module named ‘taming’)

runpod/pytorch-latest (python=3.7, torch=1.12.0)
No (AttributeError: ‘str’ object has no attribute ‘name’ in cell "Dreambooth Training Environment Setup")

VAST.AI

pytorch:latest (python=3.10.8, torch=1.13.1)
Yes

pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime (python3.10.9)
Yes

pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime (python3.10.8)
Yes

pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime (python3.9.2)
Yes

pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime (python3.7.13)
Yes

pytorch/pytorch:1.12.0-cuda11.3-cudnn8-runtime (python3.7.13)
Yes

pytorch/pytorch:1.11.0-cuda11.3-cudnn8-runtime (python3.8.12)
Yes

pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime (python3.7.11)
Yes
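
For anyone who wants to filter these results, here is a minimal sketch that records the list above as data. The image names, outcomes and error notes are transcribed from this post; the `RESULTS` layout and the `working_images` helper are hypothetical.

```python
# (platform, image, training worked?, note) transcribed from the list above.
RESULTS = [
    ("RunPod", "runpod/pytorch:3.10-2.0.0-117", False, "out of memory error"),
    ("RunPod", "runpod/pytorch-3.10-1.13.1-116", True, ""),
    ("RunPod", "runpod/pytorch-3.9-1.13.1-116", False,
     "ModuleNotFoundError: No module named 'taming'"),
    ("RunPod", "runpod/pytorch-latest", False,
     "AttributeError: 'str' object has no attribute 'name'"),
    ("Vast.ai", "pytorch:latest", True, ""),
    ("Vast.ai", "pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime", True, ""),
    ("Vast.ai", "pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime", True, ""),
    ("Vast.ai", "pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime", True, ""),
    ("Vast.ai", "pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime", True, ""),
    ("Vast.ai", "pytorch/pytorch:1.12.0-cuda11.3-cudnn8-runtime", True, ""),
    ("Vast.ai", "pytorch/pytorch:1.11.0-cuda11.3-cudnn8-runtime", True, ""),
    ("Vast.ai", "pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime", True, ""),
]


def working_images(platform: str) -> list:
    """Return the images on `platform` where training worked."""
    return [image for p, image, worked, _ in RESULTS if p == platform and worked]


print(working_images("RunPod"))
```

On RunPod only one of the four tested images worked, while all eight Vast.ai images did.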

capihacendado · Answer 4 · Sat Apr 08 2023 08:04:53 GMT+0800 (China Standard Time)

Training seems to work when the Docker image is set to runpod/pytorch, as recommended in the README.md

yushan777 · Answer 5 · Sat Apr 08 2023 17:43:40 GMT+0800 (China Standard Time)

Training seems to work when the Docker image is set to runpod/pytorch, as recommended in the README.md

runpod/pytorch produces the same env as runpod/pytorch:latest (torch 1.12.0, python 3.7.13) and produces the same error "AttributeError: 'str' object has no attribute 'name'" in the Training Setup cell.

runpod/pytorch-3.10-1.13.1-116 however does seem to work.

This applies to the latest version of the notebook. If you are running a different or older version, results may differ.
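
If you are unsure which environment a template actually gave you, a quick check from a notebook cell can help. This is only a sketch: the "known good" pair (Python 3.10, torch 1.13.1) is my reading of the runpod/pytorch-3.10-1.13.1-116 tag, and both function names are hypothetical.

```python
import platform


def describe_environment() -> dict:
    """Report the Python and torch versions of the running interpreter."""
    try:
        import torch
        torch_version = str(torch.__version__)
    except ImportError:
        torch_version = None  # torch is not installed in this environment
    return {"python": platform.python_version(), "torch": torch_version}


def is_known_good(env: dict, python_prefix: str = "3.10",
                  torch_prefix: str = "1.13.1") -> bool:
    """True if env matches the combination reported to work on RunPod.

    Assumption: the image tag encodes the Python and torch versions.
    """
    if env["torch"] is None:
        return False
    return (env["python"].startswith(python_prefix)
            and env["torch"].startswith(torch_prefix))


env = describe_environment()
print(env, "known good" if is_known_good(env) else "untested combination")
```

A mismatch does not mean training will fail, only that the combination is not the one reported to work on RunPod in this thread.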

dho799 · Answer 6 · Sun Apr 09 2023 13:52:44 GMT+0800 (China Standard Time)

runpod/pytorch-3.10-1.13.1-116 works for me!
