JoePenna / Dreambooth-Stable-Diffusion

Implementation of Dreambooth (https://arxiv.org/abs/2208.12242) by way of Textual Inversion (https://arxiv.org/abs/2208.01618) for Stable Diffusion (https://arxiv.org/abs/2112.10752). Tweaks focused on training faces, objects, and styles.

Training on a model other than SD 1.5

dho799 opened this issue · comments

Hi, I'm trying to run the Jupyter notebook on RunPod, but instead of downloading the 1.5 model from Hugging Face, I'm downloading dreamlike-diffusion-1.0 as the model.ckpt file. It downloads fine, but when I start training, I get this error: `RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 23.70 GiB total capacity; 21.10 GiB already allocated; 211.69 MiB free; 21.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF`

Is this because training on models other than SD 1.5 is not supported in the notebook? If so, how can I make adjustments so that I can train on a model other than SD 1.5?

This should work fine on most 1.5 models. Can you try with the latest updates?

I am having the same issue, but with the recommended model.

When I start training, I get the following error:
```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.70 GiB total capacity; 22.15 GiB already allocated; 16.38 MiB free; 22.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
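The error text only recommends `max_split_size_mb` when reserved memory is much larger than allocated memory, so it can help to compare the two figures before changing allocator settings. The following is a minimal sketch, not part of the notebook; `parse_oom_message` is a hypothetical helper that pulls the sizes out of a message shaped like the one quoted in this thread.

```python
import re

# Conversion factors to GiB for the units that appear in the message.
_UNITS_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}

# Hypothetical helper: patterns follow the wording of the message quoted above.
_PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "total": r"([\d.]+) (MiB|GiB) total capacity",
    "allocated": r"([\d.]+) (MiB|GiB) already allocated",
    "free": r"([\d.]+) (MiB|GiB) free",
    "reserved": r"([\d.]+) (MiB|GiB) reserved",
}


def parse_oom_message(message):
    """Return the sizes reported in a CUDA OOM message, converted to GiB."""
    sizes = {}
    for name, pattern in _PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = float(match.group(1)) * _UNITS_TO_GIB[match.group(2)]
    return sizes


message = (
    "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB "
    "(GPU 0; 23.70 GiB total capacity; 22.15 GiB already allocated; 16.38 MiB free; "
    "22.38 GiB reserved in total by PyTorch)"
)
sizes = parse_oom_message(message)
gap = sizes["reserved"] - sizes["allocated"]
print(f"reserved - allocated = {gap:.2f} GiB")  # prints: reserved - allocated = 0.23 GiB
```

With roughly 0.23 GiB between reserved and allocated memory on a 23.70 GiB card, reserved is not much larger than allocated, so this particular failure looks more like the GPU simply being full than like fragmentation.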

I tried setting the environment variable, but it still doesn't work:

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
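One thing worth ruling out is ordering. The sketch below assumes training runs in the same Python process as the cell and sets the variable before `torch` is first imported, so that the setting is already in place when PyTorch initializes CUDA. `configure_allocator` is a hypothetical helper name, and the value 512 is simply the one tried above, not a recommendation.

```python
import os


def configure_allocator(max_split_size_mb=512):
    """Set PYTORCH_CUDA_ALLOC_CONF and return the value that was written."""
    value = f"max_split_size_mb:{max_split_size_mb}"
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = value
    return value


# Set the variable first, before torch is imported in this process.
configure_allocator(512)

try:
    import torch
except ImportError:
    torch = None  # torch is not installed in this environment

if torch is not None and torch.cuda.is_available():
    # Report what the caching allocator currently holds on the default GPU.
    print(torch.cuda.memory_summary())
```

If the kernel had already imported `torch` and touched the GPU before the variable was set, restarting the kernel and running this cell first is the safer test. Even then, the setting only addresses fragmentation and cannot free memory that is genuinely in use.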

I tested the JoePenna repo on RunPod and Vast.ai templates.
Vast.ai definitely seems more robust.

RUNPOD

| Docker image | Training works? |
| --- | --- |
| runpod/pytorch:3.10-2.0.0-117 | No (out of memory error) |
| runpod/pytorch-3.10-1.13.1-116 | Yes |
| runpod/pytorch-3.9-1.13.1-116 | No (ModuleNotFoundError: No module named 'taming') |
| runpod/pytorch-latest (python=3.7, torch=1.12.0) | No (AttributeError: 'str' object has no attribute 'name' in the "Dreambooth Training Environment Setup" cell) |

VAST.AI

| Docker image | Training works? |
| --- | --- |
| pytorch:latest (python=3.10.8, torch=1.13.1) | Yes |
| pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime (python3.10.9) | Yes |
| pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime (python3.10.8) | Yes |
| pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime (python3.9.2) | Yes |
| pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime (python3.7.13) | Yes |
| pytorch/pytorch:1.12.0-cuda11.3-cudnn8-runtime (python3.7.13) | Yes |
| pytorch/pytorch:1.11.0-cuda11.3-cudnn8-runtime (python3.8.12) | Yes |
| pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime (python3.7.11) | Yes |

Training seems to work when the Docker image is set to runpod/pytorch, as recommended in the README.md.

runpod/pytorch produces the same environment as runpod/pytorch:latest (torch 1.12.0, Python 3.7.13) and fails with the same error, "AttributeError: 'str' object has no attribute 'name'", in the Training Setup cell.

However, runpod/pytorch-3.10-1.13.1-116 does seem to work.

This applies to the latest version of the notebook. Results may differ with a different or older version.
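Since the results in this thread are reported per Python and torch version, it may help to print what a given pod actually provides before starting training. This is a small sketch rather than part of the notebook; `environment_report` is a hypothetical helper, and it reports `None` for torch if the package is not installed.

```python
import platform


def environment_report():
    """Collect the Python, torch, and CUDA versions visible to this process."""
    report = {"python": platform.python_version(), "torch": None, "cuda": None}
    try:
        import torch
    except ImportError:
        return report  # torch is not installed; leave its fields as None
    report["torch"] = torch.__version__
    report["cuda"] = torch.version.cuda  # None for CPU-only builds
    return report


print(environment_report())
```

The printed versions can then be compared against the image and version combinations reported as working or failing above.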

runpod/pytorch-3.10-1.13.1-116 works for me!