Stable Diffusers for studies

This is yet another Stable Diffusion compilation, aimed to be functional, clean & compact enough for various experiments. There's no GUI here, as the target audience is creative coders rather than post-Photoshop users. The latter may check InvokeAI or AUTOMATIC1111 as convenient production tools, or Deforum for precisely controlled animations.

The code is based on the diffusers library, with occasional additions from the others mentioned below. The following codebases are partially included here (to ensure compatibility and ease of setup): k-diffusion, CLIPseg, LPIPS.
There is also a similar repo (kinda obsolete now), based on the CompVis and Stability AI libraries.

Current functions:

Text to image
Image re- and in-painting
Various interpolations (between/upon images or text prompts, smoothed by latent blending)
Guidance with ControlNet (pose, depth, canny edges) and Instruct pix2pix
Smooth & stable video edit with TokenFlow
Text to video with ZeroScope and Potat models

Fine-tuning with your images:

Add subject (new token) with textual inversion
Add subject (new token + Unet delta) with custom diffusion
Add subject (new token + Unet delta) with LoRA (custom implementation)

Other features:

Memory efficient with xformers (hi-res on a 6 GB VRAM GPU)
Multi guidance technique for better interpolations
Use of special models: inpainting, SD v2, Kandinsky
Masking with text via CLIPseg
Weighted multi-prompts (with brackets or numerical weights)
to be continued..

Setup

Install CUDA 11.8 if you're on Windows (seems not necessary on Linux with Conda).
Set up the Conda environment:

conda create -n SD python=3.10 numpy pillow 
conda activate SD
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install xformers

NB: It's preferable to install the xformers library: it increases performance and lets you run SD at any resolution on lower-grade hardware (e.g. video cards with 6 GB VRAM). However, it's not guaranteed to work with all the (quickly changing) versions of PyTorch, hence it's separated from the rest of the requirements. If you're on Windows, first ensure that you have Visual Studio 2019 installed.
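
Since xformers is optional and version-sensitive, a quick look at the environment before running the scripts may save some guessing. A minimal sketch, not part of this repo (the helper name check_packages is made up for illustration), which only reports whether the packages can be found in the active environment, without importing them:

```python
import importlib.util

def check_packages(names=("torch", "torchvision", "xformers")):
    """Return {package name: True if it is importable in this environment}."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

if __name__ == "__main__":
    # Print a short report for the default package list
    for name, found in check_packages().items():
        print(f"{name:12s} {'ok' if found else 'missing'}")
```

If xformers shows up as missing, the rest of the setup should still work, just without the memory savings.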

Run the command below to download the Stable Diffusion 1.5, 1.5 Dreamlike Photoreal, 2-inpaint, 2.1, 2.1-v, ZeroScope, Potat, custom VAE, ControlNet, instruct-pix2pix and CLIPseg models (converted to float16 for faster loading). Licensing info is available on their webpages.

python download.py

Operations

Examples of usage:

Generate an image from the text prompt:

python src/gen.py --in_txt "hello world" --size 1024-576

Redraw directory of images:

python src/gen.py --in_img _in/pix -t "neon light glow" --strength 0.7

Inpaint directory of images with inpainting model, turning humans into robots:

python src/gen.py -im _in/pix --mask "human, person" -t "steampunk robot" --model 2i

Make a video (frame sequence), interpolating between the lines of the text file:

python src/latwalk.py -t yourfile.txt --size 1024-576

Same, with drawing over a masked image:

python src/latwalk.py -t yourfile.txt -im _in/pix/alex-iby-G_Pk4D9rMLs.jpg --mask "human boy" --invert_mask -m 2i

Same as above, with recursive pan/zoom motion (beware of possible imagery degradation on longer runs):

python src/recur.py -t yourfile.txt --fstep 5 --scale 0.01 -m 15drm

Hallucinate a video, including your real images:

python src/latwalk.py -im _in/pix --cfg_scale 0 -f 1

Interpolations can be made smoother (and faster) by adding the --latblend X option (latent blending technique, X in range 0~1). If needed, smooth the result further with FILM.
Models can be selected with the --model option by either a shortcut (15, 15drm, 21, 21v, ..), a path on the Hugging Face website (e.g. SG161222/Realistic_Vision_V2.0, which is auto-downloaded for further use) or a local path to the downloaded file set (or safetensors file).
Check other options and their shortcuts by running these scripts with the --help option.
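
To picture what interpolating between the lines of a text file means: each line becomes a key vector (a text embedding or a latent), and the frames in between are blends of two neighbouring keys. A rough sketch in plain Python, using spherical interpolation (slerp) as one common choice. This is an illustration only: whether the scripts use slerp or plain linear blending for a given tensor is not stated here, and the reading of fstep as "frames per pair of keys" is an assumption too.

```python
import math

def slerp(a, b, t):
    """Spherically interpolate between equal-length vectors a and b (t in 0..1)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        sin_omega = 0.0
    else:
        cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
        omega = math.acos(max(-1.0, min(1.0, cos_omega)))
        sin_omega = math.sin(omega)
    if abs(sin_omega) < 1e-6:
        # Degenerate angle (parallel, opposite or zero vectors): plain lerp
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]

def walk(keys, fstep):
    """Yield fstep frames per pair of consecutive keys, then the last key."""
    for a, b in zip(keys, keys[1:]):
        for i in range(fstep):
            yield slerp(a, b, i / fstep)
    yield list(keys[-1])
```

With three keys and fstep of 5, walk gives 11 vectors: 5 per transition plus the final key.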

There are also a few Windows bat files that slightly simplify and automate the commands.

Guide synthesis with ControlNet or Instruct pix2pix

Generate an image from existing one, using its depth map as conditioning (extra guiding source):

python src/preproc.py -i _in/something.jpg --type depth -o _in/depth
python src/gen.py --control_mod depth --control_img _in/depth/something.jpg -im _in/something.jpg -t "neon glow steampunk" -f 1

One can replace depth in the commands above with canny (edges) or pose (if there are humans in the source).
Option -im ... may be omitted to employ the "pure" txt2img method, pushing the result closer to the text prompt:

python src/preproc.py -i _in/something.jpg --type canny -o _in/canny
python src/gen.py --control_mod canny --control_img _in/canny/something.jpg -t "neon glow steampunk" --size 1024-512 --model 15drm

ControlNet options can be used for interpolations as well (fancy making videomapping over a building photo?):

python src/latwalk.py --control_mod canny --control_img _in/canny/something.jpg --control_scale 0.5 -t yourfile.txt --size 1024-512 --fstep 5

also with pan/zoom recursion:

python src/recur.py -cmod canny -cimg _in/canny/something.jpg -cts 0.5 -t yourfile.txt --size 1024-640 -fs 5 -is 12 --scale 0.02 -m 15drm

More ways to edit images

Instruct pix2pix:

python src/gen.py -im _in/pix --img_scale 2 -C 9 -t "turn human to puppet" --model 1p2p

TokenFlow (temporally stable!):

python src/tokenflow.py -im _in/yoursequence -t "rusty metallic sculpture" --batch_size 4 --batch_pivot --cpu

TokenFlow employs either pnp or sde method and can be used with various models & ControlNet options.
NB: this method handles all frames at once (that's why it's so stable). As such, it cannot consume long sequences by design. Pivots batching & CPU offloading (introduced in this repo) pushed the limits, yet didn't remove them. As an example, the most I managed to process was 300+ frames of 960x540 on a 3090 GPU in batches of 5 without OOM (or without going to the 10x slower shared RAM with new Nvidia drivers).

Fine-tuning

Train a new token embedding for a specific subject (e.g. cat) with textual inversion:

python src/train.py --token mycat1 --term cat --data data/mycat1 -lr 0.001 --type text

Finetune the model (namely, part of the Unet attention layers) with LoRA (incompatible with diffusers standard!):

python src/train.py --token mycat1 --term cat --data data/mycat1 -lr 0.0001 --type lora

Do the same with custom diffusion:

python src/train.py --token mycat1 --term cat --data data/mycat1 --term_data data/cat --type custom

Add --style if you're training for a style rather than an object. Speed up custom diffusion with --xformers (LoRA takes care of it on its own); add --low_mem if you get OOM.
Results of the training runs will be saved under the train directory.

Custom diffusion trains faster and can achieve impressive reproduction quality (including faces) with simple similar prompts, but it can lose the point on generation if the prompt is too complex or strays from the original category. To train it, you'll need both target reference images (data/mycat1) and more random images of similar subjects (data/cat). Apparently, you can generate the latter with SD itself.
LoRA finetuning seems less precise, and may interfere with a wider spectrum of topics.
Textual inversion is more generic but stable. Also, its embeddings can be easily combined together on load.

Generate an image with the trained embedding and weights from LoRA:

python src/gen.py -t "cosmic <mycat1> beast" --load_lora mycat1-lora.pt

Same with custom diffusion:

python src/gen.py -t "cosmic <mycat1> beast" --load_custom mycat1-custom.pt

Same with textual inversion (you may provide a folder path to load several files at once):

python src/gen.py -t "cosmic <mycat1> beast" --load_token mycat1-text.pt

You can also run python src/latwalk.py ... with finetuned weights to make animations.

Besides special tokens (e.g. <mycat1>) as above, text prompts may include brackets for weighting (like (good) [bad] ((even better)) [[even worse]]).
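
To make the bracket syntax more tangible, here is a small sketch that splits a prompt into (text, weight) chunks by nesting depth. It's not the parser used in this repo: the multiplier of 1.1 per bracket level is borrowed from the convention popular in other SD tools and is an assumption here, as is the function name parse_weights.

```python
def parse_weights(prompt, step=1.1):
    """Split a prompt into (text, weight) chunks by bracket nesting depth.

    Each "(" level multiplies the weight by step, each "[" level divides by it.
    The step value is an assumed convention, not taken from this repo.
    """
    chunks, depth, buf = [], 0, []

    def flush():
        # Emit the buffered text with the weight of the current depth
        text = "".join(buf).strip()
        if text:
            chunks.append((text, round(step ** depth, 4)))
        buf.clear()

    for ch in prompt:
        if ch in "([":
            flush()
            depth += 1 if ch == "(" else -1
        elif ch in ")]":
            flush()
            depth -= 1 if ch == ")" else -1
        else:
            buf.append(ch)
    flush()
    return chunks

print(parse_weights("(good) [bad] ((even better)) [[even worse]]"))
```

Under these assumptions, "good" gets weight 1.1, "bad" about 0.91, "even better" 1.21 and "even worse" about 0.83; unbracketed text keeps weight 1.0.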
More radical blending can be achieved with the multi-guidance technique, introduced here (interpolating predicted noise within the diffusion denoising loop, instead of conditioning vectors). It can be used to draw images from complex prompts like good prompt ~1 | also good prompt ~1 | bad prompt ~-0.5 with the --cguide option, or for animations with the --lguide option (further enhancing smoothness of latent blending). Note that it slows down the generation process.
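
A rough sketch of the idea, in plain Python with lists standing in for noise tensors. The parser follows the text ~weight | text ~weight syntax shown above; the mixing formula (classifier-free guidance extended to several weighted prompts) is my assumption of how such blending can work, not a copy of the actual code, and both function names are hypothetical.

```python
def parse_multiprompt(line):
    """'text ~w | text ~w' -> [(text, w)]; weight defaults to 1.0 if omitted."""
    parts = []
    for chunk in line.split("|"):
        text, sep, weight = chunk.rpartition("~")
        if sep:
            parts.append((text.strip(), float(weight)))
        else:
            parts.append((chunk.strip(), 1.0))
    return parts

def guided_noise(eps_uncond, eps_conds, weights, cfg_scale=7.5):
    """Mix per-prompt noise predictions around the unconditional one.

    Assumed formula: eps = eps_uncond + cfg_scale * sum_i w_i * (eps_i - eps_uncond)
    """
    out = list(eps_uncond)
    for eps, w in zip(eps_conds, weights):
        for i, (e, u) in enumerate(zip(eps, eps_uncond)):
            out[i] += cfg_scale * w * (e - u)
    return out

prompts = parse_multiprompt("good prompt ~1 | also good prompt ~1 | bad prompt ~-0.5")
weights = [w for _, w in prompts]
```

Every prompt needs its own noise prediction at every denoising step, which is where the slowdown comes from. Note also that several positive weights add up in this naive form, so a real implementation would probably normalize them.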

Special model: SDXL

For now, SDXL is not fully integrated into SDfu core, so there's a separate script, basically wrapping existing diffusers pipelines.
Supported features: txt2img, img2img, inpaint, depth/canny controlnet, text interpolations, dual prompts (native for SDXL).
Unsupported (yet): latent blending, multi guidance, fine-tuning, weighted prompts.
NB: The models (~8gb total) are auto-downloaded on the first use; you may download them yourself and set the path with --models_dir ... option.
As an example, interpolate with ControlNet:

python src/sdxl.py -v -t yourfile.txt -cimg _in/something.jpg -cmod depth -cts 0.6 --size 1280-768 -fs 5

Special model: Kandinsky 2.2

Another interesting model is Kandinsky 2.2, featuring txt2img, img2img, inpaint, depth-based controlnet and simple interpolations. Its architecture and pipelines differ from Stable Diffusion, so there's also a separate script for it, wrapping those pipelines. The options are similar to the above; run python src/kand.py -h to see unused ones. It also consumes only unweighted prompts (no brackets, etc).
NB: The models (heavy!) are auto-downloaded on the first use; you may download them yourself and set the path with --models_dir ... option.
As an example, interpolate with ControlNet:

python src/kand.py -v -t yourfile.txt -cimg _in/something.jpg -cts 0.6 --size 1280-720 -fs 5

Special model: Text to Video

Generate a short HD video from a text prompt with the ZeroScope model:

python src/vid.py -t "dragon in a China shop" --model vzs

Process an existing video in two passes (first lo-res ZeroScope, then hi-res Potat):

python src/vid.py -t "combat in the dancehall" --in_vid yourvideo.mp4 --model vzs --model_up vpot

A longer input video is cut into short pieces and processed one by one. It may be better to process only in hi-res (omitting the --model ... option).
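
The cutting itself is nothing fancy; it can be pictured as splitting the frame range into consecutive chunks. A minimal sketch, where the function name and the chunk length of 24 frames are placeholders (the real piece length depends on the model and the script settings):

```python
def split_frames(total, chunk_len):
    """Return (start, end) index pairs covering range(total) in order."""
    if chunk_len < 1:
        raise ValueError("chunk_len must be positive")
    return [(s, min(s + chunk_len, total)) for s in range(0, total, chunk_len)]

# 50 frames in pieces of 24: two full chunks and a short tail
print(split_frames(50, 24))
```

Each piece is processed independently in this picture, so some discontinuity at the chunk borders is to be expected.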
NB: the models are limited to rather mundane stuff, don't expect any notable level of abstraction or fantasy here.

Credits

It's quite hard to mention all those who made the current revolution in visual creativity possible. Check the inline links above for some of the sources. Huge respect to the people behind Stable Diffusion, Hugging Face, and the whole open-source movement.

Songchunzhang / SDfu