Würstchen

What is this?

Würstchen is a new framework for training text-conditional models that moves the computationally expensive text-conditional stage into a highly compressed latent space. Common approaches use a single compression stage, whereas Würstchen adds a second stage that compresses the image even further. In total, Stages A and B are responsible for compressing images, and Stage C learns the text-conditional part in the low-dimensional latent space. This gives Würstchen a 42x spatial compression factor while still reconstructing images faithfully, which makes training Stage C fast and computationally cheap. We refer to the paper for details.
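
To make the compression factor concrete, the arithmetic can be sketched as follows. The helper name `latent_side` and the use of floor division are assumptions made for illustration, and the actual model may round differently. The 8x factor used for comparison is the one typically used by single-stage latent diffusion models such as Stable Diffusion.

```python
def latent_side(image_side: int, compression: int = 42) -> int:
    """Approximate side length of the latent grid for a square image.

    Assumption: floor division; the real model may round differently.
    """
    return image_side // compression


for side in (512, 1024):
    two_stage = latent_side(side)         # 42x, the factor stated above
    single_stage = latent_side(side, 8)   # 8x, a typical single-stage factor
    print(
        f"{side}x{side} image: "
        f"{two_stage}x{two_stage} latents at 42x vs "
        f"{single_stage}x{single_stage} latents at 8x"
    )
```

Under these assumptions a 512x512 image maps to a 12x12 latent grid (144 positions) instead of 64x64 (4096 positions), which is where the savings for Stage C come from.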

Use Würstchen

You can use the model through the notebooks here. The Stage B notebook is only for reconstruction, and the Stage C notebook is for text-conditional generation. You can also try text-to-image generation on Google Colab.

Using in 🧨 diffusers

Würstchen is fully integrated into the diffusers library. Here's how to use it:

# pip install -U transformers accelerate diffusers

import torch
from diffusers import AutoPipelineForText2Image
from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS

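# Load the combined prior (Stage C) + decoder pipeline in half precision on the GPU
pipe = AutoPipelineForText2Image.from_pretrained("warp-ai/wuerstchen", torch_dtype=torch.float16).to("cuda")

caption = "Anthropomorphic cat dressed as a fire fighter"
images = pipe(
    caption,
    width=1024,
    height=1536,
    prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,  # default Stage C timestep schedule
    prior_guidance_scale=4.0,  # guidance scale for the text-conditional prior
    num_images_per_prompt=2,
).images  # list of PIL images, two per prompt here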

Refer to the official documentation to learn more.
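
The pipeline call above returns a list of PIL images by default. A minimal sketch for writing them to disk could look like this. The helper `save_images`, the `outputs` directory and the `wuerstchen` file prefix are hypothetical names chosen for illustration and are not part of diffusers.

```python
from pathlib import Path


def save_images(images, out_dir="outputs", prefix="wuerstchen"):
    """Save each image as <out_dir>/<prefix>_<index>.png and return the paths.

    Assumes every item exposes a PIL-style .save(path) method.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    saved = []
    for index, image in enumerate(images):
        target = out_path / f"{prefix}_{index}.png"
        image.save(target)
        saved.append(target)
    return saved
```

With the example above, `save_images(images)` would write `outputs/wuerstchen_0.png` and `outputs/wuerstchen_1.png`, since `num_images_per_prompt=2`.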

Train your own Würstchen

Training Würstchen is considerably faster and cheaper than training other text-to-image models because it trains in a much smaller latent space of 12x12. We provide training scripts for both Stage B and Stage C.

Download Models

Model	Download	Parameters	Conditioning	Training Steps	Resolution
Würstchen v1	Hugging Face	1B (Stage C) + 600M (Stage B) + 19M (Stage A)	CLIP-H-Text	800,000	512x512
Würstchen v2	Hugging Face	1B (Stage C) + 600M (Stage B) + 19M (Stage A)	CLIP-bigG-Text	918,000	1024x1024

Acknowledgment

Special thanks to Stability AI for providing compute for our research.

About

Official implementation of Würstchen: Efficient Pretraining of Text-to-Image Models

https://arxiv.org/abs/2306.00637

MIT License
