PixArt-alpha

Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Home Page: https://pixart-alpha.github.io/


👉 PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis


This repo contains PyTorch model definitions, pre-trained weights, and inference/sampling code for our paper on fast training of diffusion models with transformers. You can find more visualizations on our project page.

PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Junsong Chen*, Jincheng Yu*, Chongjian Ge*, Lewei Yao*, Enze Xie†, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li
Huawei Noah’s Ark Lab, Dalian University of Technology, HKU, HKUST


🚩 New Features/Updates

  • ✅ Nov. 10, 2023. Support DALL-E 3 Consistency Decoder in 🧨 diffusers.
  • ✅ Nov. 06, 2023. Release pretrained weights with 🧨 diffusers integration, Hugging Face demo, and Google Colab example.
  • ✅ Nov. 03, 2023. Release the LLaVA-captioning inference code.
  • ✅ Oct. 27, 2023. Release the training & feature extraction code.
  • ✅ Oct. 20, 2023. Collaborating with the Hugging Face & Diffusers team to co-release the code and weights (please stay tuned).
  • ✅ Oct. 15, 2023. Release the inference code.

🐱 Abstract

TL; DR: PixArt-α is a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), and the training speed markedly surpasses existing large-scale T2I models, e.g., PixArt-α only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days).

Full abstract: The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering fundamental innovation in the AIGC community while increasing CO2 emissions. This paper introduces PixArt-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into the Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PixArt-α's training speed markedly surpasses existing large-scale T2I models, e.g., PixArt-α takes only 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing CO2 emissions by 90%. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1% of RAPHAEL's. Extensive experiments demonstrate that PixArt-α excels in image quality, artistry, and semantic control. We hope PixArt-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.
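
Design (2) above swaps DiT's class-conditioning branch for cross-attention over text embeddings. The minimal PyTorch sketch below is for intuition only; the module names, dimensions, and the omission of timestep/adaLN conditioning are our own simplifications, not the repository's actual implementation.

import torch.nn as nn

class TextConditionedDiTBlock(nn.Module):
    """Illustrative only: a DiT-style block with an added cross-attention
    layer that injects T5 text features (not the repo's exact code)."""

    def __init__(self, dim=1152, num_heads=16, text_dim=4096):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.text_proj = nn.Linear(text_dim, dim)       # project T5 features to model width
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, text_emb):
        # x: (B, N, dim) latent patch tokens; text_emb: (B, L, text_dim) T5 token features.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]              # self-attention over image tokens
        t = self.text_proj(text_emb)
        h = self.norm2(x)
        x = x + self.cross_attn(h, t, t)[0]             # image tokens attend to text tokens
        x = x + self.mlp(self.norm3(x))
        return x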

A small cactus with a happy face in the Sahara desert.


🔥🔥🔥 Why PixArt-α?

Training Efficiency

PixArt-α takes only 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing CO2 emissions by 90%. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1% of RAPHAEL's.

| Method    | Type | #Params | #Images | A100 GPU days |
|-----------|------|---------|---------|---------------|
| DALL·E    | Diff | 12.0B   | 1.54B   | –             |
| GLIDE     | Diff | 5.0B    | 5.94B   | –             |
| LDM       | Diff | 1.4B    | 0.27B   | –             |
| DALL·E 2  | Diff | 6.5B    | 5.63B   | 41,66         |
| SDv1.5    | Diff | 0.9B    | 3.16B   | 6,250         |
| GigaGAN   | GAN  | 0.9B    | 0.98B   | 4,783         |
| Imagen    | Diff | 3.0B    | 15.36B  | 7,132         |
| RAPHAEL   | Diff | 3.0B    | 5.0B    | 60,000        |
| PixArt-α  | Diff | 0.6B    | 0.025B  | 675           |
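
The headline efficiency numbers follow directly from the table; here is a quick check (the dollar figures are those quoted in the abstract):

# Quick check of the efficiency figures quoted above.
pixart_days, sdv15_days, raphael_days = 675, 6_250, 60_000
print(f"vs. SDv1.5:  {pixart_days / sdv15_days:.1%} of the A100 GPU days")    # ~10.8%
print(f"vs. RAPHAEL: {pixart_days / raphael_days:.1%} of the A100 GPU days")  # ~1.1%
print(f"estimated saving vs. SDv1.5: ${320_000 - 26_000:,}")                  # $294,000, i.e. nearly $300,000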

High-quality Generation from PixArt-α

  • More samples

🔧 Dependencies and Installation

conda create -n pixart python==3.9.0
conda activate pixart
cd path/to/pixart
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
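
Optionally, a quick sanity check that the CUDA build of PyTorch was picked up:

import torch

print(torch.__version__)          # expect 2.0.0+cu117
print(torch.cuda.is_available())  # should be True on a GPU machine
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))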

⏬ Download Models

All models will be automatically downloaded. You can also choose to download manually from this url.

| Model             | #Params | Download URL              | Download in OpenXLab |
|-------------------|---------|---------------------------|----------------------|
| T5                | 4.3B    | T5                        | T5                   |
| VAE               | 80M     | VAE                       | VAE                  |
| PixArt-α-SAM-256  | 0.6B    | 256                       | 256                  |
| PixArt-α-512      | 0.6B    | 512 or diffusers version  | 512                  |
| PixArt-α-1024     | 0.6B    | 1024 or diffusers version | 1024                 |

You can also find all models in OpenXLab_PixArt-alpha.
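
If you prefer to fetch the diffusers-format weights yourself, a minimal sketch with huggingface_hub is shown below; the repo id matches the one used in the diffusers example further down, and the files land in your local Hugging Face cache.

from huggingface_hub import snapshot_download

# Download the diffusers-format PixArt-α 1024px weights into the local HF cache.
local_dir = snapshot_download(repo_id="PixArt-alpha/PixArt-XL-2-1024-MS")
print("Model files cached at:", local_dir)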

🔥 How to Train

Here we use the SAM dataset training config as an example, but of course, you can also prepare your own dataset following this method.

You ONLY need to change the config file in config and the dataloader in dataset.

python -m torch.distributed.launch --nproc_per_node=2 --master_port=12345 scripts/train.py configs/pixart_config/PixArt_xl2_img256_SAM.py --work-dir output/train_SAM_256

The directory structure for the SAM dataset is:

cd ./data

SA1B
├──images/
│  ├──sa_xxxxx.jpg
│  ├──sa_xxxxx.jpg
│  ├──......
├──partition/
│  ├──part0.txt
│  ├──part1.txt
│  ├──......
├──caption_feature_wmask/
│  ├──sa_xxxxx.npz
│  ├──sa_xxxxx.npz
│  ├──......
├──img_vae_feature/
│  ├──train_vae_256/
│  │  ├──noflip/
│  │  │  ├──sa_xxxxx.npy
│  │  │  ├──sa_xxxxx.npy
│  │  │  ├──......
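
Before launching training, a small (hypothetical) helper can verify that your data folder matches the layout above; the paths below simply mirror the tree shown here.

from pathlib import Path

# Sanity-check the SA1B layout shown above (adjust the root as needed).
root = Path("data/SA1B")
expected = [
    "images",
    "partition",
    "caption_feature_wmask",
    "img_vae_feature/train_vae_256/noflip",
]
for sub in expected:
    path = root / sub
    print(f"{path}: {'ok' if path.is_dir() else 'MISSING'}")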

💻 How to Test

Inference requires at least 23GB of GPU memory with this repo, and about 11GB when using 🧨 diffusers.

The following inference options are currently supported:

Quick start with Gradio

To get started, first install the required dependencies, then run on your local machine:

python scripts/interface.py --model_path path/to/model.pth --image_size=1024 --port=12345

Then open http://your-server-ip:12345 in your browser to try a simple example.

Using in 🧨 diffusers

Make sure you have the updated versions of the following libraries:

pip install -U transformers accelerate diffusers

And then:

import torch
import os
from diffusers import PixArtAlphaPipeline, ConsistencyDecoderVAE, AutoencoderKL

# You can replace the checkpoint id with "PixArt-alpha/PixArt-XL-2-512x512" too.
pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16, variant="fp16", use_safetensors=True)

# To use the DALL-E 3 Consistency Decoder, uncomment the following line:
# pipe.vae = ConsistencyDecoderVAE.from_pretrained("openai/consistency-decoder", torch_dtype=torch.float16)

# Enable memory optimizations.
pipe.enable_model_cpu_offload()

prompt = "A small cactus with a happy face in the Sahara desert."
image = pipe(prompt).images[0]

Check out the documentation to learn more.

This integration allows running the pipeline with a batch size of 4 under 11 GB of GPU VRAM. GPU VRAM consumption under 10 GB will soon be supported, too. Stay tuned.
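
For instance, a batch of 4 images for a single prompt can be requested through the standard num_images_per_prompt argument (a sketch continuing from the pipeline created above):

# Continuing from the pipeline above: generate a batch of 4 images in one call.
prompt = "A small cactus with a happy face in the Sahara desert."
images = pipe(prompt, num_images_per_prompt=4).images
for i, image in enumerate(images):
    image.save(f"cactus_{i}.png")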

Gradio with diffusers (Faster)

To get started, first install the required dependencies, then run on your local machine:

# diffusers version
DEMO_PORT=12345 python scripts/app.py

Then open http://your-server-ip:12345 in your browser to try a simple example.

You can also click here to have a free trial on Google Colab.
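
For a rough idea of what such a demo does internally, a minimal Gradio wrapper around the diffusers pipeline might look like the sketch below; the actual scripts/app.py in this repo is more featureful.

import os

import gradio as gr
import torch
from diffusers import PixArtAlphaPipeline

# Minimal sketch of a text-to-image demo (not the repo's scripts/app.py).
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")

def generate(prompt):
    return pipe(prompt).images[0]

demo = gr.Interface(fn=generate, inputs="text", outputs="image")
demo.launch(server_name="0.0.0.0", server_port=int(os.environ.get("DEMO_PORT", 12345)))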

Online Demo: Hugging Face PixArt

✏️ How to caption with LLaVA

Thanks to the LLaVA-Lightning-MPT code base, we can caption the LAION and SAM datasets with the following command:

python scripts/VLM_caption_lightning.py --output output/dir/ --data-root data/root/path --index path/to/data.json

We present auto-labeling with custom prompts for LAION (left) and SAM (right). The words highlighted in green represent the original caption in LAION, while those marked in red indicate the detailed captions labeled by LLaVA.

Dialog with LLaVA.

💪To-Do List

  • Inference code
  • Training code
  • T5 & VAE feature extraction code
  • LLaVA captioning code
  • Model zoo
  • Diffusers version & Hugging Face demo
  • Google Colab example
  • DALLE3 VAE integration
  • SAM-LLaVA caption dataset
  • ControlNet code
  • SA-Solver code

📖BibTeX

@misc{chen2023pixartalpha,
      title={PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis}, 
      author={Junsong Chen and Jincheng Yu and Chongjian Ge and Lewei Yao and Enze Xie and Yue Wu and Zhongdao Wang and James Kwok and Ping Luo and Huchuan Lu and Zhenguo Li},
      year={2023},
      eprint={2310.00426},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

🤗Acknowledgements

  • Thanks to Diffusers for their wonderful technical support and awesome collaboration!
  • Thanks to Hugging Face for sponsoring the nice demo!
  • Thanks to DiT for their wonderful work and codebase!

📄 License

This project is released under the GNU Affero General Public License v3.0.

