Open-Set Grounded Text-to-Image Generation

GLIGEN: Open-Set Grounded Text-to-Image Generation (CVPR 2023)

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li*, Yong Jae Lee* (*Co-senior authors)

[Project Page] [Paper] [Demo] [YouTube Video]

  • Go beyond the text prompt with GLIGEN: it adds new grounding capabilities to frozen text-to-image generation models, conditioning on various prompts including boxes, keypoints, and reference images.
  • GLIGEN's zero-shot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.

🔥 News

  • [2023.03.05] Gradio demo code is released at GLIGEN/demo!
  • [2023.03.03] Code base and checkpoints are released!
  • [2023.02.28] Paper is accepted to CVPR 2023!

Requirements

We provide a Dockerfile to set up the environment.

Download GLIGEN models

We provide five checkpoints for different use scenarios. All models here are based on SD-V-1.4.

Mode       | Modality       | Download
-----------|----------------|---------
Generation | Box+Text       | HF Hub
Generation | Box+Text+Image | HF Hub
Generation | Keypoint       | HF Hub
Inpainting | Box+Text       | HF Hub
Inpainting | Box+Text+Image | HF Hub

Inference: Generate images with GLIGEN

We provide a script to generate images using the provided checkpoints. First download the models and put them in gligen_checkpoints, then run

python gligen_inference.py

Example samples for each checkpoint will be saved in generation_samples. Check gligen_inference.py for more details about the interface.
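To make the box-grounded setup concrete, here is a hedged sketch of the kind of grounding specification such a script consumes: a caption, the phrases to ground, and one normalized box per phrase. The field names (`prompt`, `phrases`, `locations`) and the helper `to_pixel_boxes` are illustrative, not the script's actual interface; see gligen_inference.py for the real one.

```python
# Hypothetical grounding specification: one box per grounded phrase.
# Field names are illustrative, not gligen_inference.py's actual API.
meta = dict(
    prompt="a teddy bear next to a red ball on the grass",
    phrases=["a teddy bear", "a red ball"],
    # Boxes as normalized [x0, y0, x1, y1] coordinates in [0, 1].
    locations=[[0.10, 0.40, 0.45, 0.90], [0.55, 0.60, 0.80, 0.85]],
)

def to_pixel_boxes(locations, size=512):
    """Convert normalized boxes to pixel coordinates for a size x size image."""
    return [[round(c * size) for c in box] for box in locations]

print(to_pixel_boxes(meta["locations"]))  # boxes scaled to a 512x512 canvas
```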

Training

Grounded generation training

One first needs to prepare data for the different grounding modality conditions. Refer to data for the data we used for the different GLIGEN models. Once the data is ready, use the following command to train GLIGEN (multi-GPU training is supported):

python main.py --name=your_experiment_name  --yaml_file=path_to_your_yaml_config

The --yaml_file is the most important argument; below we use one example to explain its key components, so that one can become familiar with our code and customize training for new grounding modalities. The other arguments are self-explanatory from their names. The experiment will be saved in OUTPUT_ROOT/name.

One can refer to configs/flicker_text.yaml as an example. Six components define this yaml: diffusion, model, autoencoder, text_encoder, train_dataset_names and grounding_tokenizer_input. Typically, diffusion, autoencoder and text_encoder should not be changed, as they are defined by Stable Diffusion. One should pay attention to the following:

  • Within model we add a new argument grounding_tokenizer, which defines a network producing grounding tokens. This network will be instantiated inside the model. Refer to ldm/modules/diffusionmodules/grounding_net_example.py for more details about defining this network.
  • grounding_tokenizer_input defines a network that takes in batch data from the dataloader and produces input for the grounding_tokenizer. In other words, it is an intermediate class between the dataloader and the grounding_tokenizer. Refer to grounding_input/__init__.py for details about defining this class.
  • train_dataset_names should list a series of dataset names (all datasets are concatenated internally, which is useful for combining datasets during training). Each dataset name must first be registered in dataset/catalog.py. We have listed all the datasets we used; to train GLIGEN on your own modality dataset, don't forget to register its name there first.
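The six components above can be pictured as a yaml skeleton like the following. The keys match the component names described above, but the values are elided and the comments are our gloss, not the contents of the actual config; consult configs/flicker_text.yaml for the real file.

```yaml
# Illustrative skeleton only -- values elided; see configs/flicker_text.yaml.
diffusion: {}                   # SD diffusion process; normally unchanged
model: {}                       # UNet; the new grounding_tokenizer goes here
autoencoder: {}                 # SD VAE; normally unchanged
text_encoder: {}                # SD CLIP text encoder; normally unchanged
train_dataset_names: {}         # datasets registered in dataset/catalog.py
grounding_tokenizer_input: {}   # adapter between dataloader and tokenizer
```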
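To illustrate the role of the grounding tokenizer, here is a minimal NumPy sketch of the common pattern for box grounding: encode each normalized box with a sinusoidal (Fourier) embedding and fuse it with that phrase's text feature into one grounding token per entity. This is a conceptual stand-in, not the repo's actual network; the real tokenizer is a learned module (see ldm/modules/diffusionmodules/grounding_net_example.py), and the 768-dim text feature is only an assumed CLIP-like size.

```python
import numpy as np

def fourier_embed(coords, num_freqs=8):
    """Sinusoidal embedding of normalized box coordinates: the kind of
    positional encoding a grounding tokenizer typically starts from."""
    coords = np.asarray(coords, dtype=np.float64)      # (N, 4)
    freqs = 2.0 ** np.arange(num_freqs)                # (F,)
    angles = coords[..., None] * freqs * np.pi         # (N, 4, F)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(coords.shape[0], -1)            # (N, 4 * 2 * F)

def grounding_tokens(boxes, text_features):
    """Fuse box embeddings with per-phrase text features into one
    grounding token per entity (a stand-in for the learned MLP)."""
    box_emb = fourier_embed(boxes)                     # (N, 64)
    return np.concatenate([box_emb, text_features], axis=-1)

boxes = [[0.1, 0.4, 0.45, 0.9]]
text_features = np.zeros((1, 768))   # assumed CLIP-like text feature dim
tokens = grounding_tokens(boxes, text_features)
print(tokens.shape)                  # (1, 832): 64 box dims + 768 text dims
```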

Grounded inpainting training

GLIGEN also supports inpainting training. The following command can be used:

python main.py --name=your_experiment_name  --yaml_file=path_to_your_yaml_config --inpaint_mode=True  --ckpt=path_to_an_adapted_model

Typically, we first train GLIGEN on a generation task (e.g., text-grounded generation); this model has 4 channels in its input conv (the latent space of Stable Diffusion). We then modify the saved checkpoint to 9 channels, with the additional 5 channels initialized to 0. Continuing training from this checkpoint leads to faster convergence and better results. path_to_an_adapted_model refers to this modified checkpoint; convert_ckpt.py can be used to modify the checkpoint. NOTE: the yaml file is the same for generation and inpainting training; one only needs to change --inpaint_mode.
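The 4-to-9 channel adaptation described above amounts to zero-padding the input-conv weight along the input-channel axis. Here is a hedged NumPy sketch of that operation; convert_ckpt.py operates on the real checkpoint's tensors and keys, which differ from this toy example, and the (320, 4, 3, 3) shape is only an assumed SD-like weight shape.

```python
import numpy as np

def expand_input_conv(weight, new_in=9):
    """Zero-initialize extra input channels of the first conv so an
    inpainting model can reuse a 4-channel generation checkpoint.
    Sketch of the idea behind convert_ckpt.py; not its actual code."""
    out_c, in_c, kh, kw = weight.shape
    expanded = np.zeros((out_c, new_in, kh, kw), dtype=weight.dtype)
    expanded[:, :in_c] = weight      # keep the pretrained 4 channels
    return expanded                  # remaining channels stay at 0

w = np.random.randn(320, 4, 3, 3).astype(np.float32)  # assumed SD-like shape
w9 = expand_input_conv(w)
print(w9.shape)                      # (320, 9, 3, 3)
```

Because the new channels start at zero, the adapted model initially computes exactly what the generation checkpoint did, which is why continued training converges faster.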

Citation

@article{li2023gligen,
  title={GLIGEN: Open-Set Grounded Text-to-Image Generation},
  author={Li, Yuheng and Liu, Haotian and Wu, Qingyang and Mu, Fangzhou and Yang, Jianwei and Gao, Jianfeng and Li, Chunyuan and Lee, Yong Jae},
  journal={CVPR},
  year={2023}
}

Disclaimer

The original GLIGEN was partly implemented and trained during an internship at Microsoft. This repo re-implements GLIGEN in PyTorch with university GPUs after the internship. Despite the minor implementation differences, this repo aims to reproduce the results and observations in the paper for research purposes.
