GlueGen


This repository is for the paper:

GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation
Can Qin 1, Ning Yu 2, Chen Xing 2, Shu Zhang 2, Zeyuan Chen 2, Stefano Ermon 3, Yun Fu 1, Caiming Xiong 2, Ran Xu 2
1 Northeastern University 2 Salesforce AI Research 3 Stanford University
Work done when Can Qin was an intern at Salesforce AI Research.

With the GlueNet model proposed in the GlueGen framework, a pre-trained image generator (i.e., the UNet of Stable Diffusion) can be bridged to off-the-shelf single- or multi-modal encoders to expand its functionality, e.g., multilingual or sound-to-image generation, within a limited budget. GlueNet is trained offline and requires neither back-propagation through the UNet nor image-text pairs for training, which makes GlueGen flexible and efficient.
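As a rough, hypothetical sketch (not the exact architecture from the paper), GlueNet can be thought of as a small translator network that maps token features from a new encoder (e.g., XLM-Roberta) into the CLIP text-embedding space that the frozen UNet already consumes; all dimensions and layer choices below are illustrative assumptions.

import torch
import torch.nn as nn

class GlueNetSketch(nn.Module):
    # Hypothetical translator: new-encoder features -> CLIP text-feature space.
    def __init__(self, src_dim=768, tgt_dim=768, hidden_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, tgt_dim),
        )

    def forward(self, x):
        # x: (batch, seq_len, src_dim) token features from the new encoder.
        return self.net(x)  # (batch, seq_len, tgt_dim), fed to the frozen UNet.

# Offline training: only the translator is updated; the UNet and both encoders
# stay frozen, and no image-text pairs are needed.
translator = GlueNetSketch()
src = torch.randn(4, 77, 768)  # e.g., XLM-Roberta features for a caption
tgt = torch.randn(4, 77, 768)  # e.g., CLIP text features for the same caption
loss = nn.functional.mse_loss(translator(src), tgt)
loss.backward()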

Multilingual Text to Image Generation

Multilingual text-to-image generation results at 512 × 512 resolution from XLM-Roberta + GlueNet + the Stable Diffusion (SDM) decoder, all with the same caption: "afternoon garden oil painting painted by impressionists".

Sound to Image Generation

Example sound-to-image generation results on UrbanSound8K.

Sound-text-mix to Image Generation

(a) and (b) are example sound-text-mix to image generation results.

Instructions for GlueGen

Environment Preparation

First, set up the stable-diffusion environment (this may take a few minutes):

cd ./stable-diffusion
PIP_EXISTS_ACTION=w conda env create -f environment.yaml
conda activate gluegen

Then, install the packages for audioclip:

cd ./stable-diffusion/audioclip
pip install -r requirements.txt
pip install -U llvmlite==0.32.1
pip install -e .

Download Checkpoints

Download the official SD v1.5 checkpoint to ./checkpoints_all/checkpoint_sd_v1, saving it as ./checkpoints_all/checkpoint_sd_v1/v1-5-pruned-emaonly.ckpt (available from https://huggingface.co/runwayml/stable-diffusion-v1-5).
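One way to fetch it is with the huggingface_hub Python package (an assumption on our part; any method that places the file at the path above, such as wget on the direct file URL, works equally well):

from huggingface_hub import hf_hub_download

# Downloads v1-5-pruned-emaonly.ckpt into ./checkpoints_all/checkpoint_sd_v1.
# Assumes huggingface_hub is installed and the repository is still hosted.
hf_hub_download(
    repo_id="runwayml/stable-diffusion-v1-5",
    filename="v1-5-pruned-emaonly.ckpt",
    local_dir="./checkpoints_all/checkpoint_sd_v1",
)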

Then follow the audioclip README (./stable-diffusion/audioclip/README.md) to download its checkpoint to ./checkpoints_all/audioclip_checkpoint as ./checkpoints_all/audioclip_checkpoint/AudioCLIP-Full-Training.pt:

mkdir -p ./checkpoints_all/audioclip_checkpoint
cd ./checkpoints_all/audioclip_checkpoint
wget https://github.com/AndreyGuzhov/AudioCLIP/releases/download/v0.1/AudioCLIP-Full-Training.pt

Then download the pretrained GlueNet checkpoints and save them to ./checkpoints_all/gluenet_checkpoint:

bash download_gluenet_checkpoints.sh

Download Datasets

Download the audio dataset (UrbanSound8K) to ./data as ./data/urbansound8k:

bash download_us8k_data.sh

Download the multilingual text dataset to ./data:

bash download_multilingual_data.sh

Running Inference Code

Multilingual Stable Diffusion Inference:

cd stable-diffusion

python scripts/txt2img_demo_ml.py --prompt "下午的花园的印象派绘画" --plms --outdir outputs/text2img-multilingual --ckpt ../checkpoints_all/checkpoint_sd_v1/v1-5-pruned-emaonly.ckpt

python scripts/txt2img_demo_ml.py --prompt "Peinture impressionniste d'un jardin d'après-midi" --plms --outdir outputs/text2img-multilingual --ckpt ../checkpoints_all/checkpoint_sd_v1/v1-5-pruned-emaonly.ckpt

python scripts/txt2img_demo_ml.py --prompt "Pintura impresionista de un jardín de tarde" --plms --outdir outputs/text2img-multilingual --ckpt ../checkpoints_all/checkpoint_sd_v1/v1-5-pruned-emaonly.ckpt

python scripts/txt2img_demo_ml.py --prompt "午後の庭の印象派絵画" --plms --outdir outputs/text2img-multilingual --ckpt ../checkpoints_all/checkpoint_sd_v1/v1-5-pruned-emaonly.ckpt

python scripts/txt2img_demo_ml.py --prompt "Pittura impressionista di un giardino pomeridiano" --plms --outdir outputs/text2img-multilingual --ckpt ../checkpoints_all/checkpoint_sd_v1/v1-5-pruned-emaonly.ckpt

Sound-to-image Stable Diffusion Inference:

cd stable-diffusion

python scripts/sound2img_gluegen.py --plms --ckpt ../checkpoints_all/checkpoint_sd_v1/v1-5-pruned-emaonly.ckpt --outdir outputs/sound2img --config configs/stable-diffusion/v1-inference-trans-audioclip.yaml --scale 7.5  --n_iter 1 --audioclip_ckpt ../checkpoints_all/audioclip_checkpoint/AudioCLIP-Full-Training.pt

Running Training Code

Sound-to-image GlueNet Training:

cd ./sound-gluenet
CUDA_VISIBLE_DEVICES=0 python train_gluenet_sound_text.py

Multilingual Text-to-image GlueNet Training:

cd ./multilingual-gluenet
CUDA_VISIBLE_DEVICES=0 python train_gluenet_multi.py --DATA_PATH_SRC ../data/WikiMatrix.en-zh.txt.en --DATA_PATH_TAR ../data/WikiMatrix.en-zh.txt.zh --DATA_PATH_SRC_1 ../data/laion-1M-trans-en-zh-cn-en.txt --DATA_PATH_TAR_1 ../data/laion-1M-trans-en-zh-cn-zh-cn.txt --tarLanguage Chinese
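For intuition, the flags above point at parallel corpora: each English line is paired with its translation, the English side defines training targets through the frozen CLIP text encoder, and the non-English side is pushed through XLM-Roberta and GlueNet. A hedged sketch of that pairing (the encoder calls in the comments are placeholders, not functions from this repo):

def load_parallel_pairs(src_path, tar_path):
    # Yields aligned (English, Chinese) sentence pairs, one pair per line.
    with open(src_path, encoding="utf-8") as f_src, \
         open(tar_path, encoding="utf-8") as f_tar:
        for en_line, zh_line in zip(f_src, f_tar):
            yield en_line.strip(), zh_line.strip()

for en, zh in load_parallel_pairs(
    "../data/WikiMatrix.en-zh.txt.en", "../data/WikiMatrix.en-zh.txt.zh"
):
    # target = clip_text_encoder(en)       # frozen; defines the target space
    # source = xlm_roberta_encoder(zh)     # frozen multilingual encoder
    # loss = mse(gluenet(source), target)  # only GlueNet receives gradients
    break  # sketch only: just show the first pair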

Citation

If you find this project useful for your research, please kindly cite our paper:

@article{qin2023gluegen,
  title={GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation},
  author={Qin, Can and Yu, Ning and Xing, Chen and Zhang, Shu and Chen, Zeyuan and Ermon, Stefano and Fu, Yun and Xiong, Caiming and Xu, Ran},
  journal={arXiv preprint arXiv:2303.10056},
  year={2023}
}

Contact

If you have any questions, please contact Can Qin.

Acknowledgement

Stable Diffusion https://github.com/CompVis/stable-diffusion

AudioCLIP https://github.com/AndreyGuzhov/AudioCLIP

WikiMatrix https://github.com/facebookresearch/LASER/tree/main/tasks/WikiMatrix

License

Apache License 2.0

