YangLing0818 / RPG-DiffusionMaster

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)

Home Page: https://arxiv.org/abs/2401.11708

Success rate of the MLLM layout output (Colab notebook + example outputs)

ShashwatNigam99 opened this issue

Hello authors, thanks for this interesting work and for releasing the code.

I have set up a Colab notebook (GPT-4) here, where I have been trying out some examples.

I wanted to discuss the frequent case where the MLLM outputs a layout that doesn't make sense, or assigns incorrect captions to the regions of the layout it outputs. In a few runs with a simple prompt -
"A couple, the beautiful girl on the right, silver hair, braided ponytail, happy, dynamic, energetic, peaceful, the handsome young man on the right detailed gorgeous face, grin, blonde hair, enchanting"
I got the outputs attached below (only one of them makes sense). Is this expected, or is there a solution? Kindly let me know if I am not running the code in the intended way.
Here is a doc with the expanded output (1 success case, 1 failure case).

Attached images: gpt4_image_20240124_001743, gpt4_image_20240124_050308, gpt4_image_20240124_051546

Thanks for your comments. Our template library is still being organized and is not yet fully open; what is released is only an initial version, and we will keep updating it. Because responses from large language models are inherently random, the expected output is not guaranteed on every run. We will progressively refine the RPG system to make the generated results more accurate and stable.

We think the main problem may be your ambiguous text prompt, because the girl and the man are both described as being on the right. Another possible reason is that you may have neglected to set the variable "use_base". This setting is critical for generating different objects of the same class, as discussed in the ablation study of our paper.
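One common way to cope with the run-to-run randomness described above is to check each MLLM response and ask again when the check fails. The following is only a minimal sketch under stated assumptions, not part of the RPG code: the layout format (a list of dicts with a "caption" key), the `ask_mllm` callable, and both function names are hypothetical, and the check is deliberately simple (each subject must be named in exactly one region, and no two subjects may share a region).

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical layout format: one dict per region, each with a "caption" key.
Layout = List[Dict[str, str]]


def layout_is_plausible(layout: Layout, subjects: List[str]) -> bool:
    """Return True if each subject is named in exactly one, distinct region."""
    if len(layout) != len(subjects):
        return False
    claimed = set()
    for subject in subjects:
        # Whole-word match so that "man" does not match inside "woman".
        pattern = re.compile(rf"\b{re.escape(subject.lower())}\b")
        hits = [
            index
            for index, region in enumerate(layout)
            if pattern.search(region.get("caption", "").lower())
        ]
        if len(hits) != 1 or hits[0] in claimed:
            return False
        claimed.add(hits[0])
    return True


def plan_with_retries(
    ask_mllm: Callable[[str], Layout],  # hypothetical wrapper around the MLLM call
    prompt: str,
    subjects: List[str],
    max_attempts: int = 3,
) -> Optional[Layout]:
    """Query the planner until a layout passes the check, or give up."""
    for _ in range(max_attempts):
        layout = ask_mllm(prompt)
        if layout_is_plausible(layout, subjects):
            return layout
    return None
```

A check like this cannot repair a prompt that is itself contradictory; it only filters out responses that put both subjects in one region or drop a subject entirely.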

Hi @YangLing0818, thanks for the reply!
I realize the text is ambiguous, but it is one of the example prompts in your README, so you might want to update that!

Also, I have been using the base prompt option with a weight of 0.2 (standard regional prompter settings). Thanks for the pointer.
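To make the 0.2 figure concrete, here is a minimal sketch of what a base weight is commonly understood to do, assuming it acts as a linear interpolation weight between the output conditioned on the base prompt and the output composed from the regional prompts. The function name and the flat-list representation are hypothetical and chosen for illustration; this is not taken from the RPG or Regional Prompter source.

```python
from typing import List


def blend_with_base(
    base: List[float],
    regional: List[float],
    base_ratio: float = 0.2,
) -> List[float]:
    """Mix base-prompt values with region-composed values element by element.

    Assumption: base_ratio is the weight of the base prompt and
    (1 - base_ratio) is the weight of the regional result.
    """
    if not 0.0 <= base_ratio <= 1.0:
        raise ValueError("base_ratio must be between 0 and 1")
    if len(base) != len(regional):
        raise ValueError("base and regional must have the same length")
    return [
        base_ratio * b + (1.0 - base_ratio) * r
        for b, r in zip(base, regional)
    ]
```

Under this assumption, a weight of 0.2 leaves the regional prompts in control of 80% of the result, while the shared base prompt contributes the remaining 20% to every region.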