Success rate of the MLLM layout output (Colab notebook + example outputs)

Question

ShashwatNigam99 opened this issue 4 months ago · comments

Meher Shashwat Nigam commented 4 months ago

Hello authors, thanks for this interesting work and for releasing the code.

I have set up a Colab notebook (GPT-4) here, where I have been trying out some examples.

I wanted to discuss the frequent case where the MLLM outputs a layout that doesn't make sense or assigns incorrect captions to the regions of that layout. In a few runs with a simple prompt -
"A couple, the beautiful girl on the right, silver hair, braided ponytail, happy, dynamic, energetic, peaceful, the handsome young man on the right detailed gorgeous face, grin, blonde hair, enchanting"
I got the outputs attached below (only one of them makes sense). Is this expected, or is there a solution? Kindly let me know if I am not running the code in the intended way.
Here is a doc with the expanded output (1 success case, 1 failure case).

Ling Yang · Answer 1 · Wed Jan 24 2024 14:38:39 GMT+0800 (China Standard Time)

Thanks for your comments. Currently, our template library is still being organized and is not fully open. It is only an initial version, and we will keep updating it. Because responses from large language models are inherently random, a usable layout is not guaranteed on every run. We will progressively enhance and refine our RPG system to ensure more accurate and stable generated results.
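
Until the planner itself is more stable, one common workaround for this kind of randomness is to validate each layout and re-query on failure. The following is only a minimal sketch, not part of the RPG code: `is_valid_layout`, `plan_with_retry`, the `plan_once` callable, and the `regions`/`captions` keys are all hypothetical names chosen for illustration, and the validity check is deliberately simple.

```python
from typing import Callable, Optional


def is_valid_layout(layout: dict) -> bool:
    """Hypothetical sanity check: at least one region, one non-empty caption per region."""
    regions = layout.get("regions", [])
    captions = layout.get("captions", [])
    return (
        len(regions) > 0
        and len(regions) == len(captions)
        and all(isinstance(c, str) and c.strip() != "" for c in captions)
    )


def plan_with_retry(
    plan_once: Callable[[str], dict],
    prompt: str,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Call the (assumed) planner up to max_attempts times.

    Returns the first layout that passes is_valid_layout, or None if
    every attempt fails the check.
    """
    for _ in range(max_attempts):
        layout = plan_once(prompt)
        if is_valid_layout(layout):
            return layout
    return None


def make_fake_planner(outputs: list) -> Callable[[str], dict]:
    """Stand-in for the MLLM call: replays canned outputs in order."""
    remaining = iter(outputs)
    return lambda prompt: next(remaining)
```

In practice `plan_once` would wrap the actual MLLM request; `make_fake_planner` exists only so the retry logic can be exercised without network access. A check like this catches structurally broken layouts, but it cannot detect a caption that is well-formed yet attached to the wrong region.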

Ling Yang · Answer 2 · Wed Jan 24 2024 21:48:33 GMT+0800 (China Standard Time)

We think the main problem may be that your text prompt is ambiguous, because the girl and the man are both on the right. Another possible reason is that you did not set the variable "use_base"; this setting is critical for generating different objects of the same class, as discussed in the ablation study of our paper.

Meher Shashwat Nigam · Answer 3 · Thu Jan 25 2024 00:46:39 GMT+0800 (China Standard Time)

Hi @YangLing0818, thanks for the reply!
I realize the text is ambiguous, but it is one of the example prompts in your README - you might want to update that!

Also, I have been using the base prompt option with a weight of 0.2 (standard regional prompter settings). Thanks for the pointer.
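
For readers unfamiliar with the base prompt weight, here is a rough illustration of what a value of 0.2 means, assuming the weight acts as a simple linear blend between base-prompt features and regional-prompt features. This is an assumption made for illustration; `blend_with_base` is a hypothetical helper and does not reproduce the RPG or regional prompter implementation.

```python
def blend_with_base(base: list, regional: list, base_ratio: float = 0.2) -> list:
    """Linearly mix two equal-length feature vectors.

    Assumed behaviour: base_ratio of the result comes from the base prompt
    and (1 - base_ratio) from the regional prompt.
    """
    if not 0.0 <= base_ratio <= 1.0:
        raise ValueError("base_ratio must be between 0 and 1")
    if len(base) != len(regional):
        raise ValueError("base and regional must have the same length")
    return [
        base_ratio * b + (1.0 - base_ratio) * r
        for b, r in zip(base, regional)
    ]


# Toy vectors standing in for per-region features.
mixed = blend_with_base([1.0, 0.0], [0.0, 1.0], base_ratio=0.2)
```

Under this assumption, a weight of 0.2 leaves each region dominated by its own caption while still sharing some global context from the base prompt.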