AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI


[Feature Request]: Paint with words

nagolinc opened this issue · comments

commented

Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

What would your feature do?

Implement Paint with Words (eDiffi-style image generation from a colour mask plus labelled prompt segments)

Proposed workflow

  1. Go to (img2img)
  2. Upload mask
  3. Add labels
  4. Get result

Additional information

No response

commented

That looks very interesting. It basically allows you to compose an entire image with colored masks.
Something similar is possible with inpainting, but you'd need to do it in many steps and with much worse performance.

commented

The examples look like absolute dogshit; I'm going to be very disappointed if that turns out to be a good demo of a bad system rather than the opposite. Interesting though.

commented

The results from the paper look much better.
[screenshot: example results from the eDiffi paper]

commented

Yes, those are so drastically better and more cohesive that it makes me doubt a version built on SD is going to be effective.

Even if those two results look slightly weird, it still works somewhat (even comparable to Make-A-Scene). However, there's the question of how you would draw a mask with words assigned to it in Gradio.

commented

There's a colour canvas drawing facility in Gradio; it's just disabled by default, as it was breaking layouts and generally misbehaving. From there you can get the unique colours and ask for text tagging. The new callback hooks required into the model are going to need very convincing and powerful results, though.
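A minimal sketch of that colour-extraction step, assuming Gradio 3.x's color-sketch tool (the wiring and function names here are illustrative, not the webui's actual code):

import gradio as gr
import numpy as np
from PIL import Image

def unique_colors(mask: Image.Image, max_colors: int = 16):
    """Return the distinct RGB colours painted onto the mask."""
    if mask is None:
        return []
    arr = np.asarray(mask.convert("RGB")).reshape(-1, 3)
    colors = np.unique(arr, axis=0)
    # A hand-painted mask should only contain a handful of flat colours;
    # anti-aliased strokes would need quantisation first.
    return [tuple(int(c) for c in row) for row in colors[:max_colors]]

with gr.Blocks() as demo:
    # tool="color-sketch" is the canvas facility mentioned above (Gradio 3.x)
    canvas = gr.Image(source="upload", tool="color-sketch", type="pil")
    found = gr.JSON(label="Unique colours (to be tagged with text)")
    canvas.change(unique_colors, inputs=canvas, outputs=found)

demo.launch()

The text-tagging half would then just be one Textbox per colour generated from that list.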

The current implementation is definitely not working how we'd expect it to (check the "Comparison" section at the bottom of this comment).

I've uploaded some results if anyone wants to experiment with me.

labeled and unlabeled are the input images.
w0.4_log1p(sigma)_maxnorm is the above repo with stock settings.

The implementation in the paper was found empirically, so it's likely we can also find a good configuration by simply playing around.
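For context, the technique boils down to adding a per-token bias to the cross-attention logits wherever that token's colour is painted. A self-contained sketch of the idea, not the repo's actual code (shapes and names are illustrative; the bias uses the w * log(1+sigma) * max(QK) form referenced above):

import math
import torch

def paint_with_words_attention(q, k, v, token_masks, w=0.4, sigma=1.0):
    """q: (pixels, dim) image queries; k, v: (tokens, dim) text keys/values.
    token_masks: (pixels, tokens), 1 where that token's colour is painted."""
    scores = q @ k.T / math.sqrt(q.shape[-1])            # (pixels, tokens)
    # Additive bias: strongest early in sampling (large sigma) and scaled by
    # the magnitude of the raw scores -- the "maxnorm" variant named above.
    bias = w * math.log1p(sigma) * scores.abs().max() * token_masks
    return torch.softmax(scores + bias, dim=-1) @ v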


Comparison

Input: "A dramatic oil painting of a road"
paint-with-words-sd with stock settings (prompt and image): [image]
paint-with-words-sd with only prompt (no image): [image]

Image Colors/Tokens
# Each benchmark setting maps the flat RGB colours in the mask image to a
# "prompt segment,weight" string; the weight scales that segment's
# cross-attention inside its painted region.
EXAMPLE_SETTING_1 = {
    "color_context": {
        ( 48, 167,  26): "purple trees,1.0",
        (115, 232, 103): "abandoned city,1.0",
        (100, 121, 135): "road,1.0",
        (133,  94, 253): "grass,1.0",
        (  1,  47,  71): "magical portal,1.0",
        ( 38, 192, 212): "starry night,1.0",
    },
    "color_map_img_path": "benchmark/unlabeled/A dramatic oil painting of a road.png",
    "input_prompt": "A dramatic oil painting of a road from a magical portal to an abandoned city with purple trees and grass in a starry night.",
    "output_dir_path": "benchmark/example_1",
}

EXAMPLE_SETTING_2 = {
    "color_context": {
        (161, 160, 173): "A large red moon,1.0",
        ( 79,  18,  96): "Bats,1.0",
        ( 82, 170,  20): "sky,1.0",
        (  0, 232, 126): "an evil pumpkin,1.0",
        (180,   0, 137): "zombies,1.0",
        (129,  65,   0): "tombs,1.0",
    },
    "color_map_img_path": "benchmark/unlabeled/A Halloween scene of an evil pumpkin.png",
    "input_prompt": "A Halloween scene of an evil pumpkin. A large red moon in the sky. Bats are flying and zombies are walking out of tombs. Highly detailed fantasy art.",
    "output_dir_path": "benchmark/example_2",
}

EXAMPLE_SETTING_3 = {
    "color_context": {
        (  7, 192, 152): "dark cellar,1.0",
        ( 81,  31,  97): "monster,1.0",
        ( 71, 132,   2): "teddy bear,1.0",
        ( 32, 115, 189): "table,1.0",
        ( 70,  53, 108): "dungeons and dragons,1.0",
    },
    "color_map_img_path": "benchmark/unlabeled/A monster and a teddy brear playing dungeons and dragons.png",
    "input_prompt": "A monster and a teddy bear playing dungeons and dragons around a table in a dark cellar. High quality fantasy art.",
    "output_dir_path": "benchmark/example_3",
}

EXAMPLE_SETTING_4 = {
    "color_context": {
        (138,  48,  39): "rabbit mage,1.0",
        ( 50,  32, 211): "fire ball,1.0",
        (126, 200, 100): "clouds,1.0",
    },
    "color_map_img_path": "benchmark/unlabeled/A rabbit mage standing on clouds casting a fireball.png",
    "input_prompt": "A highly detailed digital art of a rabbit mage standing on clouds casting a fire ball.",
    "output_dir_path": "benchmark/example_4",
}

EXAMPLE_SETTING_5 = {
    "color_context": {
        (157, 187, 242): "rainbow beams,1.0",
        ( 27, 165, 234): "forest,1.0",
        ( 57, 244,  30): "A red Ferrari car,1.0",
        (151, 138,  41): "gravel road,1.0",
    },
    "color_map_img_path": "benchmark/unlabeled/A red Ferrari car driving on a gravel road.png",
    "input_prompt": "A red Ferrari car driving on a gravel road in a forest with rainbow beams in the distance.",
    "output_dir_path": "benchmark/example_5",
}

EXAMPLE_SETTING_6 = {
    "color_context": {
        (123, 141, 146): "bar,1.0",
        ( 90, 119,  35): "red boxing gloves,1.0",
        ( 48, 167,  26): "blue boxing gloves,1.0",
        ( 10, 216, 129): "A squirrel,1.0",
        ( 72,  38,  31): "a squirrel,1.0",
    },
    "color_map_img_path": "benchmark/unlabeled/A squirrel and a squirrel with boxing gloves fighting in a bar.png",
    "input_prompt": "A squirrel with red boxing gloves and a squirrel with blue boxing gloves fighting in a bar.",
    "output_dir_path": "benchmark/example_6",
}
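For anyone reproducing these locally, each color_context above is straightforward to turn into per-phrase weight maps; a minimal sketch, independent of the repo (the helper is made up for illustration):

import numpy as np
from PIL import Image

def weight_maps(setting):
    """Build one float weight map per labelled phrase from a flat-colour mask."""
    img = np.asarray(Image.open(setting["color_map_img_path"]).convert("RGB"))
    maps = {}
    for rgb, value in setting["color_context"].items():
        phrase, weight = value.rsplit(",", 1)
        region = np.all(img == np.array(rgb), axis=-1).astype(np.float32)
        maps[phrase] = region * float(weight)
    return maps

maps = weight_maps(EXAMPLE_SETTING_1)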

@nagolinc

Here's a more direct comparison (3rd column is stable-diffusion + paint_with_words, 4th column is plain stable-diffusion).
This is the best of 4 attempts for each image.


[image: 6-row comparison grid]


I can definitely see merit in adding this feature.

  • In row 1, stable-diffusion forgets the purple trees and the portal without guidance.
  • In row 2, none of the images produced zombies, and only one image had a pumpkin in the centre.
  • In row 3, stable-diffusion failed to produce a separate monster and teddy bear; it would always be one or the other, or some weird hybrid creature.
  • In row 4, stable-diffusion failed to generate the clouds in all attempts.
  • In row 5, all of the stable-diffusion rainbows were vertical for some reason? I guess horizontal rainbows don't happen in real life at such a low height.
  • In row 6, stable-diffusion consistently forgot about the bar in the background, and in one attempt was missing a squirrel. Both methods failed to produce a blue boxing glove, however, which is surprising. I suspect it's an implementation issue, since I don't see anything to separate squirrel A from squirrel B in the code.

(Also, these samples were done without any weighting at all for prompt tokens or paint-with-words tokens. If you reduce the weight of background elements and increase the weight of unique elements, I'm sure it'll work even better; a concrete example follows below.)
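Concretely, that weighting is the number after the comma in each color_context entry; a hypothetical re-weighting of EXAMPLE_SETTING_1 along those lines:

# Hypothetical re-weighting of EXAMPLE_SETTING_1: lower the weight of
# backdrop elements, raise it for the elements vanilla SD keeps dropping.
# (Values are illustrative, not tested.)
reweighted_context = {
    ( 48, 167,  26): "purple trees,1.5",
    (115, 232, 103): "abandoned city,1.0",
    (100, 121, 135): "road,1.0",
    (133,  94, 253): "grass,0.5",
    (  1,  47,  71): "magical portal,2.0",
    ( 38, 192, 212): "starry night,0.5",
}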

Here is a solution, but it would need to be rewritten to integrate it or turn it into an extension.

@mykeehu It was linked in the initial issue description.

Wow @CookiePPP thank you so much for the comparisons! I would be so happy if you let me use your benchmarks on my paint-with-words repo as well, do you mind if I do?

@cloneofsimo
Feel free to use anything I posted on this thread.


Thank you so much! I will certainly credit you when I add some of these materials!
And another question: were these generated with the default (0.4 * log(1+sigma) * max(QK)) values?

@cloneofsimo
These were generated with 2.0 * log(1+sigma) * std(QK).
I've only tested 14 combinations, so be aware this configuration is probably still not the best that can be found.
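In code, the stock and the tested configuration differ only in the constant and in the statistic taken over the raw score matrix QK (a sketch under that reading; the function names are illustrative):

import math
import torch

def bias_scale_maxnorm(qk: torch.Tensor, sigma: float, w: float = 0.4):
    # stock setting: 0.4 * log(1 + sigma) * max(QK)
    return w * math.log1p(sigma) * qk.max()

def bias_scale_std(qk: torch.Tensor, sigma: float, w: float = 2.0):
    # the setting above: 2.0 * log(1 + sigma) * std(QK)
    return w * math.log1p(sigma) * qk.std()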

Awesome! Thank you so much for sharing that information as well! I agree 100% that a much better configuration is certainly possible, as the model structure differs from eDiffi's.

@CookiePPP Based on your findings, I've added a user-defined weight-scaling function, as well as some of your findings, to my repo. I hope this feature gets added to A1111's repo as well.

[image: compare_std]

I would also be happy if someone wrote this as an extension; with so many models, label sets, and styles, it would really expand the possibilities of SD. I suppose the interesting part is how to do the color-label table, because it has to be passed to the generator, and a simple prompt would not be enough. I also wonder how much can be solved with scripts alone, as in the case of the multi-prompt scripts. Unfortunately I'm not a programmer who could rewrite it; I'm just looking forward to it and brainstorming.

commented

The eDiffi ones are painfully good. Nvidia simply has an advantage: they have near-unlimited GPU processing power, and it only costs them a bit of energy to use it.

commented

One idea for improvement:
What if we generated unique noise for each of the word zones?
Then, when re-rolling, we could choose which words we want to re-roll and which should stay.

So if the D&D table or the blue gloves are wrong, re-roll only the noise in that area and keep the other noise zones identical.
Of course, that would mean quite a few adaptations in the UI, as we'd need a flexible number of noise inputs mapped to all the paint-words, with checkboxes for "re-roll".
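A sketch of what that could look like, assuming each zone is a boolean mask over the initial latent (all names hypothetical):

import torch

def compose_zone_noise(zone_masks, zone_seeds, shape):
    """One deterministic noise field per zone; re-rolling a zone means
    changing only that zone's seed. Assumes the masks tile the latent
    without overlapping."""
    noise = torch.zeros(shape)
    for mask, seed in zip(zone_masks, zone_seeds):
        gen = torch.Generator().manual_seed(seed)
        noise += mask * torch.randn(shape, generator=gen)
    return noise

Re-rolling the blue gloves would then just mean bumping that zone's seed while leaving the others untouched.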

Well, eDiffi does literally use three different conditionings and something like 5 times more parameters, so...

Just in case anyone is interested, there is a LOT of room for improvement in my implementation:

  1. As @CookiePPP mentioned above, there is no separation between instances of the same words.
  2. I don't think these hard-coded 0-1 weights are good, because some words that are only a tiny bit different get entirely ignored; something like nn-based word similarity could build better, continuous cross-attention maps (rough sketch below).

If you are going to rebuild this feature into A1111's, there are probably better ways to construct cross-attention weights than mine. (I will be working to get it right eventually, though; my repo got way more attention than it deserves lol)
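A sketch of what point 2 might look like, assuming per-token text embeddings are already at hand (names illustrative):

import torch.nn.functional as F

def soft_token_weights(phrase_emb, token_embs, floor=0.0):
    """Continuous per-token weights from cosine similarity, instead of an
    all-or-nothing string match between mask labels and prompt tokens."""
    sims = F.cosine_similarity(phrase_emb.unsqueeze(0), token_embs, dim=-1)
    return sims.clamp(min=floor)  # shape: (n_tokens,)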


I think the per-zone noise idea above is a very good one. It might work and it would be easy to implement. If it works, I'll add this feature in the future as well.

I can help with implementing an extension; I just need instructions on which methods would need to be changed. I have some ML literacy, so just knowing where the interaction occurs would be enough for me to try an implementation with LDM.

I'd love to know what the status of the development of this extension is, because I'm really looking forward to it!

I've tried to collect what needs to be done here, but I'm still trying to understand how the webUI and all the related tech work, so this is probably just a vague draft. Can someone fix or extend this, please?

  • The current version of the paint-with-words-sd code doesn't handle ckpt files natively, only after pre-conversion. Something needs to be done here so it can work with ckpt files directly. Is this complicated?

  • The Gradio UI doesn't have a good canvas editor with labeling. It's possible to create a new component for Gradio using the Svelte framework; I've found a canvas drawing example here which could serve as a basis. But in another Reddit thread someone mentioned a library called konvajs, which is mostly dependency-free and can be integrated with Svelte rather easily. I've found an example here which allows drawing arbitrarily shaped labels over an image; this is close to what we need.

  • The whole thing can be implemented as an extension and appear as a whole new tab in the webui, like e.g. the image browser (rough skeleton below). Some part of the img2img code can probably be reused for this, if I'm not mistaken.
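For the extension route, the webui already exposes a callback for registering a new top-level tab; a rough skeleton of the pattern used by tab extensions such as the image browser (the actual PwW wiring is the hard part and is omitted):

# scripts/paint_with_words_tab.py inside the extension folder
import gradio as gr
from modules import script_callbacks

def on_ui_tabs():
    with gr.Blocks(analytics_enabled=False) as pww_tab:
        gr.Image(source="upload", tool="color-sketch", type="pil",
                 label="Colour mask")
        gr.Textbox(lines=6, label="One 'R,G,B: prompt segment' pair per line")
        gr.Button("Generate")
        # TODO: wire the button up to a PwW-patched sampler
    return [(pww_tab, "Paint with Words", "pww_tab")]

script_callbacks.on_ui_tabs(on_ui_tabs)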

@nistvan86 An alternative solution is what the openOutpaint extension does: it uses a standalone interface through the API, so it is not tied to Gradio.
Starting from img2img in Gradio could still be a good base, but you'd need a textbox or a tabular interface to enter each color code with its prompt next to it, line by line, to build the input array.
I'm not a programmer, just trying to find a solution for this through existing stuff.

@mykeehu An even better solution would be to avoid typing the same prompt elements multiple times and instead attach the color codes to sections of the prompt with a special syntax, like how you can currently (emphasize things):1.2.

But I'm not sure how well that could be implemented UX-wise. One solution could be, for example, to select a part of the prompt and then drag & drop it onto a colored shape on the canvas.
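A toy version of such a syntax, purely hypothetical (no such syntax exists in the webui today):

import re

# e.g. "a road from a {magical portal:#012f47} to an {abandoned city:#73e867}"
PATTERN = re.compile(r"\{([^:}]+):#([0-9a-fA-F]{6})\}")

def parse_colored_prompt(prompt):
    """Split a prompt into plain text plus (phrase, rgb) bindings."""
    bindings = []
    for phrase, hexcode in PATTERN.findall(prompt):
        rgb = tuple(int(hexcode[i:i + 2], 16) for i in (0, 2, 4))
        bindings.append((phrase.strip(), rgb))
    plain = PATTERN.sub(lambda m: m.group(1), prompt)
    return plain, bindings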

Or, when you choose a colour and start painting with it, the colour is added to the list. But how could the colour be changed afterwards? And if a colour is no longer needed, you'd want to delete it, so you could manage layers.

The more I think about it, a vector-graphics-based editor would probably serve the needs better here. An SVG, for example, can be shown in the browser easily, its DOM can be interacted with, and the SVG can be saved next to the image at a rather small size (or even embedded into the resulting PNG, so it can be restored the same way you can load back a config from an output).
It can also be altered afterwards: polygons can be moved or the draw order changed.
It can also be rendered at whatever size and format is needed on the Python side, with no need to upscale a raster image.
(e.g. svg.draw.js might work, with CairoSVG on the backend)
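Rasterising the SVG on the Python side is indeed a one-liner with CairoSVG, e.g.:

import cairosvg

# Render the label SVG at exactly the resolution the mask needs; flat fills
# survive rasterisation, so a unique-colour lookup still works afterwards.
cairosvg.svg2png(url="mask.svg", write_to="mask.png",
                 output_width=512, output_height=512)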

Maybe worth noting there's a fork of the paint-with-words repository which uses transformer pipelines: paint-with-words-pipelines.

I've created a minimal example on top of it which can be run on Windows in a venv. (see steps.txt included)
paint-with-words-2.zip


I got this to work with 6 GB of VRAM.
Weirdly enough, the original paint-with-words didn't work on my PC, and neither did the test.py in the zip,
but when I made the venv in the old paint folder, the runner.py file worked!

I've updated the paint-with-words extension at Paint with Word, combining ControlNet and Paint with Words (PwW).

See the results below:
[image: cn_pww_turtle]

One can also use pure PwW by setting the weight of ControlNet to 0.

@lwchen6309 Please check your version; I have a conflict with ControlNet. My bug report is here.

@mykeehu Thanks for reporting this bug. I've raised an issue here for further discussion.

Closing, as the extension mentioned above has been available for quite some time: https://github.com/lwchen6309/paint-with-words-sd