AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI


[Feature Request]: Paint with words

nagolinc opened this issue · comments

commented

Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

What would your feature do?

Implement Paint with Words (eDiffi-style image generation from a colour mask plus labelled prompt segments)

Proposed workflow

  1. Go to (img2img)
  2. Upload mask
  3. Add labels
  4. Get result

Additional information

No response

commented

That looks very interesting. It basically allows you to compose an entire image with colored masks.
Something similar is possible with inpainting, but you'd need to do it in many steps and with much worse performance.

commented

The examples look like absolute dogshit; I'm going to be very disappointed if that turns out to be a good demo of a bad system rather than the opposite. Interesting though.

commented

The results from the paper look much better.
[screenshot: example results from the eDiffi paper]

commented

Yes, those are so drastically better and more cohesive that it makes me doubt a version built on SD is going to be effective.

Even if those two results look slightly weird, it still works somewhat (even comparable to Make-A-Scene). However, there's the question of how you would draw a mask with words assigned to it in Gradio.

commented

There's a colour canvas drawing facility in Gradio; it's just disabled by default, as it was breaking layouts and generally misbehaving. From there you can get the unique colours and ask for text tagging. The new callback hooks required into the model are going to need very convincing and powerful results, though.
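A minimal sketch of that colour-extraction step, assuming Gradio 3.x's color-sketch tool (the wiring and function names here are illustrative, not the webui's actual code):

import gradio as gr
import numpy as np
from PIL import Image

def unique_colors(mask: Image.Image, max_colors: int = 16):
    """Return the distinct RGB colours painted onto the mask."""
    if mask is None:
        return []
    arr = np.asarray(mask.convert("RGB")).reshape(-1, 3)
    colors = np.unique(arr, axis=0)
    # A hand-painted mask should only contain a handful of flat colours;
    # anti-aliased strokes would need quantisation first.
    return [tuple(int(c) for c in row) for row in colors[:max_colors]]

with gr.Blocks() as demo:
    # tool="color-sketch" is the canvas facility mentioned above (Gradio 3.x)
    canvas = gr.Image(source="upload", tool="color-sketch", type="pil")
    found = gr.JSON(label="Unique colours (to be tagged with text)")
    canvas.change(unique_colors, inputs=canvas, outputs=found)

demo.launch()

The text-tagging half would then just be one Textbox per colour generated from that list.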

The current implementation is definitely not working how we'd expect it to (check the "Comparison" section at the bottom of this comment).

I've uploaded some results if anyone wants to experiment with me.

labeled and unlabeled are the input images.
w0.4_log1p(sigma)_maxnorm is the above repo with stock settings.

The implementation in the paper was found empirically, so it's likely we can also find a good configuration by simply playing around.
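For context, the technique boils down to adding a per-token bias to the cross-attention logits wherever that token's colour is painted. A self-contained sketch of the idea, not the repo's actual code (shapes and names are illustrative; the bias uses the w * log(1+sigma) * max(QK) form referenced above):

import math
import torch

def paint_with_words_attention(q, k, v, token_masks, w=0.4, sigma=1.0):
    """q: (pixels, dim) image queries; k, v: (tokens, dim) text keys/values.
    token_masks: (pixels, tokens), 1 where that token's colour is painted."""
    scores = q @ k.T / math.sqrt(q.shape[-1])            # (pixels, tokens)
    # Additive bias: strongest early in sampling (large sigma) and scaled by
    # the magnitude of the raw scores -- the "maxnorm" variant named above.
    bias = w * math.log1p(sigma) * scores.abs().max() * token_masks
    return torch.softmax(scores + bias, dim=-1) @ v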


Comparison

Input: "A dramatic oil painting of a road"
paint-with-words-sd with stock settings (prompt and image): [image]
paint-with-words-sd with only prompt (no image): [image]

Image Colors/Tokens
# Each benchmark setting maps the flat RGB colours in the mask image to a
# "prompt segment,weight" string; the weight scales that segment's
# cross-attention inside its painted region.
EXAMPLE_SETTING_1 = {
    "color_context": {
        ( 48, 167,  26): "purple trees,1.0",
        (115, 232, 103): "abandoned city,1.0",
        (100, 121, 135): "road,1.0",
        (133,  94, 253): "grass,1.0",
        (  1,  47,  71): "magical portal,1.0",
        ( 38, 192, 212): "starry night,1.0",
    },
    "color_map_img_path": "benchmark/unlabeled/A dramatic oil painting of a road.png",
    "input_prompt": "A dramatic oil painting of a road from a magical portal to an abandoned city with purple trees and grass in a starry night.",
    "output_dir_path": "benchmark/example_1",
}

EXAMPLE_SETTING_2 = {
    "color_context": {
        (161, 160, 173): "A large red moon,1.0",
        ( 79,  18,  96): "Bats,1.0",
        ( 82, 170,  20): "sky,1.0",
        (  0, 232, 126): "an evil pumpkin,1.0",
        (180,   0, 137): "zombies,1.0",
        (129,  65,   0): "tombs,1.0",
    },
    "color_map_img_path": "benchmark/unlabeled/A Halloween scene of an evil pumpkin.png",
    "input_prompt": "A Halloween scene of an evil pumpkin. A large red moon in the sky. Bats are flying and zombies are walking out of tombs. Highly detailed fantasy art.",
    "output_dir_path": "benchmark/example_2",
}

EXAMPLE_SETTING_3 = {
    "color_context": {
        (  7, 192, 152): "dark cellar,1.0",
        ( 81,  31,  97): "monster,1.0",
        ( 71, 132,   2): "teddy bear,1.0",
        ( 32, 115, 189): "table,1.0",
        ( 70,  53, 108): "dungeons and dragons,1.0",
    },
    "color_map_img_path": "benchmark/unlabeled/A monster and a teddy brear playing dungeons and dragons.png",
    "input_prompt": "A monster and a teddy bear playing dungeons and dragons around a table in a dark cellar. High quality fantasy art.",
    "output_dir_path": "benchmark/example_3",
}

EXAMPLE_SETTING_4 = {
    "color_context": {
        (138,  48,  39): "rabbit mage,1.0",
        ( 50,  32, 211): "fire ball,1.0",
        (126, 200, 100): "clouds,1.0",
    },
    "color_map_img_path": "benchmark/unlabeled/A rabbit mage standing on clouds casting a fireball.png",
    "input_prompt": "A highly detailed digital art of a rabbit mage standing on clouds casting a fire ball.",
    "output_dir_path": "benchmark/example_4",
}

EXAMPLE_SETTING_5 = {
    "color_context": {
        (157, 187, 242): "rainbow beams,1.0",
        ( 27, 165, 234): "forest,1.0",
        ( 57, 244,  30): "A red Ferrari car,1.0",
        (151, 138,  41): "gravel road,1.0",
    },
    "color_map_img_path": "benchmark/unlabeled/A red Ferrari car driving on a gravel road.png",
    "input_prompt": "A red Ferrari car driving on a gravel road in a forest with rainbow beams in the distance.",
    "output_dir_path": "benchmark/example_5",
}

EXAMPLE_SETTING_6 = {
    "color_context": {
        (123, 141, 146): "bar,1.0",
        ( 90, 119,  35): "red boxing gloves,1.0",
        ( 48, 167,  26): "blue boxing gloves,1.0",
        ( 10, 216, 129): "A squirrel,1.0",
        ( 72,  38,  31): "a squirrel,1.0",
    },
    "color_map_img_path": "benchmark/unlabeled/A squirrel and a squirrel with boxing gloves fighting in a bar.png",
    "input_prompt": "A squirrel with red boxing gloves and a squirrel with blue boxing gloves fighting in a bar.",
    "output_dir_path": "benchmark/example_6",
}
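For anyone reproducing these locally, each color_context above is straightforward to turn into per-phrase weight maps; a minimal sketch, independent of the repo (the helper is made up for illustration):

import numpy as np
from PIL import Image

def weight_maps(setting):
    """Build one float weight map per labelled phrase from a flat-colour mask."""
    img = np.asarray(Image.open(setting["color_map_img_path"]).convert("RGB"))
    maps = {}
    for rgb, value in setting["color_context"].items():
        phrase, weight = value.rsplit(",", 1)
        region = np.all(img == np.array(rgb), axis=-1).astype(np.float32)
        maps[phrase] = region * float(weight)
    return maps

maps = weight_maps(EXAMPLE_SETTING_1)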

@nagolinc

Here's a more direct comparison (3rd column is stable-diffusion + paint_with_words, 4th column is plain stable-diffusion).
This is the best of 4 attempts for each image.


[image: 6-row comparison grid]


I can definitely see merit in adding this feature.

  • In row 1, stable-diffusion forgets the purple trees and the portal without guidance.
  • In row 2, none of the images produced zombies, and only one image had a pumpkin in the centre.
  • In row 3, stable-diffusion failed to produce a separate monster and teddy bear; it would always be one or the other, or some weird hybrid creature.
  • In row 4, stable-diffusion failed to generate the clouds in all attempts.
  • In row 5, all of the stable-diffusion rainbows were vertical for some reason? I guess horizontal rainbows don't happen in real life at such a low height.
  • In row 6, stable-diffusion consistently forgot about the bar in the background, and in one attempt was missing a squirrel. Both methods failed to produce a blue boxing glove, however, which is surprising. I suspect it's an implementation issue, since I don't see anything to separate squirrel A from squirrel B in the code.

(Also, these samples were done without any weighting at all for prompt tokens or paint-with-words tokens. If you reduce the weight of background elements and increase the weight of unique elements, I'm sure it'll work even better; a concrete example follows below.)
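Concretely, that weighting is the number after the comma in each color_context entry; a hypothetical re-weighting of EXAMPLE_SETTING_1 along those lines:

# Hypothetical re-weighting of EXAMPLE_SETTING_1: lower the weight of
# backdrop elements, raise it for the elements vanilla SD keeps dropping.
# (Values are illustrative, not tested.)
reweighted_context = {
    ( 48, 167,  26): "purple trees,1.5",
    (115, 232, 103): "abandoned city,1.0",
    (100, 121, 135): "road,1.0",
    (133,  94, 253): "grass,0.5",
    (  1,  47,  71): "magical portal,2.0",
    ( 38, 192, 212): "starry night,0.5",
}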

Here is a solution, but it would need to be rewritten to integrate it or turn it into an extension.

@mykeehu It was linked in the initial issue description.

Wow @CookiePPP thank you so much for the comparisons! I would be so happy if you let me use your benchmarks on my paint-with-words repo as well, do you mind if I do?

@cloneofsimo
Feel free to use anything I posted on this thread.


Thank you so much! I will certainly credit you when I add some of these materials!
And another question: were these generated with the default (0.4 * log(1+sigma) * max(QK)) values?

@cloneofsimo
These were generated with 2.0 * log(1+sigma) * std(QK).
I've only tested 14 combinations, so be aware this configuration is probably still not the best that can be found.
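In code, the stock and the tested configuration differ only in the constant and in the statistic taken over the raw score matrix QK (a sketch under that reading; the function names are illustrative):

import math
import torch

def bias_scale_maxnorm(qk: torch.Tensor, sigma: float, w: float = 0.4):
    # stock setting: 0.4 * log(1 + sigma) * max(QK)
    return w * math.log1p(sigma) * qk.max()

def bias_scale_std(qk: torch.Tensor, sigma: float, w: float = 2.0):
    # the setting above: 2.0 * log(1 + sigma) * std(QK)
    return w * math.log1p(sigma) * qk.std()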

Awesome! Thank you so much for sharing that information as well! I agree 100% that a much better configuration is certainly possible, as the model structure differs from eDiffi's.

@CookiePPP Based on your findings, I've added a user-defined weight-scaling function, as well as some of your findings, to my repo. I hope this feature gets added to A1111's repo as well.

[image: compare_std]

I would also be happy if someone wrote this as an extension; with so many models, label sets, and styles, it would really expand the possibilities of SD. I suppose the interesting part is how to do the color-label table, because it has to be passed to the generator, and a simple prompt would not be enough. I also wonder how much can be solved with scripts alone, as in the case of the multi-prompt scripts. Unfortunately I'm not a programmer who could rewrite it; I'm just looking forward to it and brainstorming.

commented

The eDiffi ones are painfully good. Nvidia simply has an advantage: they have near-unlimited GPU processing power, and it only costs them a bit of energy to use it.

commented

One idea for improvement:
What if we generated unique noise for each of the word zones?
Then, when re-rolling, we could choose which words we want to re-roll and which should stay.

So if the D&D table or the blue gloves are wrong, re-roll only the noise in that area and keep the other noise zones identical.
Of course, that would mean quite a few adaptations in the UI, as we'd need a flexible number of noise inputs mapped to all the paint-words, with checkboxes for "re-roll".
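A sketch of what that could look like, assuming each zone is a boolean mask over the initial latent (all names hypothetical):

import torch

def compose_zone_noise(zone_masks, zone_seeds, shape):
    """One deterministic noise field per zone; re-rolling a zone means
    changing only that zone's seed. Assumes the masks tile the latent
    without overlapping."""
    noise = torch.zeros(shape)
    for mask, seed in zip(zone_masks, zone_seeds):
        gen = torch.Generator().manual_seed(seed)
        noise += mask * torch.randn(shape, generator=gen)
    return noise

Re-rolling the blue gloves would then just mean bumping that zone's seed while leaving the others untouched.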

Well, eDiffi does literally use three different conditionings and something like 5 times more parameters, so...

Just in case anyone is interested, there is a LOT of room for improvement in my implementation:

  1. As @CookiePPP mentioned above, there is no separation between instances of the same words.
  2. I don't think these hard-coded 0-1 weights are good, because some words that are only a tiny bit different get entirely ignored; something like nn-based word similarity could build better, continuous cross-attention maps (rough sketch below).

If you are going to rebuild this feature into A1111's, there are probably better ways to construct cross-attention weights than mine. (I will be working to get it right eventually, though; my repo got way more attention than it deserves lol)
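A sketch of what point 2 might look like, assuming per-token text embeddings are already at hand (names illustrative):

import torch.nn.functional as F

def soft_token_weights(phrase_emb, token_embs, floor=0.0):
    """Continuous per-token weights from cosine similarity, instead of an
    all-or-nothing string match between mask labels and prompt tokens."""
    sims = F.cosine_similarity(phrase_emb.unsqueeze(0), token_embs, dim=-1)
    return sims.clamp(min=floor)  # shape: (n_tokens,)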


I think the per-zone noise idea above is a very good one. It might work and it would be easy to implement. If it works, I'll add this feature in the future as well.

I can help with implementing an extension; I just need instructions on which methods would need to be changed. I have some ML literacy, so just knowing where the interaction occurs would be enough for me to try an implementation with LDM.

I'd love to know what the status of the development of this extension is, because I'm really looking forward to it!

I've tried to collect what needs to be done here, but I'm still trying to understand how the webUI and all the related tech work, so this is probably just a vague draft. Can someone fix or extend this, please?

  • The current version of the paint-with-words-sd code doesn't handle ckpt files natively, only after pre-conversion. Something needs to be done here so it can work with ckpt files directly. Is this complicated?

  • The Gradio UI doesn't have a good canvas editor with labeling. It's possible to create a new component for Gradio using the Svelte framework; I've found a canvas drawing example here which could serve as a basis. But in another Reddit thread someone mentioned a library called konvajs, which is mostly dependency-free and can be integrated with Svelte rather easily. I've found an example here which allows drawing arbitrarily shaped labels over an image; this is close to what we need.

  • The whole thing can be implemented as an extension and appear as a whole new tab in the webui, like e.g. the image browser (rough skeleton below). Some part of the img2img code can probably be reused for this, if I'm not mistaken.
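For the extension route, the webui already exposes a callback for registering a new top-level tab; a rough skeleton of the pattern used by tab extensions such as the image browser (the actual PwW wiring is the hard part and is omitted):

# scripts/paint_with_words_tab.py inside the extension folder
import gradio as gr
from modules import script_callbacks

def on_ui_tabs():
    with gr.Blocks(analytics_enabled=False) as pww_tab:
        gr.Image(source="upload", tool="color-sketch", type="pil",
                 label="Colour mask")
        gr.Textbox(lines=6, label="One 'R,G,B: prompt segment' pair per line")
        gr.Button("Generate")
        # TODO: wire the button up to a PwW-patched sampler
    return [(pww_tab, "Paint with Words", "pww_tab")]

script_callbacks.on_ui_tabs(on_ui_tabs)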

@nistvan86 An alternative solution is what the openOutpaint extension does: it uses a standalone interface through the API, so it is not tied to Gradio.
Starting from img2img in Gradio could still be a good base, but you'd need a textbox or a tabular interface to enter each color code with its prompt next to it, line by line, to build the input array.
I'm not a programmer, just trying to find a solution for this through existing stuff.

@mykeehu An even better solution would be to avoid typing the same prompt elements multiple times and instead attach the color codes to sections of the prompt with a special syntax, like how you can currently (emphasize things):1.2.

But I'm not sure how well that could be implemented UX-wise. One solution could be, for example, to select a part of the prompt and then drag & drop it onto a colored shape on the canvas.
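A toy version of such a syntax, purely hypothetical (no such syntax exists in the webui today):

import re

# e.g. "a road from a {magical portal:#012f47} to an {abandoned city:#73e867}"
PATTERN = re.compile(r"\{([^:}]+):#([0-9a-fA-F]{6})\}")

def parse_colored_prompt(prompt):
    """Split a prompt into plain text plus (phrase, rgb) bindings."""
    bindings = []
    for phrase, hexcode in PATTERN.findall(prompt):
        rgb = tuple(int(hexcode[i:i + 2], 16) for i in (0, 2, 4))
        bindings.append((phrase.strip(), rgb))
    plain = PATTERN.sub(lambda m: m.group(1), prompt)
    return plain, bindings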

Or, when you choose a colour and start painting with it, the colour is added to the list. But how could the colour be changed afterwards? And if a colour is no longer needed, you'd want to delete it, so you could manage layers.

The more I think about it, a vector-graphics-based editor would probably serve the needs better here. An SVG, for example, can be shown in the browser easily, its DOM can be interacted with, and the SVG can be saved next to the image at a rather small size (or even embedded into the resulting PNG, so it can be restored the same way you can load back a config from an output).
It can also be altered afterwards: polygons can be moved or the draw order changed.
It can also be rendered at whatever size and format is needed on the Python side, with no need to upscale a raster image.
(e.g. svg.draw.js might work, with CairoSVG on the backend)
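Rasterising the SVG on the Python side is indeed a one-liner with CairoSVG, e.g.:

import cairosvg

# Render the label SVG at exactly the resolution the mask needs; flat fills
# survive rasterisation, so a unique-colour lookup still works afterwards.
cairosvg.svg2png(url="mask.svg", write_to="mask.png",
                 output_width=512, output_height=512)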

Maybe worth noting there's a fork of the paint-with-words repository which uses transformer pipelines: paint-with-words-pipelines.

I've created a minimal example on top of it which can be run on Windows in a venv. (see steps.txt included)
paint-with-words-2.zip


I got this to work with 6 GB of VRAM.
Weirdly enough, the original paint-with-words didn't work on my PC, and neither did the test.py in the zip,
but when I made the venv in the old paint folder, the runner.py file worked!

I've updated the paint-with-words extension at Paint with Word, combining ControlNet and Paint with Words (PwW).

See the results below:
[image: cn_pww_turtle]

One can also use pure PwW by setting the weight of ControlNet to 0.

@lwchen6309 Please check your version; I have a conflict with ControlNet. My bug report is here.

@mykeehu Thanks for reporting this bug. I've raised an issue here for further discussion.

Closing, as the extension mentioned above has been available for quite some time: https://github.com/lwchen6309/paint-with-words-sd