damian0815 / compel

A prompting enhancement library for transformers-type text embedding systems

SDXL support

bghira opened this issue

Trying to use the provided example with SDXL causes an error in Diffusers:

<class 'diffusers.models.unet_2d_condition.UNet2DConditionModel'> has the config param `addition_embed_type` set to 'text_time' which requires the keyword argument `text_embeds` to be passed in

is there a way you've determined it might work yet?

interesting. i don't have access to the SDXL weights so can't really say anything, but yeah, it's sorta not surprising that it doesn't work. if you can get hold of the two separate text encoders from the two separate models, you could try making two compel instances (one for each) and pushing the same prompt through each, then concatenating before passing them on to the unet. but i'm just guessing.

if you'd like to write up a small example, i can try it out and break things til it works.

looks like you won't even have to concat the embeddings, so something like this ought to work (ref huggingface/diffusers#3859):

base_pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9")
refiner_pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-refiner-0.9")

compel_base = Compel(base_pipeline.tokenizer, base_pipeline.text_encoder)
compel_refiner = Compel(refiner_pipeline.tokenizer, refiner_pipeline.text_encoder)

prompt = "a cat playing with a ball in the forest"
embeds_base = compel_base(prompt)
embeds_refiner = compel_refiner(prompt)

images_base = base_pipeline(prompt_embeds=embeds_base, ...) # may need to be `text_embeds` instead of `prompt_embeds`
images_refined = refiner_pipeline(prompt_embeds=embeds_refiner, image=images_base.images) # may need to be `text_embeds`

Also happy to play around with this a bit to get it working

it would be excellent @patrickvonplaten if we (the royal we, as in, you guys :D) could update the call() function for SDXL to take the same parameter names, e.g. prompt_embeds and negative_prompt_embeds. it's a constant consistency issue that some pipelines have them plural and some don't, and i'd rather not need one more conditional 🙏

@damian0815 the problem is that the base has two text encoders.

ahh so it's not that simple. ok

I use this kind of wrapper class and have modified it to pull the Refiner's text encoder if it is found, so you can see how that works.

i will likely update this wrapper to automatically handle the scenario where there are two text encoders, but i'm not yet certain how to form the returned embeds or how to pass them into the pipeline.

import logging

from discord_tron_client.classes.app_config import AppConfig
from compel import Compel, ReturnedEmbeddingsType

config = AppConfig()  # Just a helper class that manages a json.

# Manipulating prompts for the pipeline.
class PromptManipulation:
    def __init__(self, pipeline, device, use_second_encoder_only: bool = False):
        if not config.enable_compel():
            return
        self.is_valid_pipeline(pipeline)
        self.pipeline = pipeline
        if (self.has_dual_text_encoders(pipeline) and not use_second_encoder_only):
            # SDXL Refiner and Base can both use the 2nd tokenizer/encoder.
            logging.debug(f'Initialising Compel prompt manager with dual encoders.')
            self.compel = Compel(
                tokenizer=[
                    self.pipeline.tokenizer,
                    self.pipeline.tokenizer_2
                ],
                text_encoder=[
                    self.pipeline.text_encoder,
                    self.pipeline.text_encoder_2
                ],
                truncate_long_prompts=True,
                returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED,
                requires_pooled=[
                    False,  # CLIP-L does not produce pooled embeds.
                    True    # CLIP-G produces pooled embeds.
                ]
            )
        elif (self.has_dual_text_encoders(pipeline) and use_second_encoder_only):
            # SDXL Refiner has ONLY the 2nd tokenizer/encoder, which needs to be the only one in Compel.
            logging.debug(f'Initialising Compel prompt manager with just the 2nd text encoder.')
            self.compel = Compel(
                tokenizer=self.pipeline.tokenizer_2,
                text_encoder=self.pipeline.text_encoder_2,
                truncate_long_prompts=True,
                returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED,
                requires_pooled=True
            )
        else:
            # Any other pipeline uses the first tokenizer/encoder.
            logging.debug(f'Initialising the Compel prompt manager with a single text encoder.')
            pipe_tokenizer = self.pipeline.tokenizer
            pipe_text_encoder = self.pipeline.text_encoder
            self.compel = Compel(
                tokenizer=pipe_tokenizer,
                text_encoder=pipe_text_encoder,
                truncate_long_prompts=True,
                returned_embeddings_type=ReturnedEmbeddingsType.LAST_HIDDEN_STATES_NORMALIZED,
            )
    def should_enable(self, pipeline, user_config: dict = None):
        if (type(pipeline).__name__ == "KandinskyV22Pipeline"):
            # KandinskyV22Pipeline doesn't use the prompt manager.
            return False
        if user_config is not None and "DeepFloyd" in user_config.get('model', ''):
            # Does not work for DeepFloyd.
            return False
        return True

    def has_dual_text_encoders(self, pipeline):
        return hasattr(pipeline, "text_encoder_2")

    def is_sdxl_refiner(self, pipeline):
        # SDXL Refiner has the 2nd text encoder, only.
        return pipeline.tokenizer is None and hasattr(pipeline, "tokenizer_2")

    def is_valid_pipeline(self, pipeline):
        if not hasattr(pipeline, "tokenizer") and not hasattr(
            pipeline, "tokenizer_2"
        ):
            raise Exception(
                f"Cannot use PromptManipulation on a model without a tokenizer."
            )

    def process_long_prompt(self, positive_prompt: str, negative_prompt: str):
        batch_size = config.maximum_batch_size()
        if self.has_dual_text_encoders(self.pipeline):
            logging.debug(f'Running dual encoder Compel pipeline for batch size {batch_size}.')
            # We need to make a list of positive_prompt * batch_size count.
            positive_prompt = [positive_prompt] * batch_size
            conditioning, pooled_embed = self.compel(positive_prompt)
            negative_prompt = [negative_prompt] * batch_size
            negative_conditioning, negative_pooled_embed = self.compel(negative_prompt)
        else:
            logging.debug(f'Running single encoder Compel pipeline.')
            conditioning = self.compel.build_conditioning_tensor(positive_prompt)
            negative_conditioning = self.compel.build_conditioning_tensor(negative_prompt)
        [
            conditioning,
            negative_conditioning,
        ] = self.compel.pad_conditioning_tensors_to_same_length(
            [conditioning, negative_conditioning]
        )
        if self.has_dual_text_encoders(self.pipeline):
            logging.debug(f'Returning pooled embeds along with positive/negative conditionings.')
            return conditioning, negative_conditioning, pooled_embed, negative_pooled_embed
        return conditioning, negative_conditioning
# Path: discord_tron_client/classes/image_manipulation/diffusion.py
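
For reference, here is a rough sketch of how I'd expect to use this wrapper with an SDXL pipeline. I'm not yet sure the kwarg names on the pipeline call are right - the pooled_prompt_embeds / negative_pooled_prompt_embeds names below are just my guess:

from diffusers import DiffusionPipeline
import torch

# example setup; the real pipeline is constructed elsewhere in the bot
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16
).to("cuda")

prompt_manager = PromptManipulation(pipeline, device="cuda")

# the dual-encoder path returns pooled embeds alongside the usual conditionings
conditioning, negative_conditioning, pooled, negative_pooled = prompt_manager.process_long_prompt(
    "a cat playing with a ball++ in the forest",
    "blurry, low quality",
)

images = pipeline(
    prompt_embeds=conditioning,
    negative_prompt_embeds=negative_conditioning,
    pooled_prompt_embeds=pooled,
    negative_pooled_prompt_embeds=negative_pooled,
    num_inference_steps=30,
).images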

Anyone got it working?

@damian0815 @patrickvonplaten it seems like num_images_per_prompt is incorrectly handled.

from compel import Compel, ReturnedEmbeddingsType
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9", use_safetensors=True, torch_dtype=torch.float16).to("cuda")
compel = Compel(truncate_long_prompts=False, tokenizer=[pipeline.tokenizer, pipeline.tokenizer_2] , text_encoder=[pipeline.text_encoder, pipeline.text_encoder_2], returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED, requires_pooled=[False, True])
# upweight "ball"
prompt = "a cat playing with a ball++ in the forest"
conditioning, pooled = compel(prompt)
# generate image
image = pipeline(prompt_embeds=conditioning, pooled_prompt_embeds=pooled, num_inference_steps=30, num_images_per_prompt=4).images[0]

output:

The size of tensor a (14400) must match the size of tensor b (3600) at non-singleton dimension 1

I am clever enough to realise that this is 3600 * 4 = 14400

I tried to make a [list] of prompts, as I saw that the tensors would be concatenated when they are submitted in this way. However the same error occurs.

if I produce compel([prompt] * batch_size) embeds and pass them with num_images_per_prompt=1 then I get a working result with multiple images! interesting. the other pipelines aren't like that, i assume the pipeline needs an issue report?
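
so for now the workaround looks roughly like this (reusing the pipeline, compel and prompt from the snippet above; batch size 4 is just an example):

batch_size = 4  # example batch size

# build one embedding per image up front instead of relying on num_images_per_prompt
conditioning, pooled = compel([prompt] * batch_size)

images = pipeline(
    prompt_embeds=conditioning,
    pooled_prompt_embeds=pooled,
    num_inference_steps=30,
    num_images_per_prompt=1,  # keep this at 1; the batching already lives in the embeds
).images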


Question regarding long prompts + negative prompts for SDXL, and the use of the Refiner in combination with Compel:

1: on build_conditioning_tensor and pad_conditioning_tensors_to_same_length

Here is some older SD1.5 code. Is pad_conditioning_tensors_to_same_length still needed? Would I need to chunk up long prompts somehow to prevent truncation with SDXL, or does truncate_long_prompts=False handle everything so that I don't have to do anything?

conditioning = compel.build_conditioning_tensor(prompt)
negative_conditioning = compel.build_conditioning_tensor(negative_prompt)
conditioning, negative_conditioning = compel.pad_conditioning_tensors_to_same_length([conditioning, negative_conditioning]) 

2: on Refiner Usage

Sorry if this has been stated already: the pooled prompt embeds seem to be needed for the Refiner too? I see some prototype code above by @bghira; I wonder whether it is working?

Is it possible to have a short piece of documentation on how to use the refiner + base model with very long prompts?

The below was just a naive guess from me and seems to work, using the tokenizer of the base model. Or should I use the one from the refiner, or doesn't it make a difference? The use of text_encoder_2=pipeline.text_encoder_2 in the refiner definition suggests that they are indeed the same:

from diffusers import DiffusionPipeline
import torch
from compel import Compel, ReturnedEmbeddingsType


use_refiner = True

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipeline.to("cuda")

refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=pipeline.text_encoder_2,
    vae=pipeline.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
refiner.to("cuda")

compel = Compel(truncate_long_prompts=False, tokenizer=[pipeline.tokenizer, pipeline.tokenizer_2] , text_encoder=[pipeline.text_encoder, pipeline.text_encoder_2], returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED, requires_pooled=[False, True])

compel_refiner = Compel(truncate_long_prompts=False, tokenizer=pipeline.tokenizer_2, text_encoder=pipeline.text_encoder_2, returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED, requires_pooled=True)

prompt = "a cat playing with a ball in the forest"  # example prompt
negative_prompt = "blurry, low quality"  # example negative prompt

conditioning, pooled = compel(prompt)
negative_conditioning, negative_pooled = compel(negative_prompt)

conditioning_refiner, refiner_pooled_positive = compel_refiner(prompt)
negative_conditioning_refiner, refiner_pooled_negative = compel_refiner(negative_prompt)


image_base = pipeline(prompt_embeds=conditioning, negative_prompt_embeds=negative_conditioning, pooled_prompt_embeds=pooled, negative_pooled_prompt_embeds=negative_pooled, num_inference_steps=30, output_type="latent" if use_refiner else "pil").images[0]

image = refiner(prompt_embeds=conditioning_refiner, negative_prompt_embeds=negative_conditioning_refiner, pooled_prompt_embeds=refiner_pooled_positive, negative_pooled_prompt_embeds=refiner_pooled_negative, num_inference_steps=30, image=image_base[None, :]).images[0]

@BEpresent i have updated my code there.

sorry @BEpresent i've been unable to get SDXL working on my local system and haven't found time to set up a vast.ai remote debugging environment. i pushed compel 2.0.1, which should mean pad_conditioning_tensors_to_same_length will at least produce embeddings of the right shape for the base model ([1, 77, 2048], or e.g. [1, 154, 2048] if you have a prompt that needs 150 tokens instead of 75). if the refiner uses the same shape as the base then that should just work.
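
so for your question 1, i'd expect something like this to be enough, reusing your compel instance and prompts from above (untested on my end, since i still can't run SDXL locally):

conditioning, pooled = compel(prompt)
negative_conditioning, negative_pooled = compel(negative_prompt)

# pad only the token-level embeds to the same sequence length;
# the pooled embeds are fixed-size and don't need padding
[conditioning, negative_conditioning] = compel.pad_conditioning_tensors_to_same_length(
    [conditioning, negative_conditioning]
)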

i just want to say to both of you though, @bghira and @BEpresent - i've been doing most of my experiments using SD2.1, and because it uses OpenCLIP, which is a more precise embedder, long prompts actually result in shittier generations; shorter, straightforward English sentences actually give better results.

Since SDXL uses both OpenCLIP and OpenAI CLIP in tandem, you might want to try being more direct with your prompt strings. rather than pooping out 10 million vague fuzzy tags, just write an English sentence describing the thing you want to see. OpenAI CLIP sucks at giving you that, but OpenCLIP is actually very good at it.

it's just got token bleed issues if you don't use prompt segmentation. i do not use long incoherent prompts, i use short ones, such as:

('the pope', 'dressed as ronald mcdonald').and(0.9, 0.95)

this tends to help overcome the imbalanced training data in the SDXL pile which ended up with heavier weights on certain subjects, making them inflexible.
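
in code there's nothing special about it - the segmented prompt goes straight through the same dual-encoder Compel instance as any other prompt, roughly like this (pipeline kwargs as in the snippets above):

prompt = "('the pope', 'dressed as ronald mcdonald').and(0.9, 0.95)"

# each segment is embedded separately and the embeddings are concatenated,
# weighted by the factors given to .and()
conditioning, pooled = compel(prompt)

image = pipeline(
    prompt_embeds=conditioning,
    pooled_prompt_embeds=pooled,
    num_inference_steps=30,
).images[0]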

ahh i see @bghira so really what you're needing is the .and() support rather than truncating per se. that makes complete sense.

i think i can close this..?