Compel influencing lora_scale when using LoRA in Diffusers
pietrobolcato opened this issue
Describe the bug
When using Compel and prompt embeddings, and performing inference with LoRA weights loaded, `lora_scale` doesn't work as expected. Specifically, if I do the following actions in this order:
1. Create a SD pipeline
2. Load a model
3. Load a LoRA
4. Generate an image with `lora_scale` = 1
5. Generate an image with `lora_scale` = 0
6. Generate an image with `lora_scale` = 1
7. Generate an image with `lora_scale` = 1
The image generated in Step 6 is different from the image generated in Step 4. All the following images, if `lora_scale` is not changed again, remain consistent. Basically, it somehow takes one generation to get back on track and then stays consistent. See the attached plot:
We can see that Image 3 is different from Image 1, and that from Image 4 on, as long as `lora_scale` doesn't change, the output remains consistent.
This doesn't happen when not using compel and prompt embeddings:
Reproduction
I prepared a colab that shows the issue, accessible here: https://colab.research.google.com/drive/1ciFZPcvMsNZiZOpfHtLih5V6OyRh8Z6d?usp=sharing
System Info
diffusers[torch]==0.18.1
transformers==4.30.2
compel==1.2.1
strange.
what i'm imagining is that the `prompt=` kwarg to the pipeline involves some kind of cleanup/init that you don't benefit from when passing `prompt_embeds`.
what happens if you take compel out of the equation but still use `prompt_embeds`? i.e. push the prompt through `pipe.tokenizer`, then take the output of that and push it through `pipe.text_encoder`, and then pass that as `prompt_embeds`?
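A minimal sketch of that experiment, with a hypothetical helper (`encode_prompt` is not a diffusers API; `tokenizer` and `text_encoder` stand in for `pipe.tokenizer` / `pipe.text_encoder`):

```python
def encode_prompt(tokenizer, text_encoder, prompt):
    """Tokenize `prompt` and run it through the text encoder, mirroring
    what the pipeline does internally for the `prompt=` argument."""
    inputs = tokenizer(
        prompt,
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    # Index [0] selects the last hidden state, which is what the SD
    # pipeline forwards to the UNet as `prompt_embeds`.
    return text_encoder(inputs.input_ids)[0]

# usage (assuming a loaded StableDiffusionPipeline `pipe`):
# prompt_embeds = encode_prompt(pipe.tokenizer, pipe.text_encoder, "a cat photo")
# image = pipe(prompt_embeds=prompt_embeds).images[0]
```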
The lora scale value is provided at image-generation time, which isn't going to work for custom prompt embeds. Your Image 2 with Compel is also wrong (the text encoder weights are still scaled to 1.0 from the previous generation).
Adding this line before using Compel fixes the issue:
pipeline._lora_scale = lora_scale
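To illustrate the mechanism, here is a toy model of the call order (all names invented, not real diffusers code): the embeddings are built before the pipeline call, but the LoRA scale is only applied during the call, so each generation's embeddings carry the previous scale unless the scale is synced first.

```python
class ToyPipeline:
    """Toy stand-in for a diffusers pipeline with a LoRA'd text encoder."""
    def __init__(self):
        self._lora_scale = 1.0  # scale currently applied to the text encoder

    def encode(self, prompt):
        # Compel-style: embeddings reflect whatever scale is set right now.
        return (prompt, self._lora_scale)

    def __call__(self, prompt_embeds, lora_scale):
        # The UNet sees `lora_scale` immediately, but the embeddings were
        # already built; the text encoder scale updates only afterwards.
        image = ("image", prompt_embeds, lora_scale)
        self._lora_scale = lora_scale
        return image

def generate(pipe, scale, use_fix=False):
    if use_fix:
        pipe._lora_scale = scale  # the one-line workaround from above
    embeds = pipe.encode("a cat")
    return pipe(embeds, lora_scale=scale)

# Without the fix, generation 3 still carries the scale from generation 2:
pipe = ToyPipeline()
imgs = [generate(pipe, s) for s in (1.0, 0.0, 1.0, 1.0)]
assert imgs[2] != imgs[0] and imgs[3] == imgs[0]  # the reported one-step lag

# With the scale synced before encoding, generation 3 matches generation 1:
pipe = ToyPipeline()
imgs = [generate(pipe, s, use_fix=True) for s in (1.0, 0.0, 1.0, 1.0)]
assert imgs[2] == imgs[0]
```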
@pietrobolcato is this still an issue?
which isn't going to work for custom prompt embeds
If I load multiple LoRAs like this:
self.pipe.load_lora_weights(adapter_id_pixel, adapter_name="pixel")
self.pipe.load_lora_weights(adapter_id_chalkboardbrawing, adapter_name="chalkboardbrawing")
self.pipe.set_adapters(["pixel", "chalkboardbrawing"], adapter_weights=[1.0, 1.0])
and then generate the image:
sdout_image = self.pipe(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    num_inference_steps=num_inference_steps,
    num_images_per_prompt=num_images_per_prompt,
    generator=generator,
    height=height,
    width=width,
    guidance_scale=guidance_scale,
    controlnet_conditioning_scale=controlnet_conditioning_scale,
    # controlnet_kwargs={"image": can_image},
    cross_attention_kwargs={"scale": lora_scale},
    control_guidance_start=control_guidance_start,
    control_guidance_end=control_guidance_end,
    clip_skip=2,
    image=can_image,
).images[0]
how do I deal with this case, since there are multiple LoRAs?
And one more question: how does Compel affect LoRA trigger words in the prompt, and textual-inversion embeddings whose trigger words appear in the negative prompt? I also ran some comparative experiments between stable-diffusion-webui and diffusers inference: the diffusers results are very bad, while the webui results are normal and good.
thanks.
looking forward to a reply.
yes, i've found the same issue, and i think it is not fully related to compel: even without compel, the quality is still degraded compared to sd-webui.
same problem.
@damian0815 @pdoane
same problem.
yes, the diffusers results are not as good as the A1111 webui's.