salesforce / EDICT

Prompt to prompt editing

andreaferretti opened this issue

I am trying to understand to what extent your method replaces prompt-to-prompt. It seems to me that EDICT is a clever way to invert DDIM diffusion. If so, once we get our latents, we should be able to apply prompt to prompt editing techniques. Instead, what you propose is just to run DDIM denoising conditioned on the target prompt to obtain the edited image.

It has been observed that (on generated images) prompt-to-prompt obtains more realistic and semantically meaningful edits. I guess the technique should be readily applicable to a latent obtained by EDICT inversion - and the code seems to support it - but the paper does not mention this combination, and in fact setting use_p2p=True gives me inferior results.

Do you have an explanation why using prompt-to-prompt is not beneficial?

commented

Hi, thanks for the question!

This surprised us too. I love P2P and thought it'd boost our results. We don't have a full explanation for it (we haven't focused on this a ton, but I've dug through the code to double-check that things are wired correctly), but what I typically see with EDICT+P2P is some combo of:

  1. The image remains overly faithful to the original

  2. The image becomes unrealistic

Point 2 is a bit easier to explain, imo. As we show in Figure 4 of the paper, the generative process can be sensitive to perturbations; that's why we need averaging layers. It's fairly intuitive that putting another constraint on the process could mess things up.

The puzzling thing about point 1 is that P2P clearly works in something like null-text inversion, so it must be something EDICT-specific. One hypothesis is that the combination of the averaging layers with predictions operating on the counterpart sequence (e.g. x to y) dampens the amount of change that can be made when attention maps are constrained. It also definitely makes the concept of self-attention more awkward.
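For anyone following along, here is a rough sketch of what "averaging layers with predictions operating on the counterpart sequence" looks like as code. It is a toy in pure Python, not the repository's implementation: `toy_eps` is a made-up stand-in for the noise-prediction network, the `shift` argument is a hypothetical placeholder for prompt conditioning, and the coefficients `a`, `b` and mixing weight `p` are illustrative constants rather than values taken from a real noise schedule.

```python
import math

def toy_eps(z, t, shift):
    # Hypothetical stand-in for the noise predictor eps(z, t, prompt).
    # `shift` plays the role of the prompt conditioning.
    return [math.tanh(v + shift) * (1.0 + 0.1 * t) for v in z]

def edict_denoise_step(x, y, t, a, b, p, shift):
    # Each sequence is updated with a noise prediction computed
    # from its counterpart (x uses y, then y uses the updated x).
    x_inter = [a * xi + b * e for xi, e in zip(x, toy_eps(y, t, shift))]
    y_inter = [a * yi + b * e for yi, e in zip(y, toy_eps(x_inter, t, shift))]
    # Averaging layers: pull the two sequences back toward each other.
    x_new = [p * xi + (1 - p) * yi for xi, yi in zip(x_inter, y_inter)]
    y_new = [p * yi + (1 - p) * xi for yi, xi in zip(y_inter, x_new)]
    return x_new, y_new

def edict_invert_step(x, y, t, a, b, p, shift):
    # Undo the averaging layers in reverse order.
    y_inter = [(yi - (1 - p) * xi) / p for yi, xi in zip(y, x)]
    x_inter = [(xi - (1 - p) * yi) / p for xi, yi in zip(x, y_inter)]
    # Undo the coupled updates; every eps input is already known here,
    # which is what makes the step exactly invertible.
    y_prev = [(yi - b * e) / a
              for yi, e in zip(y_inter, toy_eps(x_inter, t, shift))]
    x_prev = [(xi - b * e) / a
              for xi, e in zip(x_inter, toy_eps(y_prev, t, shift))]
    return x_prev, y_prev

def roundtrip(x0, steps, src_shift, tgt_shift, a=0.98, b=0.05, p=0.93):
    # Invert under the source conditioning, then denoise under the target.
    x, y = list(x0), list(x0)
    for t in range(1, steps + 1):
        x, y = edict_invert_step(x, y, t, a, b, p, src_shift)
    for t in range(steps, 0, -1):
        x, y = edict_denoise_step(x, y, t, a, b, p, tgt_shift)
    return x, y
```

With the same conditioning in both directions, `roundtrip` reproduces the input up to floating-point error; changing `tgt_shift` is the toy analogue of denoising with a different target prompt. Any extra constraint applied only during the denoising pass, such as locked attention maps, has to survive both the counterpart predictions and the averaging step.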

It's possible that softening the locking of attention maps to re-weighting or being more selective in their application (or customizing them to EDICT) could work. This definitely is an area we want to keep thinking about so I'm curious if you have any further insight (experimental or otherwise). Happy to have follow-up discussions!
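To make the "softening" idea concrete, one possible reading is sketched below: instead of overwriting the target attention map with the source one, blend the two, and only do so for an early fraction of the steps. This is purely illustrative and untested on EDICT; `blend_attention`, `should_inject`, the weight `w` and the fraction `frac` are all hypothetical names, not part of the EDICT or Prompt-to-Prompt code.

```python
def blend_attention(src_attn, tgt_attn, w):
    # Convex combination of two attention maps (lists of rows).
    # w = 1.0 reproduces hard locking to the source map,
    # w = 0.0 leaves the target map untouched.
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be in [0, 1]")
    return [
        [w * s + (1.0 - w) * t for s, t in zip(src_row, tgt_row)]
        for src_row, tgt_row in zip(src_attn, tgt_attn)
    ]

def should_inject(step, total_steps, frac):
    # Apply the constraint only during the first `frac` of the steps.
    return step < frac * total_steps

# Example: two 2x3 attention maps whose rows each sum to 1.
src = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
tgt = [[0.2, 0.3, 0.5], [0.3, 0.3, 0.4]]
soft = blend_attention(src, tgt, 0.5)
```

Because each row of a softmax attention map sums to 1, a convex combination of two such maps keeps that property, so no renormalization is needed.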

After some more experiments, I am starting to find the P2P interface too restrictive for general use, so I am not sure I would use it with EDICT even if it were available. Being able to supply just any target prompt is so much more convenient.

Anyway, I don't have a good explanation. I actually rewrote the P2P part to use the official Prompt-to-Prompt implementation, but I never got any good results with that.

I am sorry, I can't; it is part of a proprietary codebase.