damian0815 / compel

A prompting enhancement library for transformers-type text embedding systems


Questions about implementation

bonlime opened this issue

Hi, first of all thanks for a useful library. I've been looking into your implementation of prompt weighting and have some questions about it. (i'm only interested in the get_embeddings_for_weighted_prompt_fragments function, without blending etc.)

  1. if you have a separate function for handling weights < 1, why are these weights also used in the first call to build_weighted_embedding_tensor?
  2. the logic for handling the negative cases makes much more sense to me; why not adopt the same approach for positive weights?

i've tried changing your implementation by adopting a similar strategy for weights > 1, and it seems to give much more consistent results.
there is another implementation suggestion. currently you're calculating embedding_without_this by removing the weighted piece, which leads to a significant change in the whole final embedding. i've observed that if you instead mask the tokens by passing an attention_mask to the text_encoder, the overall embedding changes less, giving a more precise "direction" of change.
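roughly what i mean, as a sketch - this assumes a plain transformers CLIP tokenizer and text encoder rather than compel's internals, and the "ball" token position is hard-coded just for illustration:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

full = tokenizer("a cat playing with a ball", padding="max_length",
                 max_length=tokenizer.model_max_length, return_tensors="pt")
ball_positions = [6]  # position of "ball"; in real code, locate it via the tokenizer

with torch.no_grad():
    # current approach: drop the fragment from the text entirely,
    # which shifts the position of every token that follows it
    without = tokenizer("a cat playing with a", padding="max_length",
                        max_length=tokenizer.model_max_length, return_tensors="pt")
    emb_removed = text_encoder(**without).last_hidden_state

    # suggested approach: keep the full token sequence but zero out the
    # fragment's positions in the attention mask, so every other token
    # keeps its original position and the embedding changes less overall
    mask = full.attention_mask.clone()
    mask[0, ball_positions] = 0
    emb_masked = text_encoder(input_ids=full.input_ids,
                              attention_mask=mask).last_hidden_state
```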

currently you're calculating embedding_without_this by removing the weighted piece

thanks for the suggestion - yes, i've since become aware of this and it's on the roadmap to change at some point. i did not have much luck using attention_mask (cf huggingface/diffusers#1890), but i was going to try substituting <|pad|> tokens for the omitted tokens instead. but do you have a working example you could share? perhaps a pull request i could merge in?
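to illustrate, the pad-substitution idea would be something roughly like this - just a sketch against a plain transformers CLIP tokenizer and text encoder, not compel's actual code (note that CLIP tokenizers typically use <|endoftext|> as the pad token):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

ids = tokenizer("a cat playing with a ball", padding="max_length",
                max_length=tokenizer.model_max_length,
                return_tensors="pt").input_ids
ball_positions = [6]  # position of "ball", hard-coded for illustration

# overwrite the fragment's token ids with the pad token instead of deleting
# them, so all the other tokens keep their original positions
padded_ids = ids.clone()
padded_ids[0, ball_positions] = tokenizer.pad_token_id

with torch.no_grad():
    emb_padded = text_encoder(input_ids=padded_ids).last_hidden_state
```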

however I'm not sure i understand the first two questions -

the logic for handling the negative cases makes much more sense to me; why not adopt the same approach for positive weights?

which "negative cases" do you mean?

if you have a separate function for handling weights < 1 ...

for 1. there isn't a separate function - what's happening here is a blend, so for example
a cat playing with a (ball)0.8 is (roughly) equivalent to ("a cat playing with a ball", "a cat playing with a").blend(1, 0.8), where ball in the first part of the blend has its weight multiplied by 0.8. the weighting 0.8 is applied to build the base_embedding, and then an additional embedding without ball is constructed and blended with that.
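expressed with compel's prompt syntax, the (rough) equivalence looks like this - a sketch, using an arbitrary SD 1.5 pipeline purely for illustration:

```python
from diffusers import StableDiffusionPipeline
from compel import Compel

pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
compel = Compel(tokenizer=pipeline.tokenizer, text_encoder=pipeline.text_encoder)

# downweighting a fragment...
cond_weighted = compel.build_conditioning_tensor("a cat playing with a (ball)0.8")

# ...behaves (roughly, not identically) like blending the full prompt
# with the fragment-free prompt
cond_blended = compel.build_conditioning_tensor(
    '("a cat playing with a ball", "a cat playing with a").blend(1, 0.8)'
)
```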

ok i understand what you meant with the mask now. that makes a lot of sense, i'll try and get it in for the next release.

since compel v1.0.0 downweighting now masks rather than removes tokens by default - thanks for the suggestion.

@damian0815 Glad to see you have adopted the suggestion so fast!

but i was going to try substituting <|pad|> tokens for the omitted tokens instead

this is an interesting idea. since i wrote to you i've found that while masking works much better than removing part of the prompt, one property is not preserved: setting a weight of 0 still produces an image different from simply removing the fragment from the prompt. so maybe your approach with <|pad|> may be better - have you experimented with it?

also, i was thinking about hacky approaches like calculating the embedding for the empty prompt "", taking its token-wise average, and using the result as a substitute for the masked tokens. this is just a thought though, i haven't tried it yet.
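concretely, something like this sketch - again against a plain transformers CLIP tokenizer and text encoder, with the token positions hard-coded for illustration:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

full = tokenizer("a cat playing with a ball", padding="max_length",
                 max_length=tokenizer.model_max_length, return_tensors="pt")
ball_positions = [6]  # position of "ball", hard-coded for illustration

with torch.no_grad():
    # encode the empty prompt and take the token-wise average of its embeddings
    empty = tokenizer("", padding="max_length",
                      max_length=tokenizer.model_max_length, return_tensors="pt")
    filler = text_encoder(**empty).last_hidden_state.mean(dim=1)  # shape (1, dim)

    # substitute the averaged vector at the downweighted token positions
    emb = text_encoder(**full).last_hidden_state.clone()
    emb[0, ball_positions] = filler
```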