BUG: weights inside CLIP_L(...) don't parse correctly

Question

asagi4 opened this issue 7 months ago

Alexander Brown · Answer 1 · Fri Jan 05 2024 09:20:17 GMT+0800 (China Standard Time)

Thank you!

diaopal · Answer 2 · Fri Jan 05 2024 14:41:58 GMT+0800 (China Standard Time)

The clip_l and clip_g outputs get concatenated with torch, as seen in https://github.com/comfyanonymous/ComfyUI/blob/6d281b4ff4ad3918a4f3b4ca4a8b547a2ba3bf80/comfy/sdxl_clip.py#L52-L56
So which weights do you use for the final conditioning?
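
For readers who don't want to open the link, here is a minimal sketch of what that concatenation amounts to. It assumes the two outputs are joined along the feature dimension (the equivalent of torch.cat([l_out, g_out], dim=-1)); plain lists stand in for tensors, the sizes are toy values, and concat_last_dim is a hypothetical helper, not a ComfyUI function.

```python
# Plain-Python stand-in for torch.cat([l_out, g_out], dim=-1).
# Each encoder output is a list of tokens; each token is a list of features.

def concat_last_dim(l_out, g_out):
    """Join per-token feature vectors from two encoders along the feature axis."""
    if len(l_out) != len(g_out):
        raise ValueError("both encoders must produce the same number of tokens")
    return [l_tok + g_tok for l_tok, g_tok in zip(l_out, g_out)]


# Two tokens; 3 features from "clip_l" and 4 from "clip_g" (toy sizes only).
l_out = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
g_out = [[1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3]]

emb = concat_last_dim(l_out, g_out)
print(len(emb), len(emb[0]))  # token count, features per token
```

Under that assumption, each encoder's features keep their own slice of the combined vector rather than being blended together.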

asagi4 · Answer 3 · Sat Jan 06 2024 01:39:02 GMT+0800 (China Standard Time)

@mizukarada as far as I can tell that encoding happens outside anything my nodes touch.

The logic is the same as in CLIPTextEncodeSDXL

The node just encodes the l and g tokens and then calls clip.encode_from_tokens (or its equivalent from ADV_CLIP_emb if that's in use). What that does is up to ComfyUI.
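
Roughly, the flow described above could be sketched like this. StubClip is a hypothetical stand-in so the snippet runs without ComfyUI, and encode_sdxl is an illustrative name; only clip.tokenize and clip.encode_from_tokens are meant to mirror the real calls, and the padding step reflects my understanding of CLIPTextEncodeSDXL rather than a copy of it.

```python
class StubClip:
    """Hypothetical stand-in for a ComfyUI CLIP object; not the real class."""

    def tokenize(self, text):
        # One chunk per prompt, each token paired with a weight of 1.0.
        chunk = [(word, 1.0) for word in text.split()]
        return {"l": [list(chunk)], "g": [list(chunk)]}

    def encode_from_tokens(self, tokens, return_pooled=False):
        # Real encoding happens inside ComfyUI; here we only report chunk counts.
        cond = {"l_chunks": len(tokens["l"]), "g_chunks": len(tokens["g"])}
        return (cond, "pooled") if return_pooled else cond


def encode_sdxl(clip, text_l, text_g):
    """Tokenize the l and g prompts separately, then hand both to the encoder."""
    tokens = clip.tokenize(text_g)
    tokens["l"] = clip.tokenize(text_l)["l"]  # swap in the clip_l prompt's tokens
    empty = clip.tokenize("")
    # Pad the shorter side with empty chunks so both have the same chunk count.
    while len(tokens["l"]) < len(tokens["g"]):
        tokens["l"] += empty["l"]
    while len(tokens["l"]) > len(tokens["g"]):
        tokens["g"] += empty["g"]
    return clip.encode_from_tokens(tokens, return_pooled=True)


cond, pooled = encode_sdxl(StubClip(), "cat", "apple orange")
print(cond)
```

Everything after the encode_from_tokens call, including how the l and g outputs are merged, is outside this function.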

diaopal · Answer 4 · Sat Jan 13 2024 10:36:57 GMT+0800 (China Standard Time)

@asagi4 For example,

clip_l: cat AND dog :1.2

clip_g: apple AND orange :1.3

clip_l and clip_g get merged into one tensor, which we'll call emb.

What is emb's weight? Is it 1.2, 1.3, or the average of the two? Mind you, there can also be multiple weights in one text.

asagi4 · Answer 5 · Mon Jan 15 2024 01:03:18 GMT+0800 (China Standard Time)

@mizukarada I don't think you can do that with my nodes. AND combining of prompts is processed after any clip_l / clip_g separation, since the l / g distinction happens at the token level and disappears once the tokens have been encoded into tensors. (There are also the "pooled" vs. non-pooled tensors, but my nodes basically treat them identically.)

AND inside the CLIP_L function doesn't make sense: CLIP_L(foo AND bar) will essentially be parsed as two prompts, "CLIP_L(foo" and "bar)".
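
A minimal sketch of why that happens, assuming the prompt is split on the keyword AND before any CLIP_L(...) call is parsed. split_on_and is a hypothetical helper for illustration, not the nodes' actual parser.

```python
import re


def split_on_and(prompt):
    """Split a prompt on the standalone keyword AND (hypothetical helper)."""
    return [part.strip() for part in re.split(r"\bAND\b", prompt)]


parts = split_on_and("CLIP_L(foo AND bar)")
print(parts)  # ['CLIP_L(foo', 'bar)']
```

Neither piece contains a complete CLIP_L(...) call, so the function cannot be applied to either one.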