YangLing0818 / RPG-DiffusionMaster

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)

Home Page: https://arxiv.org/abs/2401.11708

Is It Resizing or Just Fusion at Corresponding Positions?

Lne27 opened this issue

During the regional latent fusion stage, are the regional latents really resized to their corresponding positions? From the code, it looks like only the latents at the corresponding position in each regional image are fused, which is confusing.

Yes, I have the same question. The latent features of each sub-region are cropped directly, not resized:

    out = out[:, int(latent_h*drow.start) + addout:int(latent_h*drow.end),
              int(latent_w*dcell.start) + addin:int(latent_w*dcell.end), :]
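
The slicing above can be checked on a toy array. This is a minimal sketch, not the repository's code: it uses NumPy in place of PyTorch, assumes a (batch, height, width, channels) layout as the indexing suggests, and the function name `crop_region` and its fractional arguments are hypothetical stand-ins for `drow.start`, `drow.end`, `dcell.start`, and `dcell.end`. The offsets `addout` and `addin` default to 0 here.

```python
import numpy as np

def crop_region(out, row_start, row_end, col_start, col_end, addout=0, addin=0):
    """Crop a sub-region from a (batch, h, w, channels) array by fractional bounds.

    Follows the indexing pattern of the snippet above; all names are illustrative.
    """
    latent_h, latent_w = out.shape[1], out.shape[2]
    return out[:,
               int(latent_h * row_start) + addout:int(latent_h * row_end),
               int(latent_w * col_start) + addin:int(latent_w * col_end),
               :]

# Toy latent: batch 1, 8x8 grid, 1 channel; each cell stores its flat index.
latent = np.arange(64, dtype=np.float32).reshape(1, 8, 8, 1)

# Right half of the grid: all rows, columns from fraction 0.5 to 1.0.
region = crop_region(latent, 0.0, 1.0, 0.5, 1.0)

print(region.shape)        # (1, 8, 4, 1)
print(region[0, 0, :, 0])  # [4. 5. 6. 7.]
```

The cropped values are the original cells at columns 4 to 7. Nothing is interpolated, so the region keeps its position and scale within the full latent.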

Then the cropped features are fused with the base latent features at the corresponding position:

    if self.usebase:
        # outb_t = outb[:,:,int(latent_w*drow.start):int(latent_w*drow.end),:].clone()
        outb_t = outb[:, int(latent_h*drow.start) + addout:int(latent_h*drow.end),
                      int(latent_w*dcell.start) + addin:int(latent_w*dcell.end), :].clone()
        out = out * (1 - dcell.base) + outb_t * dcell.base
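
The last line of that snippet is a per-element linear blend between the region's features and the base features cropped at the same position. A minimal NumPy sketch of that blend, assuming `dcell.base` is a scalar weight in [0, 1] (called `base_ratio` here); the function name, argument names, and array values are hypothetical and only for illustration:

```python
import numpy as np

def fuse_with_base(region_out, base_out, rows, cols, base_ratio):
    """Blend a cropped region with the base features at the same position.

    region_out: (batch, h_r, w_r, c) features already cropped to the region.
    base_out:   (batch, h, w, c) base features covering the full latent.
    rows, cols: Python slices selecting the region inside base_out.
    base_ratio: assumed scalar weight in [0, 1], standing in for dcell.base.
    """
    base_crop = base_out[:, rows, cols, :].copy()
    return region_out * (1 - base_ratio) + base_crop * base_ratio

base = np.full((1, 4, 4, 1), 10.0, dtype=np.float32)  # base-prompt features
region = np.zeros((1, 4, 2, 1), dtype=np.float32)     # region features (right half)

fused = fuse_with_base(region, base, slice(0, 4), slice(2, 4), base_ratio=0.25)

print(fused.shape)        # (1, 4, 2, 1)
print(fused[0, 0, :, 0])  # [2.5 2.5]
```

With a weight of 0.25, each fused value is 75% region and 25% base, and the output has the same shape as the cropped region.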

This does not seem to match the resizing described in the paper. Why is it done this way? Is it because resizing doesn't make sense here?