Regarding Alternative Architectures for More Complicated Workflows

Question

xiankgx opened this issue 3 months ago

We also introduce several alternative architectures in Fig. 4 for more complicated workflows. We can add zero-initialized channels to the UNet and use VAE (with or without latent transparency) to encode foreground, or background, or layer combinations into conditions, and train the model to generate foreground or background (e.g., Fig. 4-(b, d)), or directly generate blended images (e.g., Fig. 4-(a, c)).

The base model is an SDXL with LoRA layers. What exactly are these alternative architectures? Are they simply the base model (SDXL with LoRA) with the input convolution of the UNet extended to accept more channels?
What format are the model weights in? Are they stored as value differences relative to the base model?
The input to the UNet is now the noised latents plus additional conditional image latents. What is the order of the latents in the concatenation list? Are the additional latents noised or un-noised?
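
To make the question concrete, here is a minimal PyTorch sketch of how I currently read "zero-initialized channels". It is only my guess, not the released implementation. The helper name `expand_conv_in`, the stand-in `conv_in` layer, the concatenation order (noised latents first), and the un-noised condition latents are all assumptions.

```python
import torch
import torch.nn as nn


def expand_conv_in(conv: nn.Conv2d, extra_channels: int) -> nn.Conv2d:
    """Return a copy of `conv` that accepts `extra_channels` more input channels.

    Weights for the new channels are zero, so the expanded layer initially
    ignores the condition latents and reproduces the original output.
    (Hypothetical helper, written only to illustrate the question.)
    """
    new_conv = nn.Conv2d(
        conv.in_channels + extra_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        # Copy the pretrained weights into the first `in_channels` slots.
        new_conv.weight[:, : conv.in_channels] = conv.weight
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv


# Stand-in for the SDXL UNet input convolution (4 latent channels -> 320 features).
conv_in = nn.Conv2d(4, 320, kernel_size=3, padding=1)
expanded = expand_conv_in(conv_in, extra_channels=4)

noised_latents = torch.randn(1, 4, 64, 64)
condition_latents = torch.randn(1, 4, 64, 64)  # assumed: un-noised VAE encoding

# Assumed order: noised latents first, condition latents appended on the channel axis.
unet_input = torch.cat([noised_latents, condition_latents], dim=1)
out_expanded = expanded(unet_input)
out_original = conv_in(noised_latents)
```

If this reading is right, `out_expanded` matches `out_original` before any training, because the zero weights cancel the condition channels. Is this what the released models do, and is the concatenation order the same as above?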