Model architecture. The encoder of the UNet uses only standard ResNet blocks together with a SpatialTransformer, which guides the diffusion process with the style embedding obtained from Es. The middle block and the decoder use SPADEResBlock, as in SDM, to inject the semantic mask information. The mask attention mechanism is applied inside the SpatialTransformer, on the cross-attention map.
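The mask attention idea can be sketched as a cross-attention whose score matrix is masked so that each spatial position attends only to the style tokens allowed for its semantic region. This is a minimal illustrative sketch, not the repo's actual code; all function names, shapes, and the plain-list tensor representation are assumptions.

```python
import math

def softmax(scores):
    # Numerically stable softmax; masked entries (-inf) receive zero weight.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_cross_attention(queries, keys, values, mask):
    """Cross-attention where mask[i][j] = True lets spatial position i
    attend to style token j (e.g. only the token of its own semantic
    region). Pure-Python sketch: queries is a list of query vectors,
    keys/values are lists of style-token vectors."""
    d = len(keys[0])
    out = []
    for i, q in enumerate(queries):
        # Scaled dot-product scores, with disallowed pairs forced to -inf.
        scores = [
            (sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
            if mask[i][j] else float("-inf")
            for j, k in enumerate(keys)
        ]
        weights = softmax(scores)
        # Weighted sum of value vectors.
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out
```

With a mask that restricts each position to a single token, the output reduces to that token's value vector, which is the behavior the masking is meant to enforce.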
[Towards Controllable Face Generation with Semantic Latent Diffusion Models]
Our model can generate images in three ways: (a) from a reference image; (b) from a reference image, but with a specific body part given a random style; (c) fully noise-based, without any reference.
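The three modes differ only in where each part's style code comes from: the reference image, or a random draw. A hedged sketch of that selection logic, where the part list, dictionary representation, and function name are all assumptions for illustration:

```python
import random

PARTS = ["skin", "eyes", "mouth", "hair"]  # illustrative part list

def build_style(reference=None, randomize=(), dim=4, rng=None):
    """Assemble a per-part style embedding dict.
    (a) reference given, randomize empty  -> full reference style
    (b) reference given, some parts named -> those parts get random codes
    (c) reference is None                 -> every part is random
    """
    rng = rng or random.Random(0)
    style = {}
    for part in PARTS:
        if reference is not None and part not in randomize:
            # Keep the style code extracted from the reference image.
            style[part] = list(reference[part])
        else:
            # Sample a random style code for this part.
            style[part] = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return style
```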
Interpolation of eyes, mouth, hairstyle, and full style, going from the full target (left) to the full reference (right). Some details are highlighted to make the changes easier to observe.
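The interpolations above amount to linearly blending the per-part style embeddings between target and reference, optionally restricted to one region. A minimal sketch under that assumption (names and dict layout are illustrative):

```python
def lerp_styles(target, reference, alpha, parts=None):
    """Blend per-part style embeddings: alpha=0 -> full target style,
    alpha=1 -> full reference style. `parts` limits the blend to given
    regions (e.g. eyes), leaving the other parts at the target's style."""
    parts = set(parts) if parts is not None else set(target)
    out = {}
    for part, t_vec in target.items():
        r_vec = reference[part]
        if part in parts:
            # Linear interpolation between the two embeddings.
            out[part] = [(1 - alpha) * t + alpha * r
                         for t, r in zip(t_vec, r_vec)]
        else:
            out[part] = list(t_vec)
    return out
```

Sweeping alpha from 0 to 1 produces the left-to-right progression shown in the figure.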
Style transfer comparison between different methods and our model. The style of the reference image is applied to the target image. The overall consistency of the style swap is noticeably better than that of state-of-the-art methods.
A suitable conda environment named diffusion can be created and activated with:

conda env create -f environment.yaml
conda activate diffusion

The demo can then be launched with:

python gradio_img2img.py --dataset CELEBA_HQ_TEST_FOLDER
To use gradio_img2img.py, download the model from here and put it in the checkpoints folder; then download the VQ-F4 autoencoder (f=4, VQ, Z=8192, d=3; the first row in the table) from the LDM repo, following their instructions.