andrewjong / Deep-Learning-Paper-Surveys

Personal repository to track my paper reading

[SMIS] Semantically Multi-modal Image Synthesis (CVPR, Apr 2020)

0. Article Information and Links

1. What do the authors try to accomplish?

A general framework that uses segmentation maps to semantically edit an image (like GauGAN), BUT makes only localized edits that preserve the rest of the image: the user traverses the latent space of a single, isolated label in the segmentation map.

2. What's great compared to previous research?

Introduces new applications:

1. Appearance mixture (new!)

   image

2. Semantic Manipulation (old, previously seen in GauGAN)

   image

3. Style Morphing (new!)

   image

3. Where are the key elements of the technology and method?

Note this paper builds on top of the GauGAN paper (link to my notes).

Architecture Overview: GroupDNet

A standard ConvNet entangles feature maps across classes, which would prevent localized editing. Grouped convolutions keep each class's features independent (a minimal comparison follows the figure).

image
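My minimal sketch of the difference. The channel counts (8 feature channels per class) are illustrative assumptions; only the `groups=C` structure reflects the paper's design:

```python
import torch
import torch.nn as nn

C = 19  # number of semantic classes (e.g. 19 cloth labels)
F = 8   # feature channels per class -- illustrative, not from the paper

# Standard conv: every output channel mixes ALL input channels,
# so features from different classes become entangled.
standard = nn.Conv2d(C * F, C * F, kernel_size=3, padding=1)

# Grouped conv with groups=C: channels are split into C independent
# groups, so class i's features never mix with class j's.
grouped = nn.Conv2d(C * F, C * F, kernel_size=3, padding=1, groups=C)

x = torch.randn(1, C * F, 64, 64)
print(standard(x).shape)  # torch.Size([1, 152, 64, 64])
print(grouped(x).shape)   # torch.Size([1, 152, 64, 64])
```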

Class-specific latent code

The latent code is broken into class-specific latent codes: one code per label in the segmentation map, e.g. 19 cloth labels --> 19 latent codes (toy illustration below). (This seems to imply that we must have separate models for each semantic map type.)
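A toy illustration of the layout; the number of dimensions per code (D) is my assumption:

```python
import torch

C, D = 19, 8           # 19 labels; D dims per code is an assumption
z = torch.randn(C, D)  # one class-specific latent code per label
z[3] = torch.randn(D)  # resample only label 3's code: the other 18
                       # codes, and thus the other image regions,
                       # are left unchanged
```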

The latent code is created by encoding the input image with encoding layers.

image

The input image is split into C semantically segmented parts. The encoder uses grouped convolutions with C groups, so each group effectively operates only on its relevant segment of the image (minimal encoder sketch below).
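A minimal encoder sketch under those constraints. Layer depths, channel counts, and the exact masking details are my assumptions; the grouped structure (`groups=C`) is the paper's point:

```python
import torch
import torch.nn as nn

class GroupedEncoder(nn.Module):
    """Sketch: encode C class-masked copies of the image with grouped
    convolutions so each class gets its own latent distribution."""

    def __init__(self, num_classes=19, feat=8, z_dim=8):
        super().__init__()
        C = num_classes
        self.C = C
        self.body = nn.Sequential(
            nn.Conv2d(C * 3, C * feat, 3, stride=2, padding=1, groups=C),
            nn.LeakyReLU(0.2),
            nn.Conv2d(C * feat, C * feat, 3, stride=2, padding=1, groups=C),
            nn.LeakyReLU(0.2),
        )
        # Grouped heads -> per-class mean and log-variance maps.
        self.to_mu = nn.Conv2d(C * feat, C * z_dim, 3, padding=1, groups=C)
        self.to_logvar = nn.Conv2d(C * feat, C * z_dim, 3, padding=1, groups=C)

    def forward(self, img, seg_onehot):
        # img: (B, 3, H, W); seg_onehot: (B, C, H, W), one channel per label.
        B, _, H, W = img.shape
        # Mask the image by each class and stack along channels:
        # (B, 1, 3, H, W) * (B, C, 1, H, W) -> (B, C, 3, H, W) -> (B, C*3, H, W)
        parts = (img.unsqueeze(1) * seg_onehot.unsqueeze(2)).reshape(B, self.C * 3, H, W)
        h = self.body(parts)
        return self.to_mu(h), self.to_logvar(h)
```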

The latent code is forced to look Gaussian N(0, 1) using a KL-divergence loss. Because the latent code is Gaussian-like, the user can cleanly walk the latent space for each class (sketch below).
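A sketch of the two pieces this implies, the closed-form VAE KL term and the reparameterization trick; both are standard machinery, the per-class application is the paper's:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over all
    # class-specific latent dimensions.
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar)

def reparameterize(mu, logvar):
    # z = mu + sigma * eps keeps sampling differentiable w.r.t. mu/logvar.
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
```

Because each class's code ends up near N(0, 1), at test time one can sample or interpolate a single class's code while holding the others fixed.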

Modify SPADE normalization to work for GroupConvs

Replace SPADE's convolutions with grouped convolutions; the authors call the result Conditional Group Normalization (CG-Norm).

CG-Norm

The Conditional Group Block is akin to SPADE's ResBlk variant, but uses the proposed CG-Norm instead (module sketch below).
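My sketch of what the swap might look like, starting from the published SPADE formulation (normalize, then modulate with gamma/beta predicted from the segmentation map) and replacing the modulation convs with grouped ones; channel counts and the hidden width are assumptions:

```python
import torch
import torch.nn as nn

class CGNorm(nn.Module):
    """Sketch of Conditional Group Normalization: SPADE with its
    modulation convolutions replaced by grouped convolutions."""

    def __init__(self, num_classes=19, feat_per_class=8, hidden=4):
        super().__init__()
        C, F = num_classes, feat_per_class
        self.norm = nn.BatchNorm2d(C * F, affine=False)  # parameter-free normalization
        self.shared = nn.Sequential(
            nn.Conv2d(C, C * hidden, 3, padding=1, groups=C),
            nn.ReLU(),
        )
        # Per-class, spatially varying modulation parameters.
        self.to_gamma = nn.Conv2d(C * hidden, C * F, 3, padding=1, groups=C)
        self.to_beta = nn.Conv2d(C * hidden, C * F, 3, padding=1, groups=C)

    def forward(self, x, seg_onehot):
        # x: (B, C*F, H, W); seg_onehot: (B, C, H, W), resized to match x.
        h = self.shared(seg_onehot)
        return self.norm(x) * (1 + self.to_gamma(h)) + self.to_beta(h)
```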

Loss

image

  • L_GAN is the hinge version of the GAN loss
  • L_FM is the feature-matching loss between the real and synthesized images, computed on features extracted by the discriminator's intermediate layers
  • L_P is the VGG perceptual loss
  • L_KL is the KL divergence of the latent code from the standard Gaussian N(0, 1)

These terms combine into the total generator objective (sketch below).
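A hedged sketch of how the terms might combine, assuming a SPADE-style weighted sum; the lambda values are placeholders, not the paper's numbers:

```python
import torch

def hinge_g(fake_logits):
    # Generator side of the hinge GAN loss: push D(fake) up.
    return -fake_logits.mean()

def hinge_d(real_logits, fake_logits):
    # Discriminator side of the hinge GAN loss.
    return (torch.relu(1.0 - real_logits).mean()
            + torch.relu(1.0 + fake_logits).mean())

# Placeholder weights -- see the paper for the actual hyperparameters.
lambda_fm, lambda_p, lambda_kl = 10.0, 10.0, 0.05
# loss_G = hinge_g(d_fake) + lambda_fm * L_FM + lambda_p * L_P + lambda_kl * L_KL
```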

4. How did you verify that it works?

5. Things to discuss? (e.g. weaknesses, potential for future work, relation to other work)

  • The authors note that this architecture is very SLOW.
  • Potential future work: vary the shape in addition to the texture.

6. Are there any papers to read next?

7. References