andrewjong / Deep-Learning-Paper-Surveys

Personal repository to track my paper reading

[SMIS] Semantically Multi-modal Image Synthesis (CVPR, Apr 2020)

0. Article Information and Links

1. What do the authors try to accomplish?

A general framework that uses segmentation maps to semantically edit an image (like GauGAN), BUT makes only localized edits that preserve the rest of the image: the user traverses the latent space of a single, isolated label in the segmentation map.

2. What's great compared to previous research?

Introduces new applications:

1. Appearance mixture (new!)

   image

2. Semantic Manipulation (old, previously seen in GauGAN)

   image

3. Style Morphing (new!)

   image

3. Where are the key elements of the technology and method?

Note this paper builds on top of the GauGAN paper (link to my notes).

Architecture Overview: GroupDNet

A standard ConvNet entangles feature maps across classes, which would prevent localized editing. Grouped convolutions keep each class's features independent (a minimal comparison follows the figure).

image
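My minimal sketch of the difference. The channel counts (8 feature channels per class) are illustrative assumptions; only the `groups=C` structure reflects the paper's design:

```python
import torch
import torch.nn as nn

C = 19  # number of semantic classes (e.g. 19 cloth labels)
F = 8   # feature channels per class -- illustrative, not from the paper

# Standard conv: every output channel mixes ALL input channels,
# so features from different classes become entangled.
standard = nn.Conv2d(C * F, C * F, kernel_size=3, padding=1)

# Grouped conv with groups=C: channels are split into C independent
# groups, so class i's features never mix with class j's.
grouped = nn.Conv2d(C * F, C * F, kernel_size=3, padding=1, groups=C)

x = torch.randn(1, C * F, 64, 64)
print(standard(x).shape)  # torch.Size([1, 152, 64, 64])
print(grouped(x).shape)   # torch.Size([1, 152, 64, 64])
```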

Class-specific latent code

The latent code is broken into class-specific latent codes: one code per label in the segmentation map, e.g. 19 cloth labels --> 19 latent codes (toy illustration below). (This seems to imply that we must have separate models for each semantic map type.)
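A toy illustration of the layout; the number of dimensions per code (D) is my assumption:

```python
import torch

C, D = 19, 8           # 19 labels; D dims per code is an assumption
z = torch.randn(C, D)  # one class-specific latent code per label
z[3] = torch.randn(D)  # resample only label 3's code: the other 18
                       # codes, and thus the other image regions,
                       # are left unchanged
```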

The latent code is created by encoding the input image with encoding layers.

image

The input image is split into C semantically segmented parts. The encoder uses grouped convolutions with C groups, so each group effectively operates only on its relevant segment of the image (minimal encoder sketch below).
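A minimal encoder sketch under those constraints. Layer depths, channel counts, and the exact masking details are my assumptions; the grouped structure (`groups=C`) is the paper's point:

```python
import torch
import torch.nn as nn

class GroupedEncoder(nn.Module):
    """Sketch: encode C class-masked copies of the image with grouped
    convolutions so each class gets its own latent distribution."""

    def __init__(self, num_classes=19, feat=8, z_dim=8):
        super().__init__()
        C = num_classes
        self.C = C
        self.body = nn.Sequential(
            nn.Conv2d(C * 3, C * feat, 3, stride=2, padding=1, groups=C),
            nn.LeakyReLU(0.2),
            nn.Conv2d(C * feat, C * feat, 3, stride=2, padding=1, groups=C),
            nn.LeakyReLU(0.2),
        )
        # Grouped heads -> per-class mean and log-variance maps.
        self.to_mu = nn.Conv2d(C * feat, C * z_dim, 3, padding=1, groups=C)
        self.to_logvar = nn.Conv2d(C * feat, C * z_dim, 3, padding=1, groups=C)

    def forward(self, img, seg_onehot):
        # img: (B, 3, H, W); seg_onehot: (B, C, H, W), one channel per label.
        B, _, H, W = img.shape
        # Mask the image by each class and stack along channels:
        # (B, 1, 3, H, W) * (B, C, 1, H, W) -> (B, C, 3, H, W) -> (B, C*3, H, W)
        parts = (img.unsqueeze(1) * seg_onehot.unsqueeze(2)).reshape(B, self.C * 3, H, W)
        h = self.body(parts)
        return self.to_mu(h), self.to_logvar(h)
```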

The latent code is forced to look Gaussian N(0, 1) using a KL-divergence loss. Because the latent code is Gaussian-like, the user can cleanly walk the latent space for each class (sketch below).
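A sketch of the two pieces this implies, the closed-form VAE KL term and the reparameterization trick; both are standard machinery, the per-class application is the paper's:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over all
    # class-specific latent dimensions.
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar)

def reparameterize(mu, logvar):
    # z = mu + sigma * eps keeps sampling differentiable w.r.t. mu/logvar.
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
```

Because each class's code ends up near N(0, 1), at test time one can sample or interpolate a single class's code while holding the others fixed.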

Modify SPADE normalization to work for GroupConvs

Replace SPADE's convolutions with grouped convolutions; the authors call the result Conditional Group Normalization (CG-Norm).

CG-Norm

The Conditional Group Block is akin to SPADE's ResBlk variant, but uses the proposed CG-Norm instead (module sketch below).
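My sketch of what the swap might look like, starting from the published SPADE formulation (normalize, then modulate with gamma/beta predicted from the segmentation map) and replacing the modulation convs with grouped ones; channel counts and the hidden width are assumptions:

```python
import torch
import torch.nn as nn

class CGNorm(nn.Module):
    """Sketch of Conditional Group Normalization: SPADE with its
    modulation convolutions replaced by grouped convolutions."""

    def __init__(self, num_classes=19, feat_per_class=8, hidden=4):
        super().__init__()
        C, F = num_classes, feat_per_class
        self.norm = nn.BatchNorm2d(C * F, affine=False)  # parameter-free normalization
        self.shared = nn.Sequential(
            nn.Conv2d(C, C * hidden, 3, padding=1, groups=C),
            nn.ReLU(),
        )
        # Per-class, spatially varying modulation parameters.
        self.to_gamma = nn.Conv2d(C * hidden, C * F, 3, padding=1, groups=C)
        self.to_beta = nn.Conv2d(C * hidden, C * F, 3, padding=1, groups=C)

    def forward(self, x, seg_onehot):
        # x: (B, C*F, H, W); seg_onehot: (B, C, H, W), resized to match x.
        h = self.shared(seg_onehot)
        return self.norm(x) * (1 + self.to_gamma(h)) + self.to_beta(h)
```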

Loss

image

  • L_GAN is the hinge version of the GAN loss
  • L_FM is the feature-matching loss between the real and synthesized images, computed on features extracted by the discriminator's intermediate layers
  • L_P is the VGG perceptual loss
  • L_KL is the KL divergence of the latent code from the standard Gaussian N(0, 1)

These terms combine into the total generator objective (sketch below).
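A hedged sketch of how the terms might combine, assuming a SPADE-style weighted sum; the lambda values are placeholders, not the paper's numbers:

```python
import torch

def hinge_g(fake_logits):
    # Generator side of the hinge GAN loss: push D(fake) up.
    return -fake_logits.mean()

def hinge_d(real_logits, fake_logits):
    # Discriminator side of the hinge GAN loss.
    return (torch.relu(1.0 - real_logits).mean()
            + torch.relu(1.0 + fake_logits).mean())

# Placeholder weights -- see the paper for the actual hyperparameters.
lambda_fm, lambda_p, lambda_kl = 10.0, 10.0, 0.05
# loss_G = hinge_g(d_fake) + lambda_fm * L_FM + lambda_p * L_P + lambda_kl * L_KL
```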

4. How did you verify that it works?

5. Things to discuss? (e.g. weaknesses, potential for future work, relation to other work)

  • The authors note that this architecture is very SLOW.
  • Potential future work: vary the shape in addition to the texture.

6. Are there any papers to read next?

7. References