[CV_GAN] Generative Adversarial Nets

GAN : Generative Adversarial Nets
https://jeonggg119.tistory.com/37

Abstract

  • Estimating Generative models via an Adversarial process
  • Simultaneously training two models (minimax two-player game)
    • Generative model G : capturing the data distribution (→ recovering the training data distribution)
    • Discriminative model D : estimating the probability that a sample came from the training data rather than G → at convergence, equal to 1/2 everywhere
  • G and D are defined by multilayer perceptrons & trained with backprop

1. Introduction

  • The promise of DL : to discover models that represent probability distributions over many kinds of data
  • The most striking success in DL : Discriminative models that map a high dimensional, rich sensory input to a class label
    • based on backprop and dropout
    • using piecewise linear units, which have a particularly well-behaved gradient
  • Deep Generative model : less impact due to..
    • difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation
    • difficulty of leveraging benefits of piecewise linear units
  • GAN : training both models using only backprop and dropout & sampling from G using only forward prop
    • Generative model G : generating samples by passing random noise through a multilayer perceptron
    • Discriminative model D : also defined by a multilayer perceptron
    • No need for Markov chains or inference networks

2. Related work

  • RBMs(restricted Boltzmann machines), DBMs(deep Boltzmann machines) : undirected graphical models with latent variables
  • DBNs(Deep belief networks) : hybrid models containing a single undirected layer and several directed layers
  • Score matching, NCE(noise-contrastive estimation) : criteria that don't approximate or bound log-likelihood
  • GSN(generative stochastic network) : extending generalized denoising auto-encoders → training G to draw samples from the desired distribution

3. Adversarial nets

1) Adversarial modeling (G+D) based on MLPs

  • p_g : G's distribution
  • p_z(z) : Input noise random variables
  • G : differentiable function represented by an MLP → G(z) : mapping noise to data space → output : fake image
  • D(x) : probability that x came from the training data rather than p_g → output : single scalar (a minimal MLP sketch follows below)
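A minimal sketch of how G and D might be defined as MLPs, assuming PyTorch; the layer sizes, 100-dim noise, and 784-dim (e.g. flattened MNIST) data space are illustrative choices, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

# Generator G: maps noise z ~ p_z(z) to a fake sample G(z) in data space.
# Discriminator D: maps a sample x to a single scalar D(x) in (0, 1),
# the probability that x came from the training data rather than p_g.
class Generator(nn.Module):
    def __init__(self, z_dim=100, hidden=256, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, x_dim), nn.Sigmoid(),  # outputs in [0, 1], e.g. pixel intensities
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, x_dim=784, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),      # single scalar probability
        )

    def forward(self, x):
        return self.net(x)
```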

2) Two-player minimax game with value function V(G,D)

min_G max_D V(D, G) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 - D(G(z)))]

  • D : maximize probability of assigning correct label to Training examples & Samples from G
    • D(x)=1, D(G(z))=0
  • G : minimize log(1-D(G(z)))
    • D(G(z))=1
    • Implementation : train G to maximize log(D(G(z))) instead → stronger gradients early in learning (prevents saturation; see the gradient comparison below)
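A short worked comparison of the two G objectives, writing D(G(z)) = σ(a) for D's pre-sigmoid output a (a framing assumption consistent with D producing a probability), shows why the alternative gives stronger gradients when D confidently rejects G's samples early in training:

```latex
% With D(G(z)) = \sigma(a), where a is D's pre-sigmoid output and \sigma the logistic sigmoid:
\frac{d}{da}\,\log\bigl(1-\sigma(a)\bigr) = -\sigma(a) \;\longrightarrow\; 0
  \quad \text{as } a \to -\infty \quad \text{(saturates when D confidently rejects fakes)}
\\[4pt]
\frac{d}{da}\,\log \sigma(a) = 1-\sigma(a) \;\longrightarrow\; 1
  \quad \text{as } a \to -\infty \quad \text{(strong gradient early in learning)}
```

Both objectives share the same fixed point of the G/D dynamics; only the strength of the early-training gradient differs.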

3) Theoretical Analysis


  • Training criterion allows one to recover data generating distribution as G and D are given enough capacity


  • [Algorithm 1] k steps of optimizing D and 1 step of optimizing G (see the training-loop sketch after this list)
    • D : being maintained near its optimal solution
    • G : changing slowly enough
  • Loss function for G : min log(1-D(G(z))) => in practice, max log(D(G(z))) for stronger gradients early in training
  • D is trained to discriminate samples from data, converging to D*(x) = P_data(x)/(P_data(x)+P_g(x))
    • When D reaches the optimal state for its objective, G is then trained to achieve its own objective
  • ∴ P_g(x) = P_data(x) <=> D(G(z)) = 1/2
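A minimal sketch of Algorithm 1's alternating updates, assuming the PyTorch Generator/Discriminator sketch above; the data loader, Gaussian noise prior, learning rates, and k=1 are illustrative placeholders (the paper used k=1 and momentum-based SGD in its experiments):

```python
import torch

G, D = Generator(), Discriminator()              # from the sketch in Section 3 above
opt_d = torch.optim.SGD(D.parameters(), lr=0.01, momentum=0.9)
opt_g = torch.optim.SGD(G.parameters(), lr=0.01, momentum=0.9)
bce = torch.nn.BCELoss()
k = 1                                            # number of D steps per G step

for real_x in real_loader:                       # hypothetical loader of flattened real images
    m = real_x.size(0)

    # --- k steps of optimizing D: maximize log D(x) + log(1 - D(G(z))) ---
    for _ in range(k):
        z = torch.randn(m, 100)                  # Gaussian noise prior assumed here
        fake_x = G(z).detach()                   # do not backprop into G on the D step
        loss_d = bce(D(real_x), torch.ones(m, 1)) + bce(D(fake_x), torch.zeros(m, 1))
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

    # --- 1 step of optimizing G: maximize log D(G(z)) (non-saturating form) ---
    z = torch.randn(m, 100)
    loss_g = bce(D(G(z)), torch.ones(m, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```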

4. Theoretical Results

  • G implicitly defines P_g as distribution of the samples G(z) obtained when z~P_z
  • [Algorithm 1] to converge to a good estimator of P_data
  • Non-parametric : representing a model with infinite capacity by studying convergence in the space of probability density functions
  • Global optimum for p_g = p_data

4.1 Global Optimality of p_g = p_data

[Proposition 1]

  • Optimal D for any given G
  • For G fixed, the optimal D is D*(x) = P_data(x)/(P_data(x)+P_g(x)) (derivation sketched below)
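A compact version of the Proposition 1 argument: rewrite V(G, D) as a single integral over x, then maximize pointwise using the fact that y ↦ a·log y + b·log(1−y) peaks at y = a/(a+b) for a, b > 0:

```latex
V(G,D) = \int_x p_{data}(x)\log D(x)\,dx + \int_z p_z(z)\log\bigl(1-D(G(z))\bigr)\,dz
       = \int_x \bigl[\,p_{data}(x)\log D(x) + p_g(x)\log\bigl(1-D(x)\bigr)\bigr]\,dx
\\[6pt]
\frac{d}{dy}\bigl[a\log y + b\log(1-y)\bigr] = \frac{a}{y}-\frac{b}{1-y}=0
\;\Longrightarrow\; y^{*}=\frac{a}{a+b}
\;\Longrightarrow\; D^{*}_{G}(x)=\frac{p_{data}(x)}{p_{data}(x)+p_{g}(x)}
```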

[Theorem 1]
C(G) = max_D V(G,D) : the virtual training criterion for G

  • The global minimum of C(G) is -log 4, achieved if and only if P_g = P_data
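Substituting D*_G into V(G, D), as in the paper's proof of Theorem 1, makes the −log 4 minimum explicit:

```latex
C(G) = \max_D V(G,D)
     = \mathbb{E}_{x\sim p_{data}}\!\left[\log\frac{p_{data}(x)}{p_{data}(x)+p_g(x)}\right]
       + \mathbb{E}_{x\sim p_g}\!\left[\log\frac{p_g(x)}{p_{data}(x)+p_g(x)}\right]
     = -\log 4 + 2\cdot \mathrm{JSD}\bigl(p_{data}\,\|\,p_g\bigr)
```

Since the Jensen–Shannon divergence is non-negative and zero only when the two distributions coincide, C(G) ≥ −log 4 with equality exactly when P_g = P_data.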

4.2 Convergence of Algorithm 1


[Proposition 2]

  • If G and D have enough capacity, and at each step of Algorithm 1 D is allowed to reach its optimum given G while P_g is updated to improve the criterion → P_g converges to P_data
  • pf) V(G, D) = U(P_g, D) : convex function in P_g (see the sketch after this list)
  • Computing a gradient descent update for P_g at the optimal D given G
  • With sufficiently small updates of P_g, convergence follows
  • In practice, optimizing θ_g rather than P_g itself
  • Excellent performance of MLPs in practice → reasonable models to use despite the lack of theoretical guarantees
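A sketch of the convexity argument behind the proof bullet above (following the paper's Proposition 2; notation as in Section 3):

```latex
U(p_g, D) = \mathbb{E}_{x\sim p_{data}}[\log D(x)] + \int_x p_g(x)\,\log\bigl(1-D(x)\bigr)\,dx
```

For fixed D this is linear, hence convex, in p_g; the supremum over D of convex functions is itself convex, and its subderivatives include the derivative at the D attaining the maximum. A gradient step for p_g taken at the optimal D is therefore a descent step on the convex criterion sup_D U(p_g, D), so sufficiently small updates converge to p_g = p_data.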

5. Experiments

  • Datasets : MNIST, Toronto Face Database(TFD), CIFAR-10
  • G : ReLU + sigmoid activations / noise used only as input to the bottommost layer (the framework permits dropout and other noise at intermediate layers)
  • D : Maxout activations / Dropout (see the maxout sketch below)
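A minimal sketch of a maxout activation as used in D, assuming PyTorch; the number of linear pieces k is an illustrative choice:

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout unit: compute k affine pieces and take their elementwise maximum."""
    def __init__(self, in_features, out_features, k=5):
        super().__init__()
        self.out_features, self.k = out_features, k
        self.linear = nn.Linear(in_features, out_features * k)

    def forward(self, x):
        # (batch, out_features * k) -> (batch, out_features, k) -> max over the k pieces
        return self.linear(x).view(-1, self.out_features, self.k).max(dim=2).values
```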

[Table 1]
(image : Parzen window log-likelihood estimates on MNIST and TFD)

  • Estimation method : Gaussian Parzen window-based log-likelihood estimation of the probability of the test data (σ chosen by cross-validation on a validation set)
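A sketch of the Gaussian Parzen window estimate used in Table 1, assuming NumPy/SciPy; `gen_samples` are samples drawn from G, and `sigma` is the bandwidth the paper selects by cross-validation:

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(test_x, gen_samples, sigma):
    """Mean log-likelihood of test points under a Gaussian Parzen window
    (kernel density estimate) fit to samples drawn from G."""
    n, d = gen_samples.shape
    lls = []
    for x in test_x:
        # log of the isotropic Gaussian kernel centered at each generated sample
        sq_dist = np.sum((gen_samples - x) ** 2, axis=1)
        log_kernels = -sq_dist / (2 * sigma ** 2)
        log_norm = np.log(n) + d / 2 * np.log(2 * np.pi * sigma ** 2)
        lls.append(logsumexp(log_kernels) - log_norm)
    return np.mean(lls)
```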

[Figure 2]
(image : samples from the model on MNIST, TFD, and CIFAR-10)

  • Rightmost column : nearest training example of the neighboring generated sample → model has not memorized the training set
  • Samples are fair random draws (not cherry-picked)
  • No Markov chain mixing in the sampling process → samples are uncorrelated

[Figure 3]
(image : digits obtained by linearly interpolating between coordinates in z space)

  • Linear interpolation between coordinates in z space of the full model (sketched below)
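A sketch of the Figure 3 interpolation, assuming the PyTorch Generator sketch above (here untrained, for illustration; the number of interpolation steps is arbitrary):

```python
import torch

G = Generator()                                   # from the sketch in Section 3 above
z_start, z_end = torch.randn(1, 100), torch.randn(1, 100)
alphas = torch.linspace(0.0, 1.0, steps=10).view(-1, 1)
z_path = (1 - alphas) * z_start + alphas * z_end  # (10, 100) linearly interpolated codes
with torch.no_grad():
    images = G(z_path)                            # (10, 784) decoded samples along the path
```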

6. Advantages and disadvantages

1) Disadvantages

  • No explicit representation of P_g(x)
  • D must be kept well synchronized with G during training (in particular, G must not be trained too much without updating D)
  • otherwise G may collapse too many values of z to the same value of x and lose the diversity needed to model P_data

2) Advantages

(1) Computational Advantages

  • Markov chains are never needed / only backprop is used to obtain gradients / no inference is needed during learning
  • Wide variety of functions can be incorporated into model

(2) Statistical Advantages from G

  • Not being updated directly with data, but only with gradients flowing through D
  • (= Components of input are not copied directly into G's parameters)
  • Can represent very sharp, even degenerate distributions

7. Conclusions and future work

  • conditional GAN p(x|c) : adding c as an input to both G and D (see the sketch after this list)
  • Learned approximate inference : training auxiliary network to predict z given x
    • Similar to inference net trained by wake-sleep algorithm
    • Advantage : inference net trained for a fixed G after G has finished training
  • All conditionals p(x_S | x_{S̸}) : S is a subset of the indices of x (S̸ its complement), learned by training a family of conditional models that share parameters
    • To implement a stochastic extension of deterministic MP-DBM
  • Semi-supervised learning : features from D or the inference net can improve classifiers when limited labeled data is available
  • Efficiency improvements : training accelerated by better methods for coordinating G and D or better distributions to sample z from during training
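A minimal sketch of the conditional GAN idea from the first bullet above: the condition c (e.g. a one-hot class label) is simply concatenated to the inputs of both G and D. Assuming PyTorch; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, z_dim=100, c_dim=10, hidden=256, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + c_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, x_dim), nn.Sigmoid(),
        )

    def forward(self, z, c):
        return self.net(torch.cat([z, c], dim=1))   # condition concatenated to the noise

class ConditionalDiscriminator(nn.Module):
    def __init__(self, x_dim=784, c_dim=10, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + c_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x, c):
        return self.net(torch.cat([x, c], dim=1))   # D sees the same condition as G
```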