claCase / NormalizingFlow

Implementation of NICE and RealNVP models in TensorFlow v2

Implementation of Non-linear Independent Components Estimation (NICE) and RealNVP in TensorFlow 2

This repository provides a TensorFlow v2 implementation of the NICE and RealNVP models, as described in the 2014 paper by Laurent Dinh, David Krueger, and Yoshua Bengio and in the 2016 paper by Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. The NICE model serves as the foundational layer for subsequent normalizing flow models.

Model Explanation

Main Components

The main idea behind the model is to:

  1. Transform the initial, unknown data distribution into a latent space with a known density via an invertible function.
  2. Train the model by maximizing the likelihood of the data under the known density, computed via the change-of-variables formula.
  3. Sample from the known density and invert the sampled points to reconstruct the original data space.

Mathematically, the process above is formalized as follows:

  1. Define the latent-space distribution as a product of independent logistic or Gaussian univariate densities:
$$ \mathbf{h} \sim p_{H}(\mathbf{h}) = \prod_{i}{p_{H_{i}}(h_{i})} $$
  2. Map the initial data distribution to the latent distribution via an invertible function $f$, parametrized by $\theta$:
  • Compute the latent representation and density:
$$ f: \mathbf{X} \rightarrow \mathbf{H} \Rightarrow f_{\theta}(\mathbf{x}) = \mathbf{h} $$ $$ f^{-1}: \mathbf{H} \rightarrow \mathbf{X} \Rightarrow f^{-1}_{\theta}(\mathbf{h}) = \mathbf{x} $$ $$ p_{\mathbf{X}}(\mathbf{x}) = p_{\mathbf{H}}(f_{\theta}(\mathbf{x})) \left| \det\left( \frac{\partial f_{\theta}(\mathbf{x})}{\partial \mathbf{x}} \right) \right| $$
  • Compute the log-likelihood of the data via the change-of-variables formula:
$$ \log p_{\mathbf{X}}(\mathbf{x}) = \sum_{i} \log p_{H_{i}}(f_{\theta}(\mathbf{x})_{i}) + \log \left| \det\left( \frac{\partial f_{\theta}(\mathbf{x})}{\partial \mathbf{x}} \right) \right| $$
  3. Samples from the initial data distribution are obtained by sampling the latent distribution and inverting:
$$\mathbf{h} \sim p_{H}(\mathbf{h})$$ $$\mathbf{x} = f^{-1}_{\theta}(\mathbf{h})$$
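As a concrete illustration, below is a minimal TensorFlow v2 sketch of this objective and sampling procedure, assuming a Gaussian prior and a hypothetical `flow` object exposing `forward` (returning $\mathbf{h}$ and the log-determinant) and `inverse` methods; it is not the exact API of this repository.

```python
import numpy as np
import tensorflow as tf

LOG_2PI = np.log(2.0 * np.pi)

def gaussian_log_prob(h):
    # Factorized standard Gaussian prior: sum of univariate log-densities
    return tf.reduce_sum(-0.5 * (tf.square(h) + LOG_2PI), axis=-1)

def nll_loss(flow, x):
    # flow.forward maps x -> h and returns log|det(df/dx)| per sample;
    # the change-of-variables formula gives log p_X(x), which training maximizes
    h, log_det = flow.forward(x)
    return -tf.reduce_mean(gaussian_log_prob(h) + log_det)

def sample(flow, n, dim):
    # Sample the base density, then invert the flow back to data space
    h = tf.random.normal([n, dim])
    return flow.inverse(h)
```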

Coupling Function (Additive Coupling)

Since $f$ must be invertible in order to evaluate the likelihood, update the parameters, and invert the samples drawn from the base prior distribution, the authors choose to implement an additive coupling rule, which takes the following form:

  1. Partition the initial data space into two partitions $x_{a}\in\mathbb{R}^{d}$ and $x_{b}\in\mathbb{R}^{D-d}$
  2. Apply a transformation $g$ to only one partition:
$$h_{a} = x_{a}$$ $$h_{b} = x_{b} + g_{\theta}(x_{a})$$

The inverse of this coupling function will be:

$$x_{a} = h_{a}$$ $$x_{b} = h_{b} - g_{\theta}(x_{a})$$

The Jacobian of this function is lower triangular with unit diagonal, and therefore has unit determinant:

$$\mathbb{J} = \begin{bmatrix} \frac{\partial{h_{a}}}{\partial{x_{a}}} & \frac{\partial{h_{a}}}{\partial{x_{b}}} \\ \frac{\partial{h_{b}}}{\partial{x_{a}}} & \frac{\partial{h_{b}}}{\partial{x_{b}}} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{I} & \mathbf{0}\\ \frac{\partial{h_{b}}}{\partial{x_{a}}} & \mathbf{I} \\ \end{bmatrix}$$

and the resulting determinant is:

Since the Jacobian is block lower triangular, its determinant is the product of the determinants of the diagonal blocks:

$$\det(\mathbb{J}) = \det(\mathbf{I}) \cdot \det(\mathbf{I}) = 1$$ $$\log(\det(\mathbb{J})) = \log(1) = 0$$
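A minimal Keras sketch of such a layer, assuming the input dimension is even and split in half; the class name, network widths, and `inverse` method are illustrative choices, not this repository's exact implementation:

```python
import tensorflow as tf

class AdditiveCoupling(tf.keras.layers.Layer):
    # Additive coupling: h_a = x_a, h_b = x_b + g(x_a); log(det(J)) = 0
    def __init__(self, hidden_units=128, **kwargs):
        super().__init__(**kwargs)
        self.hidden_units = hidden_units

    def build(self, input_shape):
        half = input_shape[-1] // 2  # assumes an even input dimension
        self.g = tf.keras.Sequential([
            tf.keras.layers.Dense(self.hidden_units, activation="relu"),
            tf.keras.layers.Dense(self.hidden_units, activation="relu"),
            tf.keras.layers.Dense(half),
        ])

    def call(self, x):
        xa, xb = tf.split(x, 2, axis=-1)
        hb = xb + self.g(xa)                # transform one partition only
        log_det = tf.zeros(tf.shape(x)[0])  # unit-determinant Jacobian
        return tf.concat([xa, hb], axis=-1), log_det

    def inverse(self, h):
        ha, hb = tf.split(h, 2, axis=-1)
        xb = hb - self.g(ha)                # exact inverse, since x_a = h_a
        return tf.concat([ha, xb], axis=-1)
```

In practice, several such layers are stacked, swapping the roles of the two partitions between layers so that every dimension is eventually transformed.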

Scaling Function

To make the transformation more flexible, the authors propose multiplying the output of the final coupling layer by an invertible scaling function applied element-wise:

$$y_{i} = g_{\theta_{i}}(x_{i}) = x_{i} \cdot e^{\theta_{i}}$$ $$x_{i} = g^{-1}_{\theta_{i}}(y_{i}) = y_{i} \cdot e^{-\theta_{i}}$$

The Jacobian of this function is diagonal, and the resulting determinant is the product of the diagonal components:

$$\mathbb{J} = \begin{bmatrix} e^{\theta_{1}} & & \mathbf{0} \\ & \ddots & \\ \mathbf{0} & & e^{\theta_{D}} \end{bmatrix}$$ $$\det(\mathbb{J}) = \prod_{i} e^{\theta_{i}}$$ $$\log(\det(\mathbb{J})) = \sum_{i}\theta_{i}$$
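A minimal sketch of such a scaling layer, under the same illustrative assumptions as above (the `inverse` method and weight name are hypothetical):

```python
import tensorflow as tf

class Scaling(tf.keras.layers.Layer):
    # Element-wise scaling y = x * exp(theta); log(det(J)) = sum_i theta_i
    def build(self, input_shape):
        self.theta = self.add_weight(
            name="log_scale",
            shape=(input_shape[-1],),
            initializer="zeros",
            trainable=True,
        )

    def call(self, x):
        # The log-determinant does not depend on x, so broadcast it per sample
        log_det = tf.fill([tf.shape(x)[0]], tf.reduce_sum(self.theta))
        return x * tf.exp(self.theta), log_det

    def inverse(self, y):
        return y * tf.exp(-self.theta)
```
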
Coupling Function (Affine Coupling)

In the RealNVP paper, the authors combine the additive and scaling couplings to jointly learn to translate and scale the base density space, with input-dependent translation and scaling parameters. The coupling takes the following form:

$$h_{a} = x_{a}$$ $$h_{b} = x_{b} \cdot \exp(s_{\theta}(x_{a})) + g_{\theta}(x_{a})$$

The inverse coupling function will be:

$$x_{a} = h_{a}$$ $$x_{b} = (h_{b} - g_{\theta}(x_{a})) \cdot \exp(-s_{\theta}(x_{a}))$$

where $g_{\theta}$ and $s_{\theta}$ are neural networks.
The Jacobian of this function is lower triangular, so its determinant is the product of its diagonal entries:

$$\mathbb{J} = \begin{bmatrix} \frac{\partial{h_{a}}}{\partial{x_{a}}} & \frac{\partial{h_{a}}}{\partial{x_{b}}} \\ \frac{\partial{h_{b}}}{\partial{x_{a}}} & \frac{\partial{h_{b}}}{\partial{x_{b}}} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{I} & \mathbf{0}\\ \frac{\partial{h_{b}}}{\partial{x_{a}}} & \mathrm{diag}(\exp(s_{\theta}(x_{a}))) \\ \end{bmatrix}$$ $$\det(\mathbb{J}) = \prod_{i} \exp(s_{\theta}(x_{a})_{i})$$ $$\log(\det(\mathbb{J})) = \sum_{i} s_{\theta}(x_{a})_{i}$$
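Following the same illustrative conventions as the additive sketch above, an affine coupling layer might look like this (class name, network widths, and `inverse` method are assumptions, not this repository's exact code):

```python
import tensorflow as tf

class AffineCoupling(tf.keras.layers.Layer):
    # Affine coupling: h_a = x_a, h_b = x_b * exp(s(x_a)) + g(x_a);
    # log(det(J)) = sum_i s(x_a)_i
    def __init__(self, hidden_units=128, **kwargs):
        super().__init__(**kwargs)
        self.hidden_units = hidden_units

    def build(self, input_shape):
        half = input_shape[-1] // 2  # assumes an even input dimension

        def net():
            return tf.keras.Sequential([
                tf.keras.layers.Dense(self.hidden_units, activation="relu"),
                tf.keras.layers.Dense(self.hidden_units, activation="relu"),
                tf.keras.layers.Dense(half),
            ])

        self.s = net()  # log-scale network
        self.g = net()  # translation network

    def call(self, x):
        xa, xb = tf.split(x, 2, axis=-1)
        s = self.s(xa)
        hb = xb * tf.exp(s) + self.g(xa)
        log_det = tf.reduce_sum(s, axis=-1)  # sum of the log-scales
        return tf.concat([xa, hb], axis=-1), log_det

    def inverse(self, h):
        ha, hb = tf.split(h, 2, axis=-1)
        xb = (hb - self.g(ha)) * tf.exp(-self.s(ha))
        return tf.concat([ha, xb], axis=-1)
```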

References

For further details, please refer to the original papers:

  • Dinh, L., Krueger, D., & Bengio, Y. (2014). NICE: Non-linear Independent Components Estimation. arXiv:1410.8516.
  • Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using Real NVP. arXiv:1605.08803.

Results

Circle Dataset

NICE Results

NICE Model Samples (Circle) · NICE Model Density (Circle)

RealNVP Results

RealNVP Model Samples (Circle) · RealNVP Model Density (Circle)

Half-Moons Dataset

NICE Results

NICE Model Samples (Half Moons) · NICE Model Density (Half Moons)

RealNVP Results

RealNVP Model Samples (Half Moons) · RealNVP Model Density (Half Moons)

Spirals Dataset

NICE Results

NICE Model Samples (Spirals) · NICE Model Density (Spirals)

RealNVP Results

RealNVP Model Samples (Spirals) · RealNVP Model Density (Spirals)
