The idea is to generate high-resolution images with diffusion models by training them over a lower-resolution quantized/latent space rather than over the original image space, and then letting an autoencoder's decoder upsample the generated low-resolution quantized image back to the original resolution.
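To make the two-stage pipeline concrete, here is a minimal inference-time sketch in PyTorch. The names `vqvae`, `sample_latent_ddpm`, and the latent shape are illustrative assumptions, not taken from the actual code:

```python
import torch

@torch.no_grad()
def generate(vqvae, sample_latent_ddpm, n=16, latent_shape=(3, 32, 32)):
    # Run the learned reverse diffusion entirely in the low-resolution latent space.
    z = sample_latent_ddpm(torch.randn(n, *latent_shape))
    # Decode the denoised latents back up to full-resolution images.
    return vqvae.decode(z)
```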
The CelebA-HQ Dataset
A dataset of $30,000$ celebrity faces, each of size $256 \times 256$.
Autoencoder (VQVAE) Training
Trained for 20 epochs at about 25 min per epoch (around 9 hr of training).
A UNet-based autoencoder with a quantized/latent representation whose spatial size is 1/8th of the original image. Reconstruction is trained with L2 + perceptual + adversarial losses.
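A minimal sketch of the vector-quantization bottleneck such an autoencoder uses. The codebook size, embedding dimension, and commitment weight `beta` here are illustrative defaults, not the exact repo configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z_e):                                     # z_e: (B, dim, H, W)
        B, C, H, W = z_e.shape
        flat = z_e.permute(0, 2, 3, 1).reshape(-1, C)           # (B*H*W, dim)
        dist = torch.cdist(flat, self.codebook.weight)          # distance to every code
        idx = dist.argmin(dim=1)                                # nearest code per vector
        z_q = self.codebook(idx).view(B, H, W, C).permute(0, 3, 1, 2)
        # Codebook + commitment losses (VQ-VAE paper), added alongside the
        # L2 / perceptual / adversarial reconstruction terms.
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()                        # straight-through estimator
        return z_q, loss, idx
```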
Reverse Diffusion Process over the Quantized Images, Upsampled to the Original Resolution using the VQVAE Decoder
Trained for 100 epochs on $8 \times 8$ quantized images at about 4.5 min per epoch (around 8.5 hr of training).
Final Generated Images
DDPM Implementation
Reverse Diffusion Process over MNIST Digits
Trained for 30 epochs on $28 \times 28$ images (around 1.5 hr of training).
Reverse Diffusion Process over CelebA Faces
Trained for 10 epochs on $32 \times 32$ images (around 2.5 hr of training).
Denoising Diffusion Probabilistic Models
(Posted Part I: 7 Jun 2024) (Posted Part II: 8 Jun 2024)
As seen in the case of Variational Autoencoders, it all boils down to learning two probability distributions: the posterior $p(\textbf{z} | \textbf{x})$, which abstracts an input image $\textbf{x}$ into a hidden representation $\textbf{z}$, and the likelihood $p(\textbf{x} | \textbf{z})$ of generating image samples given some hidden representation $\textbf{z}$.
Now the most crucial task in all these generative models is relating the objective we are trying to achieve to what the model actually learns. We'll see a similarly surprising conclusion established by the end of this blog, and then we'll realise how beautifully all the mathematics and the tasks laid out make sense.
Throwback to Variational Autoencoders
As in the case of VAEs, we started off by approximating the actual $P(\textbf{z} | \textbf{x})$ with our probabilistic Encoder $Q_{\phi}(\textbf{z} | \textbf{x})$ and minimising the KL divergence between the two. Since the actual $P(\textbf{z} | \textbf{x})$ is unknown, we instead maximised the log-likelihood of the data samples $\textbf{x}$, which eventually made the encoder learn a distribution $Q_{\phi}(\textbf{z} | \textbf{x})$ as close to the standard normal $\mathcal{N}(\textbf{0}, \mathbb{I})$ as possible. Hence, drawing any $\textbf{z} \sim \mathcal{N}(\textbf{0}, \mathbb{I})$, we can be sure it is close to the $\textbf{z}$'s seen during training, allowing us to discard the encoder entirely at inference. The way we set up the objective of making the actual and approximated distributions close to each other will stay the same for Diffusion Models too, and this will allow us to uncover more about the actual distribution itself.
Graphical Model of a Variational Autoencoder.
What are Diffusion Models?
For Diffusion Models, instead of one latent variable $\textbf{z}$, we have $T$ latent variables of the form $\textbf{x}_1, \textbf{x}_2, \cdots, \textbf{x}_T$ of the same dimension as the input image $\textbf{x}_0$, and the most interesting point is that the forward noising process is a fixed (not learned) Markov Chain, wherein Gaussian noise is added gradually over $T$ steps, defined as:
$$ q(\textbf{x}_t | \textbf{x}_{t - 1}) = \mathcal{N}(\textbf{x}_t; \sqrt{1 - \beta_t}\, \textbf{x}_{t - 1}, \beta_t \mathbb{I}) $$
Here the variances are controlled by a scheduler $\left \{ \beta_t \in (0, 1) \right \}_{t = 1}^T$, which means each noisy sample $\textbf{x}_t$ is drawn from a Gaussian with mean $\boldsymbol{\mu}_q = \sqrt{1 - \beta_{t}}\, \textbf{x}_{t - 1}$ and covariance matrix $\mathbf{\Sigma}_q = \beta_{t} \mathbb{I}$. The idea is then to learn the reverse denoising diffusion distribution $q(\textbf{x}_{t - 1} | \textbf{x}_t)$, which is also a Markov Chain with learned Gaussian transitions starting at $p(\textbf{x}_T) = \mathcal{N}(\textbf{x}_T; \textbf{0}, \mathbb{I})$. It therefore becomes really important to understand the entire joint distribution $p(\textbf{x}_0, \textbf{x}_1, \cdots, \textbf{x}_T)$, denoted in shorthand as $p(\textbf{x}_{0:T})$.
Graphical Model of a Diffusion Process.
Prerequisites
Joint & Conditional Distribution of $N$ RVs and Bayes' Rule
A joint distribution over $N$ random variables assigns probabilities to all the events involving these $N$ random variables^[$k^N$ values if each RV can take $k$ values], denoted as
$$ P(X_1, X_2, X_3, \cdots, X_N) $$
Now starting off with just two RVs, the conditional probabilities $P(X_1 | X_2)$ and $P(X_2 | X_1)$ can be calculated from the joint distribution as:
$$ P(X_1 | X_2) = \frac{P(X_1, X_2)}{P(X_2)}, \qquad P(X_2 | X_1) = \frac{P(X_1, X_2)}{P(X_1)} $$
The utility of this is that, using the chain rule of joint probability together with the Markov property, we may write all the forward steps of our diffusion process $q(\textbf{x}_{1 : T} | \textbf{x}_0)$ as
$$ q(\textbf{x}_{1 : T} | \textbf{x}_0) = \prod_{t = 1}^{T} q(\textbf{x}_t | \textbf{x}_{t - 1}) $$
Diffusion is the process of converting samples from a complex distribution (the data here) $\textbf{x}_0 \sim q(\textbf{x}_0)$ to samples of a simple distribution (isotropic Gaussian noise) $\textbf{x}_T \sim \mathcal{N}(\textbf{0}, \mathbb{I})$. One can also observe that there is a $\color{purple}{\text{deterministic}}$ and a $\color{blue}{\text{stochastic}}$ component even in our case. Since any RV can be reparametrized as $Z = \sigma X + \mu$, we denote the $\textbf{x}_t$ being drawn from $q(\textbf{x}_t | \textbf{x}_{t - 1})$ as
$$ \textbf{x}_t = \color{purple}{\sqrt{1 - \beta_t}\, \textbf{x}_{t - 1}} + \color{blue}{\sqrt{\beta_t}\, \boldsymbol{\epsilon}_{t - 1}}, \qquad \boldsymbol{\epsilon}_{t - 1} \sim \mathcal{N}(\textbf{0}, \mathbb{I}) $$
One might wonder why following the above Markov chain of Gaussians leads to $\textbf{x}_T \sim \mathcal{N}(\textbf{0}, \mathbb{I})$. To understand this, let's take constant values for the above, $\textbf{x}_t = \sqrt{\alpha}\, \textbf{x}_{t - 1} + \sqrt{\beta}\, \boldsymbol{\epsilon}_{t - 1}$ (with $\alpha$ arbitrary for now), and unroll the recursion:
$$ \textbf{x}_T = (\sqrt{\alpha})^T \textbf{x}_0 + \sqrt{\beta} \left( (\sqrt{\alpha})^{T - 1} \boldsymbol{\epsilon}_0 + (\sqrt{\alpha})^{T - 2} \boldsymbol{\epsilon}_1 + \cdots + \boldsymbol{\epsilon}_{T - 1} \right) $$
We can combine the independent Gaussians into one Gaussian (two zero-mean Gaussians with variances $\sigma_1^2\mathbb{I}$ and $\sigma_2^2\mathbb{I}$ merge into $\mathcal{N}(\textbf{0}, (\sigma_1^2 + \sigma_2^2)\mathbb{I})$): the noise terms have variances $(\beta \alpha^{T - 1}, \beta \alpha^{T - 2}, \cdots, \beta \alpha, \beta)$, which sum to $\sigma^2 = \beta \frac{1 - \alpha^T}{1 - \alpha}$. Notice that as $T \to \infty$, $(\sqrt{\alpha})^T \to 0$ (for $\alpha < 1$) and $\sigma^2 \to \frac{\beta}{1 - \alpha}$, which equals $1$ exactly when $\alpha = 1 - \beta$; only then does $\textbf{x}_T \to \mathcal{N}(\textbf{0}, \mathbb{I})$.
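A quick numeric check of this claim, with an illustrative constant $\beta = 0.02$:

```python
import math

# With constant beta and alpha = 1 - beta, the accumulated variance
# beta * (1 - alpha**T) / (1 - alpha) should tend to 1, while the
# signal coefficient (sqrt(alpha))**T should vanish.
beta = 0.02
alpha = 1 - beta
for T in (10, 100, 1000):
    signal = math.sqrt(alpha) ** T
    var = beta * (1 - alpha ** T) / (1 - alpha)
    print(f"T={T:5d}  signal coeff={signal:.6f}  total variance={var:.6f}")
# As T grows, the signal coefficient goes to 0 and the variance approaches 1,
# i.e. x_T approaches an isotropic standard Gaussian.
```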
Do we traverse for all $T$ steps?
Certainly not! Here's how the Markov process allows us to reach any $\textbf{x}_t$ directly from the image $\textbf{x}_0$. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i = 1}^{t} \alpha_i$; repeatedly applying the single-step reparametrization and merging the Gaussians as before gives
$$ \textbf{x}_t = \sqrt{\bar{\alpha}_t}\, \textbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\textbf{0}, \mathbb{I}), \qquad \text{i.e.} \quad q(\textbf{x}_t | \textbf{x}_0) = \mathcal{N}(\textbf{x}_t; \sqrt{\bar{\alpha}_t}\, \textbf{x}_0, (1 - \bar{\alpha}_t) \mathbb{I}) $$
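In code, this closed form lets us noise any image to an arbitrary timestep in a single call. A sketch, using the linear $\beta$ schedule from the DDPM paper (the exact schedule in any given implementation may differ):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear schedule from the DDPM paper
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # \bar{alpha}_t = prod_{i<=t} alpha_i

def q_sample(x0, t, eps=None):
    """Jump from x0 directly to x_t using q(x_t | x_0); t is a batch of timesteps."""
    if eps is None:
        eps = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, 1, 1, 1)     # broadcast over (B, C, H, W)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps, eps
```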
If we can reverse the above process and sample from $q(\textbf{x}_{t - 1} | \textbf{x}_t)$, we will be able to recreate a true sample starting from Gaussian noise, $\textbf{x}_T \sim \mathcal{N}(\textbf{0}, \mathbb{I})$. Note that if $\beta_t$ is small enough, $q(\textbf{x}_{t - 1} | \textbf{x}_t)$ will also be Gaussian. Unfortunately, we cannot easily estimate $q(\textbf{x}_{t - 1} | \textbf{x}_t)$ because doing so would require the entire dataset, and therefore we need to learn a model $p_{\theta}$ to approximate these conditional probabilities in order to run the reverse diffusion process.
The actual reverse distribution
Before moving on to defining the objective for finding the approximate $p_{\theta}(\textbf{x}_{t - 1} | \textbf{x}_t)$, it's worth understanding the actual reverse-process distribution $q(\textbf{x}_{t - 1} | \textbf{x}_t)$. As stated in the DDPM paper, the reverse conditional distribution is tractable when conditioned on $\textbf{x}_0$, and since this is a Markov process, we can safely introduce $\textbf{x}_0$ into the conditional and expand by Bayes' Rule:
$$ q(\textbf{x}_{t - 1} | \textbf{x}_t, \textbf{x}_0) = \frac{q(\textbf{x}_t | \textbf{x}_{t - 1}, \textbf{x}_0)\, q(\textbf{x}_{t - 1} | \textbf{x}_0)}{q(\textbf{x}_t | \textbf{x}_0)} $$
Notice that all of these are forward-process terms, and using $\mathcal{N}(\textbf{x}; \boldsymbol{\mu}, \sigma^2 \mathbb{I}) \propto \text{exp}\left(-\frac{1}{2} \frac{(\textbf{x} - \boldsymbol{\mu})^2}{\sigma^2}\right)$, completing the square yields another Gaussian, $q(\textbf{x}_{t - 1} | \textbf{x}_t, \textbf{x}_0) = \mathcal{N}(\textbf{x}_{t - 1}; \tilde{\boldsymbol{\mu}}_t(\textbf{x}_t, \textbf{x}_0), \tilde{\beta}_t \mathbb{I})$, with
$$ \tilde{\boldsymbol{\mu}}_t(\textbf{x}_t, \textbf{x}_0) = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t - 1})}{1 - \bar{\alpha}_t}\, \textbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t - 1}}\, \beta_t}{1 - \bar{\alpha}_t}\, \textbf{x}_0, \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t - 1}}{1 - \bar{\alpha}_t}\, \beta_t $$
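Translating these expressions directly into code (a sketch reusing the `betas`/`alphas`/`alpha_bars` schedule defined in the earlier snippet; `t` here is a single integer timestep):

```python
import torch

def q_posterior(x0, xt, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for an integer timestep t."""
    ab_t = alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)
    mean = (alphas[t].sqrt() * (1 - ab_prev) * xt
            + ab_prev.sqrt() * betas[t] * x0) / (1 - ab_t)
    var = (1 - ab_prev) / (1 - ab_t) * betas[t]
    return mean, var
```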
As discussed before, we will follow the same methodology as in VAEs: learn the approximate reverse distribution $p_{\theta}(\textbf{x}_{t - 1} | \textbf{x}_t)$ by maximizing the expected log-likelihood of the observed data $p_{\theta}(\textbf{x}_0)$ for $\textbf{x}_0 \sim q(\textbf{x}_0)$:
$$ \log p_{\theta}(\textbf{x}_0) = \log \int p_{\theta}(\textbf{x}_{0 : T})\, d\textbf{x}_{1 : T} = \log \mathbb{E}_{q(\textbf{x}_{1 : T} | \textbf{x}_0)} \left[ \frac{p_{\theta}(\textbf{x}_{0 : T})}{q(\textbf{x}_{1 : T} | \textbf{x}_0)} \right] \geq \mathbb{E}_{q(\textbf{x}_{1 : T} | \textbf{x}_0)} \left[ \log \frac{p_{\theta}(\textbf{x}_{0 : T})}{q(\textbf{x}_{1 : T} | \textbf{x}_0)} \right] $$
Jensen's Inequality.
Using Jensen's Inequality on the $\log$ function (a concave function), the expectation of the $\log$ is less than or equal to the $\log$ of the expectation.
Notice that both these terms are joint probability distributions, with $\color{OrangeRed}{q(\textbf{x}_{1 : T} | \textbf{x}_0)}$ being the actual forward process and $\color{OrangeRed}{p_{\theta}(\textbf{x}_{0 : T})}$ the approximate reverse process. Expanding these terms out,
$$ \mathbb{E}_{q} \left[ \log \frac{p_{\theta}(\textbf{x}_{0 : T})}{q(\textbf{x}_{1 : T} | \textbf{x}_0)} \right] = \mathbb{E}_{q} \left[ \log \frac{p(\textbf{x}_T) \prod_{t = 1}^{T} p_{\theta}(\textbf{x}_{t - 1} | \textbf{x}_t)}{\prod_{t = 1}^{T} q(\textbf{x}_t | \textbf{x}_{t - 1})} \right] $$
Further, we'll condition the forward process on $\textbf{x}_0$, as it allows us to expand terms using Bayes' Rule: $q(\textbf{x}_t | \textbf{x}_{t - 1}, \textbf{x}_0) = \frac{q(\textbf{x}_{t - 1} | \textbf{x}_t, \textbf{x}_0) \cdot q(\textbf{x}_t | \textbf{x}_0)}{q(\textbf{x}_{t - 1} | \textbf{x}_0)}$. After substituting and telescoping, the negative of the bound splits into
$$ L = \mathbb{E}_{q} \Big[ \underbrace{D_{KL}(q(\textbf{x}_T | \textbf{x}_0) \parallel p(\textbf{x}_T))}_{L_T} + \sum_{t = 2}^{T} \underbrace{D_{KL}(q(\textbf{x}_{t - 1} | \textbf{x}_t, \textbf{x}_0) \parallel p_{\theta}(\textbf{x}_{t - 1} | \textbf{x}_t))}_{L_{t - 1}} - \underbrace{\log p_{\theta}(\textbf{x}_0 | \textbf{x}_1)}_{L_0} \Big] $$
The above loss function aims to bring the actual reverse distribution $q(\textbf{x}_{t - 1} | \textbf{x}_t, \textbf{x}_0)$ and the approximated reverse distribution $p_{\theta}(\textbf{x}_{t - 1} | \textbf{x}_t)$ as close as possible by means of the $T - 1$ KL divergence terms. Since we approximate the reverse process using a neural network with fixed covariance, the divergence terms reduce to bringing their means as close as possible: for two Gaussians with the same covariance $p = \mathcal{N}(\textbf{x}; \boldsymbol{\mu}_1, \sigma^2 \mathbb{I})$ and $q = \mathcal{N}(\textbf{x}; \boldsymbol{\mu}_2, \sigma^2 \mathbb{I})$, $D_{KL}(p \parallel q) = \frac{1}{2 \sigma^2} \lVert \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 \rVert^2$, so
$$ L_{t - 1} = \mathbb{E}_{q} \left[ \frac{1}{2 \sigma_t^2} \lVert \tilde{\boldsymbol{\mu}}_t(\textbf{x}_t, \textbf{x}_0) - \boldsymbol{\mu}_{\theta}(\textbf{x}_t, t) \rVert^2 \right] $$
The authors, however, define this in terms of noise prediction. Since $\boldsymbol{\mu}_q(\textbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left(\textbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_t \right)$, for the approximate reverse-process distribution we may write
$$ \boldsymbol{\mu}_{\theta}(\textbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \textbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \boldsymbol{\epsilon}_{\theta}(\textbf{x}_t, t) \right) \quad \Rightarrow \quad L_{t - 1} = \mathbb{E} \left[ \frac{(1 - \alpha_t)^2}{2 \sigma_t^2\, \alpha_t (1 - \bar{\alpha}_t)} \lVert \boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_{\theta}(\textbf{x}_t, t) \rVert^2 \right] $$
Notice how beautifully it boils down to making the model learn to predict, from any $\textbf{x}_t$, the actual noise $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\textbf{0}, \mathbb{I})$ that produced it, via $\boldsymbol{\epsilon}_{\theta}(\textbf{x}_t, t)$. This seemingly odd learning task is what lets us learn the denoising reverse distribution.
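Put together, the reverse process becomes a simple ancestral-sampling loop. A sketch, again reusing the schedule tensors from the earlier snippet; `model` stands for any noise-prediction network $\boldsymbol{\epsilon}_{\theta}(\textbf{x}_t, t)$, e.g. a time-conditioned UNet:

```python
import torch

@torch.no_grad()
def p_sample_loop(model, shape):
    """Generate samples by iteratively denoising pure Gaussian noise."""
    x = torch.randn(shape)                                        # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)                                   # predicted noise
        coeff = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        mean = (x - coeff * eps) / alphas[t].sqrt()               # mu_theta(x_t, t)
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)      # sigma_t^2 = beta_t choice
        else:
            x = mean                                              # no noise at the final step
    return x
```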
Modified Objective
The authors further found that training works better when the weighting coefficient in front of the noise-matching term is dropped entirely, so the final objective is
$$ L_{\text{simple}} = \mathbb{E}_{t, \textbf{x}_0, \boldsymbol{\epsilon}} \left[ \lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\sqrt{\bar{\alpha}_t}\, \textbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon},\, t) \rVert^2 \right] $$
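This is exactly a mean-squared error between the true and the predicted noise. Below is a sketch of one training step under this objective, reusing `q_sample` and the schedule from the earlier snippets; `model` and `optimizer` are placeholder names:

```python
import torch

def train_step(model, optimizer, x0):
    """One gradient step on L_simple for a batch of clean images x0."""
    t = torch.randint(0, T, (x0.shape[0],))      # uniform random timestep per sample
    xt, eps = q_sample(x0, t)                    # noise x0 to x_t in one jump
    loss = ((eps - model(xt, t)) ** 2).mean()    # plain MSE, weighting coefficient dropped
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```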
Let $p$ and $q$ be two Normal Distributions denoted as $\mathcal{N}(\boldsymbol{\mu}_p, \Sigma_p)$ and $\mathcal{N}(\boldsymbol{\mu}_q, \Sigma_q)$ respectively. Their KL divergence in $d$ dimensions has the closed form
$$ D_{KL}(p \parallel q) = \frac{1}{2} \left[ \log \frac{|\Sigma_q|}{|\Sigma_p|} - d + \text{tr}(\Sigma_q^{-1} \Sigma_p) + (\boldsymbol{\mu}_q - \boldsymbol{\mu}_p)^{\top} \Sigma_q^{-1} (\boldsymbol{\mu}_q - \boldsymbol{\mu}_p) \right] $$
Learning Abstraction $\to$ a hidden representation given the input, $P(z | X)$ - this is achieved by the Encoder $Q_{\theta}(z | X)$.
Generation $\to$ generating $X$ given some hidden representation $z$, using the Decoder $P_{\phi}(X | z)$.
For all of this, our aim is to understand the joint distribution $P(X, z) = P(z) \cdot P(X | z)$. At inference, given some observed $X$, we want to find the most likely assignments of the latent variables $z$ that would result in this observation.
Hence, instead, we approximate the posterior distribution $P(z | X)$ with $Q_{\theta}(z | X)$, and further assume that $Q_{\theta}(z | X)$ is a Gaussian whose parameters are determined by our neural network $\to$ the Encoder.
And since the final task is maximising the log-likelihood of $P(X)$, it is equivalent to maximizing the $\color{blue}{\text{Blue Term}}$ (the evidence lower bound). So, the final objective is
$$ \max_{\theta, \phi} \; \mathbb{E}_{z \sim Q_{\theta}(z | X)} \left[ \log P_{\phi}(X | z) \right] - D_{KL}(Q_{\theta}(z | X) \parallel P(z)) $$
Now clearly all the terms are within our reach. To get the KL divergence, we make a forward pass through the Encoder to get $Q_{\theta}(z | X) = \mathcal{N}(\boldsymbol{\mu}_z(X), \Sigma_z(X))$, and we know $P(z) = \mathcal{N}(\textbf{0}, \mathbb{I})$, for which the divergence has the closed form
$$ D_{KL}(Q_{\theta}(z | X) \parallel P(z)) = \frac{1}{2} \left( \text{tr}(\Sigma_z) + \boldsymbol{\mu}_z^{\top} \boldsymbol{\mu}_z - d - \log |\Sigma_z| \right) $$
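In code, with the usual diagonal-Gaussian parametrization where the encoder outputs `mu` and `logvar` (a common convention, assumed here):

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """Closed-form D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=1)
```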
Now, in order for the backpropagation algorithm to work, we introduce continuity in the sampling of $z$ by moving the sampling process to an input layer. This is done by first sampling $\epsilon \sim \mathcal{N}(0, I)$ from a standard Gaussian and then obtaining $z$ with the required $\boldsymbol{\mu_z}(X), \Sigma_z(X)$:
$$ z = \boldsymbol{\mu_z}(X) + \Sigma_z(X)^{1/2}\, \epsilon $$
Hence, the randomness has been shifted to $\epsilon$, and not to $X$ or the parameters of the model.
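The corresponding code is a one-liner; assuming again that the encoder outputs `mu` and `logvar` for a diagonal Gaussian:

```python
import torch

def reparameterize(mu, logvar):
    """z = mu + sigma * eps, eps ~ N(0, I); gradients flow through mu and sigma."""
    eps = torch.randn_like(mu)
    return mu + (0.5 * logvar).exp() * eps       # exp(logvar / 2) = sigma
```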
Generation Part
After the model parameters are learned, we remove the encoder and feed a $z \sim \mathcal{N}(0, I)$ to the decoder. The decoder then predicts $f_{\phi}(z)$, and we can draw an $X \sim \mathcal{N}(f_{\phi}(z), I)$.
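As a sketch of this generation step (the decoder interface and `latent_dim` are illustrative):

```python
import torch

@torch.no_grad()
def generate_samples(decoder, n=16, latent_dim=128):
    """Sample z from the prior and decode; decoder(z) is f_phi(z), the mean of P(X | z)."""
    z = torch.randn(n, latent_dim)
    return decoder(z)    # in practice the mean f_phi(z) is used directly as the image
```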