pavas23 / Generative-AI

This project involves building generative architectures, including a few variants of GANs, Variational Autoencoders and Diffusion Models.


Image Generation and Image-to-Image Translation

Part A - Image Generation

In this part we train generative models to produce images from random vectors sampled from the latent space, using Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models on the MNIST dataset.

VAE

The VAE is built only from linear layers; the simplicity of the dataset meant we did not need to make the model more complex with convolutional or pooling layers.

In our architecture, the encoder learns the mean and variance of a latent variable, and the decoder reconstructs the images from samples drawn from that distribution. We use ReLU and sigmoid activation functions in the layers.
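
A minimal sketch of this setup is shown below. The layer widths, latent size, and the BCE + KL training loss are illustrative assumptions, not necessarily the exact configuration used in this repo.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=8):
        super().__init__()
        # Encoder: maps a flattened image to the mean and log-variance of the latent variable.
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder: reconstructs the image from a latent sample; sigmoid keeps pixels in [0, 1].
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def reparameterize(self, mu, logvar):
        # Sample z = mu + sigma * eps so gradients can flow through the sampling step.
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        h = self.enc(x.view(x.size(0), -1))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term (BCE over pixels) + KL divergence to the standard normal prior.
    bce = nn.functional.binary_cross_entropy(recon, x.view(x.size(0), -1), reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kld
```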

GAN

The GAN is trained to generate realistic MNIST digit images. The generator learns to generate images from random noise, while the discriminator learns to distinguish between real and fake images. The training loop alternates between training the discriminator and the generator.

The generator aims to minimize the discriminator's ability to distinguish fake images from real ones. The discriminator aims to maximize its ability to distinguish between real and fake images.

We use ReLU in the generator layers, and ReLU and sigmoid activation functions in the discriminator layers. We also chose the Adam optimizer, since it works well with a lower learning rate and converges faster.
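
The alternating training loop described above can be sketched as follows. The tiny generator/discriminator definitions, the sigmoid output on the generator, and the learning rate are placeholders for illustration, not the exact architectures used in this repo.

```python
import torch
import torch.nn as nn

latent_dim = 8
# Simplified placeholder networks: ReLU hidden layers, sigmoid outputs.
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 784), nn.Sigmoid())
D = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

criterion = nn.BCELoss()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real_images):
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1) Train the discriminator: distinguish real images from generated ones.
    opt_D.zero_grad()
    d_real = criterion(D(real_images.view(batch, -1)), real_labels)
    z = torch.randn(batch, latent_dim)
    fake = G(z)
    d_fake = criterion(D(fake.detach()), fake_labels)
    (d_real + d_fake).backward()
    opt_D.step()

    # 2) Train the generator: make the discriminator classify generated images as real.
    opt_G.zero_grad()
    g_loss = criterion(D(fake), real_labels)
    g_loss.backward()
    opt_G.step()
```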

The following are images generated from random latent points for latent sizes 2, 4 and 8.

fake_2_50 fake_4_50 fake_8_50

The following plot also shows how the generator's training loss decreases over epochs for each latent size.

Generator

Diffusion Models

We implement a diffusion model for the MNIST dataset. Diffusion models are used for image generation and manipulation tasks: the model gradually adds noise to images in a forward process and learns the reverse denoising process, allowing high-quality samples to be generated.

SiLU (Sigmoid-weighted Linear Unit) is used as the activation function.

As before, we use the Adam optimizer for its lower learning-rate requirements and faster convergence.
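
A minimal sketch of the forward (noising) process and a noise-prediction training step is shown below. The linear beta schedule, the number of steps, and the tiny SiLU network are assumptions for illustration; the actual model in this repo is presumably more elaborate.

```python
import torch
import torch.nn as nn

# Assumed linear beta schedule over T timesteps.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def q_sample(x0, t, noise):
    # Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise.
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

class TinyNoisePredictor(nn.Module):
    # Placeholder noise-prediction network using SiLU activations.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x, t):
        # A full model would also condition on the timestep t (e.g. via embeddings); omitted here.
        return self.net(x)

model = TinyNoisePredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

def train_step(x0):
    # Pick a random timestep per image, noise the image, and train the model to predict that noise.
    t = torch.randint(0, T, (x0.size(0),))
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    loss = nn.functional.mse_loss(model(x_t, t), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```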

These are results obtained from the diffusion model at different step counts.

steps_00000469 steps_00006097 steps_00015946

Part B - Image-to-Image Translation

We focus on image-to-image translation, converting source images into target images so that specific visual properties are altered while others are preserved. This is accomplished by employing two variants of Generative Adversarial Networks (GANs) on the CelebA dataset.

CycleGAN

We implement a CycleGAN (Cycle-Consistent Generative Adversarial Network) for image-to-image translation. CycleGANs are used to learn mappings between two different domains without requiring paired data.

The activation functions used are:

  • Leaky ReLU: Used in both the generator and discriminator.
  • ReLU: Used in the generator for non-linearity.
  • Sigmoid: Used in the discriminator to output probabilities.

As in Part A, we use the Adam optimizer for the same reasons.

Adversarial Loss (GAN Loss)

Binary Cross-Entropy (BCE) loss is used to train the generators and discriminators by minimizing the difference between real and fake predictions.

Cycle Consistency Loss

Mean Absolute Error (L1 loss) is used to enforce cycle consistency between the original and reconstructed images, ensuring that the image after translation and back-translation is close to the original image.
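
A minimal sketch of how these two losses combine for one translation direction is shown below. The names G_AB, G_BA, D_B and the cycle weight lambda_cyc are hypothetical and only illustrate the idea.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()  # adversarial (GAN) loss
l1 = nn.L1Loss()    # cycle-consistency loss

def generator_losses(G_AB, G_BA, D_B, real_A, lambda_cyc=10.0):
    # Adversarial loss: G_AB tries to make D_B classify its translations as real.
    fake_B = G_AB(real_A)
    pred = D_B(fake_B)
    adv_loss = bce(pred, torch.ones_like(pred))

    # Cycle-consistency loss: translating A -> B -> A should recover the original image.
    recovered_A = G_BA(fake_B)
    cycle_loss = l1(recovered_A, real_A)

    return adv_loss + lambda_cyc * cycle_loss
```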

Men without glasses to men with glasses and vice versa

1 2 3 4 5

Men with glasses to women with glasses and vice versa

1 2 3 4 5

Deep Convolutional GAN (DCGAN)

We implement three neural network models: a Generative Adversarial Network (GAN) Generator, a GAN Discriminator, and a CNN Encoder.

The activation functions used are:

  • Leaky ReLU: Used in both the discriminator and encoder models.
  • ReLU: Used in the generator for non-linearity.
  • Sigmoid: Used in the final layer of the discriminator to output probabilities.

The Loss function used for training is Binary Cross-Entropy Loss.

Vector Arithmetic Result

Men without glasses + People with glasses - People without glasses

men_people

Men with glasses - Men without glasses + Women without glasses

men_women

Smiling Men + People with Hat - People with Hat + People with Mustache - People without Mustache

men_hat
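
A minimal sketch of how such latent vector arithmetic can be computed with the CNN encoder and the GAN generator is shown below. The function and argument names are hypothetical, and the shapes assume a generator that takes a flat latent vector.

```python
import torch

@torch.no_grad()
def attribute_arithmetic(encoder, generator, men_no_glasses, people_with_glasses, people_no_glasses):
    # Encode each image group into the latent space and take the mean latent vector per group.
    z_men = encoder(men_no_glasses).mean(dim=0)
    z_glasses = encoder(people_with_glasses).mean(dim=0)
    z_no_glasses = encoder(people_no_glasses).mean(dim=0)

    # "Men without glasses + People with glasses - People without glasses"
    z = z_men + z_glasses - z_no_glasses

    # Decode the combined latent vector back into an image with the generator.
    return generator(z.unsqueeze(0))
```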


