image-colorization

This framework facilitates the training and evaluation of various deep neural networks for the task of image colorization. In particular, it offers the following colorization models, features and evaluation methods:

Colorization models

  • ResNet Colorization Network
  • Conditional GAN (CGAN)
  • U-Net

Evaluation methods and metrics

  • Mean Squared Error (MSE)
  • Mean LPIPS Perceptual Similarity (PS)
  • Semantic Interpretability (SI)

Prerequisites

The framework is implemented in Python (3.6) using PyTorch v1.0.1.

Please consult ./env/mlp_env.yml for a full list of the dependencies of the Conda environment that was used in the development of this framework. If Conda is used as a package and environment manager, one can run conda create --name myenv --file ./env/mlp_env.txt to recreate the aforementioned environment.

Structure

  • train.py - main entry point of the framework
  • src/options.py - parses arguments (e.g. task specification, model options)
  • src/main.py - set-up of task environment (e.g. models, dataset, evaluation method)
  • src/dataloaders.py - downloads and (sub)samples datasets, and provides iterators over the dataset elements
  • src/models.py - contains the implementations of the model architectures
  • src/utils.py - contains various helper functions and classes
  • src/colorizer.py - trains and validates colorization models
  • src/classifier.py - trains and validates image-classification models (used for SI)
  • src/eval_gen - contains helper functions for the evaluation of model colorizations
  • src/eval_mse.py - evaluates colorizations by MSE
  • src/eval_ps.py - evaluates colorizations by the Mean LPIPS Perceptual Similarity (PS)
  • src/eval_si.py - evaluates colorizations by Semantic Interpretability (SI)

Usage

Training of models
python train.py [--option ...] where the options are:

| option | description | type | oneOf | default |
| --- | --- | --- | --- | --- |
| seed | random seed | int | not applicable | 0 |
| task | the task that should be executed | str | ['colorizer', 'classifier', 'eval-gen', 'eval-si', 'eval-ps', 'eval-mse'] | 'colorizer' |
| experiment-name | the name of the experiment | str | not applicable | 'experiment_name' |
| model-name | colorization model architecture that should be used | str | ['resnet', 'unet32', 'unet224', 'nazerigan32', 'nazerigan224', 'cgan'] | 'resnet' |
| model-suffix | colorization model name suffix | str | not applicable | not applicable |
| model-path | path for the pretrained models | str | not applicable | './models' |
| dataset-name | the dataset to use | str | ['placeholder', 'cifar10', 'places100', 'places205', 'places365'] | 'placeholder' |
| dataset-root-path | dataset root path | str | not applicable | './data' |
| use-dataset-archive | load dataset from TAR archive | str2bool | [True, False] | False |
| output-root-path | path for output (e.g. model weights, stats, colorizations) | str | not applicable | './output' |
| max-epochs | maximum number of epochs to train for | int | not applicable | 5 |
| train-batch-size | training batch size | int | not applicable | 100 |
| val-batch-size | validation batch size | int | not applicable | 100 |
| batch-output-frequency | frequency with which to output batch statistics | int | not applicable | 1 |
| max-images | maximum number of images from the validation set to be saved (per epoch) | int | not applicable | 10 |
| eval-root-path | the root path for evaluation images | str | not applicable | './eval' |
| eval-type | the type of evaluation task to perform | str | ['original', 'grayscale', 'colorized'] | 'original' |

So one could, for example, train a cgan colorization model on the places365 dataset for 100 epochs by running:

python train.py \
  --experiment-name cgan_experiment001 \
  --model-name cgan \
  --dataset-name places365 \
  --max-epochs 100 \
  --train-batch-size 16 \
  --val-batch-size 16
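
After training, a model's colorizations could be scored with one of the evaluation tasks from the table above. The invocation below is illustrative only: it simply combines the documented options, and it assumes the image triplets required by the metric tasks have already been produced via the eval-gen task.

python train.py \
  --task eval-mse \
  --experiment-name cgan_experiment001 \
  --model-name cgan \
  --dataset-name places365 \
  --eval-root-path ./eval \
  --eval-type colorized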

Colorization Task

The task of colorizing an image can be considered a pixel-wise regression problem where the model input X is a 1xHxW tensor containing the pixels of the grayscale image and the model output Y' is a tensor of shape nxHxW that represents the predicted colorization information (n = 2 when predicting the a and b channels described below). Specifically, the task aims to discover a mapping F: X → Y' that plausibly predicts the colorization given the greyscale input.

The CIE L*a*b* colour space lends itself well to this task since the L channel depicts the brightness of the image (X above) and the image colour is fully captured in the remaining a and b channels (Y' above). The L*a*b* colour model also has the advantage of being inspired by human colour perception, meaning that distances in L*a*b* space can be expected to be correlated with changes in human colour perception. The final output colorized image is created by recombining the input L layer with the predicted a and b layers.
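
As a minimal illustration of this split and recombination (not part of the framework itself; it assumes NumPy and scikit-image are available, and 'example.jpg' is just a placeholder path):

import numpy as np
from skimage import color, io

# Load an RGB image and convert it to the CIE L*a*b* colour space.
rgb = io.imread('example.jpg') / 255.0   # H x W x 3, values in [0, 1]
lab = color.rgb2lab(rgb)                 # H x W x 3: (L, a, b)

# Model input: the lightness channel (the X above).
L = lab[:, :, 0:1]                       # L roughly in [0, 100]

# Model target: the two colour channels (the Y' above).
ab = lab[:, :, 1:3]

# A colorization model would predict ab_hat from L; here the ground truth is
# reused purely to show the recombination step.
ab_hat = ab
colorized = color.lab2rgb(np.concatenate([L, ab_hat], axis=2))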

Colorization Models

Three colorization architectures are currently supported in the framework.

ResNet Colorization Network

This architecture consists of a CNN that starts with a set of convolutional layers which aim to extract low-level and semantic features from the input images, inspired by how representations are learned in Learning Representations for Automatic Colorization. Based on the same idea as the VGG-16-Gray architecture in that paper, a modified version of the ResNet-18 image-classification network is used as a means to learn representations from a set of images. In particular, the network is modified so that it accepts greyscale images and, in addition, it is truncated to six layers. This set of layers is used to extract features from the images that are represented by their lightness channels. Subsequently, a series of deconvolutional layers is applied to increase the spatial resolution of (i.e. 'upscale') the features. This up-scaling of features learned in a network is inspired by the 'upsampling' of features in the colorization network of Let There Be Color!
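
To make the idea concrete, here is a rough PyTorch sketch along these lines; the exact layer choices, channel sizes and class name are assumptions for illustration, not the framework's precise implementation (see src/models.py for that):

import torch.nn as nn
from torchvision import models

class ResNetColorizer(nn.Module):
    # Truncated ResNet-18 encoder on the lightness channel, followed by
    # deconvolutional layers that upscale the features and predict a*b*.
    def __init__(self):
        super().__init__()
        resnet = models.resnet18()
        # Accept 1-channel (greyscale) input instead of 3-channel RGB.
        resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # Keep only the first six children of the classification network.
        self.encoder = nn.Sequential(*list(resnet.children())[:6])
        # Upsample ('upscale') the extracted features back to the input size.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 2, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):  # x: N x 1 x H x W lightness tensor
        return self.decoder(self.encoder(x))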

U-Net

This network is inspired by U-Net: Convolutional Networks for Biomedical Image Segmentation where direct connections are added between contracting and expanding layers of equal size to prevent the loss of spatial context of the original image throughout the layers. In Image Colorization with Generative Adversarial Networks an approach is proposed that uses such a network for colorization since the preservation of the original greyscale image is of particular importance to this task.

The network implemented in this framework has the same architecture as the one presented in the original U-Net paper, modified to take 224x224 inputs. Non-linearities are introduced by following the convolutional and deconvolutional layers with leaky ReLUs with a slope of 0.2. Furthermore, batch normalisation is applied after every layer.
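
A stripped-down sketch of the skip-connection idea is given below; the depth and channel sizes are illustrative only (the actual unet224 model in src/models.py follows the full U-Net layout):

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    # Minimal two-level U-Net-style network: one contracting stage, one
    # expanding stage, and a skip connection between layers of equal size.
    def __init__(self):
        super().__init__()
        self.down = nn.Sequential(           # 1 x H x W -> 64 x H/2 x W/2
            nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.LeakyReLU(0.2),
        )
        self.bottleneck = nn.Sequential(     # 64 x H/2 x W/2 -> 128 x H/4 x W/4
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2),
        )
        self.up = nn.Sequential(             # 128 -> 64, back to H/2 x W/2
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.LeakyReLU(0.2),
        )
        self.out = nn.ConvTranspose2d(64 + 64, 2, kernel_size=4, stride=2, padding=1)

    def forward(self, x):
        d = self.down(x)
        u = self.up(self.bottleneck(d))
        # Skip connection: concatenate encoder and decoder features of equal size,
        # preserving the spatial context of the original greyscale image.
        return self.out(torch.cat([u, d], dim=1))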

Conditional GAN (CGAN)

Recent research has demonstrated the potential of GAN architectures for image colorization tasks. One of the compelling aspects of using GANs is their ability to learn a loss function that is task-specific.

GANs consist of two networks: a generator and a discriminator. In the context of image colorization, the generator's task is to produce colorized images that are indistinguishable from real images. The discriminator's task is to classify whether a sample came from the generator or from the original set of images. Traditionally, the generator is represented by a mapping G: z → y, where z is a random noise variable which serves as the input of the generator. The discriminator is similarly represented by a mapping D: x → [0, 1], where x represents a real or synthetic input.

In the context of image colorization, the traditional GAN has to be modified into a Conditional GAN (CGAN) such that it takes image data as input instead of (random) noise. More specifically, the CGAN will take as input greyscale data (i.e. images represented by their lightness channel L in the L*a*b* colour space) and generate colorized images. The discriminator will be trained on both the generated colorized images and full-colour ground-truth images.

Formally, the main objective of the CGAN can be described by a single mini-max game problem:

min_G max_D V(D, G) = E_{x,y ~ p_data}[log D(x, y)] + E_{x ~ p_data}[log(1 - D(x, G(x)))]

where p_data represents the original image distribution. So informally, the generator tries to minimise the function by generating samples G(x) according to a mapping G that takes as input greyscale images x from the original data, while the discriminator tries to maximise the same function by trying to distinguish between real images y from the original data distribution and generated samples G(x).

In addition, the framework facilitates the addition of an L1-regularisation term in order to try to force the generator to produce results that are 'closer' (i.e. more similar) to images from the original data distribution. Theoretically, this should preserve the structure of the ground-truth images and, in addition, prevent the generator from producing images where it has given certain pixels or even whole image regions a random colour just to deceive the discriminator.
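
As a sketch of how such a combined generator objective might look in PyTorch (the weighting factor lambda_l1 and the way the discriminator is conditioned on the greyscale input are assumptions for illustration, not the framework's exact code):

import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # adversarial term (discriminator outputs raw logits)
l1 = nn.L1Loss()              # regularisation towards the ground-truth colours

def generator_loss(discriminator, grayscale, fake_ab, real_ab, lambda_l1=100.0):
    # Adversarial part: the generator wants the discriminator to label its
    # colorizations, conditioned on the greyscale input, as real.
    logits = discriminator(torch.cat([grayscale, fake_ab], dim=1))
    adversarial = bce(logits, torch.ones_like(logits))
    # L1 part: stay close to the ground-truth a*b* channels so the generator
    # cannot simply invent arbitrary colours to fool the discriminator.
    return adversarial + lambda_l1 * l1(fake_ab, real_ab)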

License: MIT License

