dleporis / VGIS_841_ImageSynthesis


Image synthesis for action localisation

This project was developed during the spring semester of 2022 at Aalborg University. Its purpose was to develop a three-step pipeline for action localisation in the thermal domain:

  • Step 1: Style transfer from RGB to thermal domain.
  • Step 2: Action localisation in RGB domain.
  • Step 3: Domain adaptation for action localisation in thermal domain.

At present, work has been carried out on step 1 and step 2. The project was developed and tested on Windows Subsystem for Linux (WSL) running Ubuntu 20.04.

All models were trained and tested on an NVIDIA GeForce RTX 2070 with 8 GB of memory.

Dependencies and installation

To install this repository, run the following:

git clone https://github.com/dleporis/VGIS_841_ImageSynthesis.git
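Then change into the repository directory:

cd VGIS_841_ImageSynthesis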

In order to run this project, the following dependencies need to be installed:

  • PyTorch and the CUDA toolkit
  • scipy
  • fvcore
  • numpy
  • Pillow
  • visdom
  • dominate
  • torchvision
  • OpenCV
  • wandb
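If you prefer pip over conda, the same dependencies can be installed roughly as follows (PyPI package names are assumed here; for PyTorch with CUDA support, follow the official PyTorch installation selector instead):

pip install torch torchvision scipy fvcore numpy Pillow visdom dominate opencv-python wandb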

An Anaconda environment.yml with all the dependencies is provided for easy installation of this repository:

conda env create -f environment.yml

To create separate environments for Pix2Pix, CycleGAN and YOWO, run the following command:

conda env create -f env/environment_<name>.yml
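After creating an environment, activate it before running any scripts (the environment name is defined in the corresponding .yml file):

conda activate <name>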

Style transfer

Style transfer is conducted with Pix2Pix and CycleGAN, based on the original pytorch-CycleGAN-and-pix2pix implementation.

Modifications have been made to base_options.py (lines 53-61) to expose augmentation parameters in the dataloader.
In base_dataset.py (lines 100-102), a transform has been added that augments the images using the torchvision transform functions RandomAutocontrast and GaussianBlur.
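As a rough illustration, a minimal sketch of such an augmentation pipeline is shown below; this is not the exact code added to base_dataset.py, and the probability, kernel size and sigma range are assumed values:

```python
import torchvision.transforms as transforms

# Augmentations approximating those added in base_dataset.py:
# random autocontrast followed by a mild random Gaussian blur.
augment = transforms.Compose([
    transforms.RandomAutocontrast(p=0.5),                       # applied to half of the images
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),   # sigma sampled from [0.1, 2.0]
])

# This transform can be composed with the usual resize/crop/ToTensor
# steps before images are handed to the dataloader.
```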

In networks.py (lines 148-154), more network generators have been added: ResNet18, ResNet32 and ResNet36. Training and testing of the style transfer models (Pix2Pix and CycleGAN) follows the official pytorch-CycleGAN-and-pix2pix README.md, using a batch size of 1.
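For reference, typical training calls look like the following (the command-line options come from the upstream repository; the dataset path and experiment name are placeholders):

python train.py --dataroot ./datasets/<dataset> --name <experiment> --model pix2pix --batch_size 1
python train.py --dataroot ./datasets/<dataset> --name <experiment> --model cycle_gan --batch_size 1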

Style transfer outputs

Figure: comparison study between Pix2Pix and CycleGAN synthetic images on images from KAIST.
Top row: input RGB images.
Second row: ground-truth thermal images.
Third row: Pix2Pix thermal images.
Fourth row: CycleGAN thermal images.

In its current state the style transfer models are not optimal; more data and larger batch sizes are needed to obtain models that generalise better.

Action localisation

Action localisation is performed by training a YOWO model, following the original implementation of You Only Watch Once (YOWO).

Due to the limited memory of the GPU used, the model was trained on the UCF101-24 dataset with a batch size of 4 for 5 epochs. Installation instructions for YOWO are provided in its README.md.

Datasets and trained models

We have made the trained style transfer and action localisation models available to download as a zip file on Google Drive: Trained Models.

The datasets used for training and testing CycleGAN and Pix2Pix can be found in a folder on the same Google Drive: datasets.
The datasets for the style transfer models were created following the guide from the original Pix2Pix implementation here.
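For paired (Pix2Pix) data, the upstream repository includes a script that merges aligned image folders side by side; a typical invocation looks like this (folder paths are placeholders):

python datasets/combine_A_and_B.py --fold_A /path/to/RGB --fold_B /path/to/thermal --fold_AB /path/to/combined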

The created datasets contain the following number of images:

| Dataset | Number of images | Source datasets | Paired/Unpaired |
| --- | --- | --- | --- |
| kaist-test-set-v2 | 1617 | RGB: KAIST; Thermal: KAIST | Paired |
| pix2pix-training-data | 4162 | RGB: KAIST; Thermal: KAIST | Paired |
| cycleGAN-ucfjhmdb | 2003 | RGB: UCF101-24, JHMDB; Thermal: KAIST, LTD | Unpaired |
| cycleGAN-kaist | 2003 | RGB: KAIST; Thermal: KAIST, LTD | Unpaired |
