Image-to-Image Translation with GANs and Diffusion for Real-World Applications
JoliGAN provides easy-to-use GAN and diffusion models for unpaired and paired image-to-image translation tasks, including domain adaptation. It is designed for fast and stable training, and a server with a REST API is provided to simplify deployment and usage.
JoliGAN has a wide range of options and parameters. To avoid getting overwhelmed, follow the simple steps below, then see the more detailed documentation on models, dataset formats, and data augmentation.
Use cases
- AR and metaverse: replace any image element with super-realistic objects
- Image manipulation: seamlessly insert or remove objects/elements in images
- Image to image translation while preserving semantics, e.g. existing source dataset annotations
- Simulation to reality translation while preserving elements, metrics, ...
- Image to image translation to cope with scarce data
This is achieved by combining powerful and customized generator architectures, bags of discriminators, and configurable neural networks and losses that ensure conservation of fundamental elements between source and target images.
Example results
Image translation while preserving the class
Mario to Sonic while preserving the action (running, jumping, ...)
Object insertion
Car insertion (BDD100K) with Diffusion
Object removal
Glasses removal with GANs
AR
Real-time ring virtual try-on with GANs
Style transfer while preserving label boxes (e.g. cars, pedestrians, street signs, ...)
Day to night (BDD100K) with Transformers and GANs
Clear to snow (BDD100K) by applying a generator multiple times to add snow incrementally
Features
- SoTA image to image translation
- Semantic consistency: conservation of labels of many types: bounding boxes, masks, classes.
- SoTA discriminator models: projected, vision_aided, custom transformers.
- Advanced generators: real-time, transformers, hybrid transformers-CNN, Attention-based, UNet with attention, StyleGAN2
- Multiple models based on adversarial and diffusion generation: CycleGAN, CyCADA, CUT, Palette
- GAN data augmentation mechanisms: APA, discriminator noise injection, standard image augmentation, online augmentation through sampling around bounding boxes
- Output quality metrics: FID
- Server with REST API
- Support for both CPU and GPU
- Dockerized server
- Production-grade deployment in C++ via DeepDetect
Quick Start
Prerequisites
- Linux
- Python 3
- CPU or NVIDIA GPU + CUDA CuDNN
Installation
Clone this repo:
git clone --recursive https://github.com/jolibrain/joliGAN.git
cd joliGAN
Install PyTorch and other dependencies (torchvision, visdom) with:
pip install -r requirements.txt --upgrade
Dataset formats
Image to image without semantics
Example: horse to zebra from two sets of images
Dataset: https://www.deepdetect.com/joligan/datasets/horse2zebra.zip
horse2zebra/
horse2zebra/trainA # horse images
horse2zebra/trainB # zebra images
horse2zebra/testA
horse2zebra/testB
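Before launching a training run, it can help to confirm that a dataset follows the layout above. The following is a minimal sketch, not part of JoliGAN: the function name `count_images` and the list of accepted extensions are assumptions made for illustration.

```python
from pathlib import Path

# Extensions treated as images here; an assumption, not JoliGAN's own list.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images(dataroot):
    """Count image files in each expected subdirectory of an unaligned dataset.

    Returns a dict mapping subdirectory name to image count.
    A missing subdirectory is reported as None.
    """
    counts = {}
    for name in ("trainA", "trainB", "testA", "testB"):
        folder = Path(dataroot) / name
        if not folder.is_dir():
            counts[name] = None
            continue
        counts[name] = sum(
            1
            for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

Calling `count_images("/path/to/horse2zebra")` returns one entry per subdirectory, so a `None` or `0` value points at a missing or empty folder.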
Image to image with class semantics
Example: font number conversion
Dataset: https://www.deepdetect.com/joligan/datasets/mnist2USPS.zip
mnist2USPS/
mnist2USPS/trainA
mnist2USPS/trainA/0 # images of number 0
mnist2USPS/trainA/1 # images of number 1
mnist2USPS/trainA/2 # images of number 2
...
mnist2USPS/trainB
mnist2USPS/trainB/0 # images of target number 0
mnist2USPS/trainB/1 # images of target number 1
mnist2USPS/trainB/2 # images of target number 2
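With class semantics, each domain holds one subdirectory per class. The sketch below compares the class subdirectories of the two domains. It assumes that both domains are expected to share the same class names, which the layout above suggests but does not state; the function names are hypothetical.

```python
from pathlib import Path


def class_names(domain_dir):
    """Return the set of class subdirectory names found under one domain."""
    return {p.name for p in Path(domain_dir).iterdir() if p.is_dir()}


def class_mismatch(dataroot, domain_a="trainA", domain_b="trainB"):
    """Return (classes only in A, classes only in B) as sorted lists."""
    a = class_names(Path(dataroot) / domain_a)
    b = class_names(Path(dataroot) / domain_b)
    return sorted(a - b), sorted(b - a)
```

Two empty lists mean that both domains expose the same classes.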
Image to image with mask semantics
Example: Add glasses to a face without modifying the rest of the face
Dataset: https://www.deepdetect.com/joligan/datasets/noglasses2glasses_ffhq_mini.zip
Full dataset: https://www.deepdetect.com/joligan/datasets/noglasses2glasses_ffhq.zip
noglasses2glasses_ffhq_mini
noglasses2glasses_ffhq_mini/trainA
noglasses2glasses_ffhq_mini/trainA/img
noglasses2glasses_ffhq_mini/trainA/img/0000.png # source image, e.g. face without glasses
...
noglasses2glasses_ffhq_mini/trainA/bbox
noglasses2glasses_ffhq_mini/trainA/bbox/0000.png # source mask, e.g. mask around eyes
...
noglasses2glasses_ffhq_mini/trainA/paths.txt # list of associated source / mask images
noglasses2glasses_ffhq_mini/trainB
noglasses2glasses_ffhq_mini/trainB/img
noglasses2glasses_ffhq_mini/trainB/img/0000.png # target image, e.g. face with glasses
...
noglasses2glasses_ffhq_mini/trainB/bbox
noglasses2glasses_ffhq_mini/trainB/bbox/0000.png # target mask, e.g. mask around glasses
...
noglasses2glasses_ffhq_mini/trainB/paths.txt # list of associated target / mask images
Image to image with bounding box semantics
Example: Super Mario to Sonic while preserving the position and action, e.g. crouch, jump, still, ...
Dataset: https://www.deepdetect.com/joligan/datasets/online_mario2sonic_lite.zip
Full dataset: https://www.deepdetect.com/joligan/datasets/online_mario2sonic_full.tar
online_mario2sonic_lite
online_mario2sonic_lite/mario
online_mario2sonic_lite/mario/bbox
online_mario2sonic_lite/mario/bbox/r_mario_frame_19538.jpg.txt # contains bboxes, see format below
online_mario2sonic_lite/mario/imgs
online_mario2sonic_lite/mario/imgs/mario_frame_19538.jpg
online_mario2sonic_lite/mario/all.txt # list of associated source image / bbox file
...
online_mario2sonic_lite/sonic
online_mario2sonic_lite/sonic/bbox
online_mario2sonic_lite/sonic/bbox/r_sonic_frame_81118.jpg.txt
online_mario2sonic_lite/sonic/imgs
online_mario2sonic_lite/sonic/imgs/sonic_frame_81118.jpg
online_mario2sonic_lite/sonic/all.txt # list of associated target image / bbox file
...
online_mario2sonic_lite/trainA
online_mario2sonic_lite/trainA/paths.txt # symlink to ../mario/all.txt
online_mario2sonic_lite/trainB
online_mario2sonic_lite/trainB/paths.txt # symlink to ../sonic/all.txt
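In the layout above, `trainA/paths.txt` and `trainB/paths.txt` are symlinks to the per-domain `all.txt` files. A minimal sketch of creating such a link from Python follows; the helper name `link_paths_file` is hypothetical, and an equivalent `ln -s` command would do the same job.

```python
import os
from pathlib import Path


def link_paths_file(dataroot, split, domain):
    """Create <dataroot>/<split>/paths.txt as a symlink to ../<domain>/all.txt.

    The link target is relative, so the dataset directory stays relocatable.
    """
    split_dir = Path(dataroot) / split
    split_dir.mkdir(parents=True, exist_ok=True)
    link = split_dir / "paths.txt"
    # Replace a stale link or file so the call can be repeated safely.
    if link.is_symlink() or link.exists():
        link.unlink()
    os.symlink(os.path.join("..", domain, "all.txt"), link)
    return link
```

For the dataset above, this would be called as `link_paths_file(root, "trainA", "mario")` and `link_paths_file(root, "trainB", "sonic")`.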
List file format:
cat online_mario2sonic_lite/mario/all.txt
mario/imgs/mario_frame_19538.jpg mario/bbox/r_mario_frame_19538.jpg.txt
Bounding box format, e.g. r_mario_frame_19538.jpg.txt:
2 132 167 158 218
in this order:
cls xmin ymin xmax ymax
where cls is the class; in this dataset, 2 means running.
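The bounding box file format described above can be read with a few lines of Python. This is a minimal sketch, not JoliGAN's own loader: it assumes one box per line, whitespace-separated integer fields in the order `cls xmin ymin xmax ymax`, and the names `BBox` and `parse_bbox_text` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BBox:
    class_id: int  # the "cls" field, e.g. 2 means running in the Mario dataset
    xmin: int
    ymin: int
    xmax: int
    ymax: int


def parse_bbox_text(text: str) -> List[BBox]:
    """Parse the content of a bbox file, one 'cls xmin ymin xmax ymax' row per line."""
    boxes = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            raise ValueError(f"line {line_number}: expected 5 fields, got {len(fields)}")
        boxes.append(BBox(*(int(field) for field in fields)))
    return boxes
```

For the sample line above, `parse_bbox_text("2 132 167 158 218")` yields a single box with `class_id` 2, `xmin` 132, `ymin` 167, `xmax` 158 and `ymax` 218.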
Image to image with multiple semantics: bounding box and class
Example: Image seasonal modification while preserving objects with mask (cars, pedestrians, ...) and overall image weather (snow, rain, clear, ...) with class
Dataset: https://www.deepdetect.com/joligan/datasets/daytime2dawn_dusk_lite.zip
daytime2dawn_dusk_lite
daytime2dawn_dusk_lite/dawn_dusk
daytime2dawn_dusk_lite/dawn_dusk/img
daytime2dawn_dusk_lite/dawn_dusk/mask
daytime2dawn_dusk_lite/daytime
daytime2dawn_dusk_lite/daytime/img
daytime2dawn_dusk_lite/daytime/mask
daytime2dawn_dusk_lite/trainA
daytime2dawn_dusk_lite/trainA/paths.txt
daytime2dawn_dusk_lite/trainB
daytime2dawn_dusk_lite/trainB/paths.txt
paths.txt format:
cat trainA/paths.txt
daytime/img/00054602-3bf57337.jpg 2 daytime/mask/00054602-3bf57337.png
in this order: source image path, image class, image mask, where image class in this dataset represents the weather class.
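A line of this `paths.txt` file can be split into its three fields as sketched below. The sketch assumes whitespace-separated fields, paths without spaces, and an integer class; the function name `parse_paths_line` is hypothetical and this is not JoliGAN's own parser.

```python
def parse_paths_line(line):
    """Split one paths.txt line into (image_path, image_class, mask_path)."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    image_path, image_class, mask_path = fields
    return image_path, int(image_class), mask_path
```

Applied to the sample line above, it returns the image path, the class `2`, and the mask path, in that order.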
Other semantics
Other semantics are possible, i.e. any algorithm that runs on both source and target images.
JoliGAN training
Training requires the following:
- GPU
- a checkpoints directory, to be specified, in which model weights are stored
- a Visdom server; by default, the training script starts a Visdom server on http://0.0.0.0:8097 if none is running
- Go to http://localhost:8097 to follow training losses and image result samples
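To see whether something is already listening on the Visdom port before starting a run, a plain TCP connection attempt is enough. This is a minimal sketch using only the standard library; the function name `is_port_open` is hypothetical, and the check only confirms that the port accepts connections, not that the listener is actually Visdom.

```python
import socket


def is_port_open(host="localhost", port=8097, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

`is_port_open()` with its defaults probes the address given above, localhost on port 8097.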
JoliGAN has (too) many options; for finer-grained control, see the full option list.
Training image to image without semantics
Modify as required and run with the following command line:
python3 train.py --dataroot /path/to/horse2zebra --checkpoints_dir /path/to/checkpoints --name horse2zebra \
--output_display_env horse2zebra --data_load_size 256 --data_crop_size 256 --train_n_epochs 200 \
--data_dataset_mode unaligned --train_n_epochs_decay 0 --model_type cut --G_netG mobile_resnet_attn
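Because the training commands carry many options, it can be convenient to keep them in a dictionary and assemble the command line from it. The following is a minimal sketch of that idea and not a JoliGAN utility: the helper `build_train_command` is hypothetical, and it assumes that `True` marks a switch without a value and that lists map to multi-value options.

```python
def build_train_command(options, script="train.py"):
    """Turn a dict of option names and values into an argument list for subprocess."""
    command = ["python3", script]
    for name, value in options.items():
        flag = "--" + name
        if value is True:
            command.append(flag)  # switch without a value
        elif value is False or value is None:
            continue  # option left out entirely
        elif isinstance(value, (list, tuple)):
            command.append(flag)
            command.extend(str(item) for item in value)  # multi-value option
        else:
            command.extend([flag, str(value)])
    return command
```

The resulting list can be passed to `subprocess.run`, for example with `{"dataroot": "/path/to/horse2zebra", "model_type": "cut", "G_netG": "mobile_resnet_attn"}` as a starting point.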
Training with class semantics:
python3 train.py --dataroot /path/to/mnist2USPS --checkpoints_dir /path/to/checkpoints --name mnist2USPS \
--output_display_env mnist2USPS --data_load_size 180 --data_crop_size 180 --train_n_epochs 200 \
--data_dataset_mode unaligned_labeled_cls --train_n_epochs_decay 0 --model_type cut --cls_semantic_nclasses 10 \
--train_sem_use_label_B --train_semantic_cls --dataaug_no_rotate --dataaug_D_noise 0.001 \
--G_netG mobile_resnet_attn
Training with mask semantics:
python3 train.py --dataroot /path/to/noglasses2glasses_ffhq/ --checkpoints_dir /path/to/checkpoints/ \
--name noglasses2glasses --output_display_env noglasses2glasses --output_display_freq 200 --output_print_freq 200 \
--train_G_lr 0.0002 --train_D_lr 0.0001 --train_sem_lr_f_s 0.0002 --data_crop_size 256 --data_load_size 256 \
--data_dataset_mode unaligned_labeled_mask --model_type cut --train_semantic_mask --train_batch_size 2 \
--train_iter_size 1 --model_input_nc 3 --model_output_nc 3 --f_s_net unet --train_mask_f_s_B \
--train_mask_out_mask --f_s_semantic_nclasses 2 --G_netG mobile_resnet_attn --alg_cut_nce_idt \
--train_sem_use_label_B --D_netDs projected_d basic vision_aided --D_proj_interp 256 \
--D_proj_network_type efficientnet --train_G_ema --G_padding_type reflect --dataaug_no_rotate \
--data_relative_paths
Training with bounding box semantics and online sampling around boxes as data augmentation:
python3 train.py --dataroot /path/to/online_mario2sonic/ --checkpoints_dir /path/to/checkpoints/ \
--name mario2sonic --output_display_env mario2sonic --output_display_freq 200 --output_print_freq 200 \
--train_G_lr 0.0002 --train_D_lr 0.0001 --train_sem_lr_f_s 0.0002 --data_crop_size 128 --data_load_size 180 \
--data_dataset_mode unaligned_labeled_mask_online --model_type cut --train_semantic_m --train_batch_size 2 \
--train_iter_size 1 --model_input_nc 3 --model_output_nc 3 --f_s_net unet --train_mask_f_s_B \
--train_mask_out_mask --data_online_creation_crop_size_A 128 --data_online_creation_crop_delta_A 50 \
--data_online_creation_mask_delta_A 50 --data_online_creation_crop_size_B 128 \
--data_online_creation_crop_delta_B 15 --data_online_creation_mask_delta_B 15 \
--f_s_semantic_nclasses 2 --G_netG segformer_attn_conv \
--G_config_segformer models/configs/segformer/segformer_config_b0.py --alg_cut_nce_idt --train_sem_use_label_B \
--D_netDs projected_d basic vision_aided --D_proj_interp 256 --D_proj_network_type vitsmall \
--train_G_ema --G_padding_type reflect --dataaug_no_rotate --data_relative_paths
Training object insertion:
Trains a diffusion model to insert glasses onto faces.
python3 train.py --dataroot /path/to/noglasses2glasses_ffhq/ --checkpoints_dir /path/to/checkpoints/ \
--name noglasses2glasses --data_direction BtoA --output_display_env noglasses2glasses --gpu_ids 0,1 \
--model_type palette --train_batch_size 4 --train_iter_size 16 --model_input_nc 3 --model_output_nc 3 \
--data_relative_paths --train_G_ema --train_optim radam --data_dataset_mode self_supervised_labeled_mask \
--data_load_size 256 --data_crop_size 256 --G_netG unet_mha --data_online_creation_rand_mask_A \
--train_G_lr 0.00002 --train_n_epochs 400 --dataaug_no_rotate --output_display_freq 10000 \
--train_optim adamw --G_nblocks 2
JoliGAN inference
JoliGAN reads the model configuration from a generated train_config.json file that is stored in the model directory. When loading a previously trained model, make sure the train_config.json file is in the directory.
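A quick way to catch a missing configuration file before running inference is to look for it next to the model weights. This is a minimal sketch with hypothetical helper names; it makes no assumption about the keys stored in train_config.json and simply returns the parsed JSON.

```python
import json
from pathlib import Path


def find_train_config(model_file):
    """Return the path of train_config.json next to the model file, or None if absent."""
    config_path = Path(model_file).parent / "train_config.json"
    return config_path if config_path.is_file() else None


def load_train_config(model_file):
    """Load and return the parsed train_config.json stored alongside the model file."""
    config_path = find_train_config(model_file)
    if config_path is None:
        raise FileNotFoundError(
            f"train_config.json not found in {Path(model_file).parent}"
        )
    with open(config_path, "r", encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `find_train_config("/path/to/model/latest_net_G_A.pth")` returns `None` when the file was not copied along with the weights.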
Python scripts are provided for inference; they can serve as a baseline for using a model in another codebase.
Generate an image with a GAN generator model
cd scripts
python3 gen_single_image.py --model-in-file /path/to/model/latest_net_G_A.pth \
--img-in /path/to/source.jpg --img-out target.jpg
Generate an image with a diffusion model
Using a pretrained glasses insertion model (see above):
python3 gen_single_image_diffusion.py --model-in-file /path/to/model/latest_net_G_A.pth --img-in /path/to/source.jpg \
--mask-in /path/to/mask.jpg --img-out target.jpg --img-size 256
In the mask image, pixels are 1 where the object should be inserted and 0 elsewhere.
JoliGAN server
Ensure everything is installed:
pip install fastapi uvicorn
Then run the server:
server/run.sh --host localhost --port 8000
Unit tests
To launch tests before new commits:
bash scripts/run_tests.sh /path/to/dir
Models
Name | Paper |
---|---|
CycleGAN | https://arxiv.org/abs/1703.10593 |
CyCADA | https://arxiv.org/abs/1711.03213 |
CUT | https://arxiv.org/abs/2007.15651 |
RecycleGAN | https://arxiv.org/abs/1808.05174 |
StyleGAN2 | https://arxiv.org/abs/1912.04958 |
Generator architectures
Architecture | Number of parameters |
---|---|
Resnet 9 blocks | 11.378M |
Mobile resnet 9 blocks | 1.987M |
Resnet attn | 11.823M |
Mobile resnet attn | 2.432M |
Segformer b0 | 4.158M |
Segformer attn b0 | 4.60M |
Segformer attn b1 | 14.724M |
Segformer attn b5 | 83.016M |
UNet with mha | ~60M configurable |
ITTR | ~30M configurable |
Docker build
To build the Docker images for the JoliGAN server:
docker build -t jolibrain/joligan_build -f docker/Dockerfile.build .
docker build -t jolibrain/joligan_server -f docker/Dockerfile.server .
To run the JoliGAN Docker image:
nvidia-docker run jolibrain/joligan_server
Code format
If you want to contribute, please format your code with black. Install:
pip install black
Usage:
black .
If you want to format the code automatically before every commit:
pip install pre-commit
pre-commit install
Authors
JoliGAN is created and maintained by Jolibrain.
The code makes use of pytorch-CycleGAN-and-pix2pix, CUT, AttentionGAN, and MoNCE, among others.