faceswap-GAN

A GAN model built upon deepfakes' autoencoder for face swapping.

Adding adversarial loss and perceptual loss (VGGFace) to deepfakes' autoencoder architecture.

Descriptions

  • FaceSwap_GAN_github.ipynb: This Jupyter notebook does the following:

    1. Build a GAN model.
    2. Train the GAN from scratch.
    3. Detect faces in an image using dlib's CNN model.
    4. Use the GAN to transform each detected face into the target face.
    5. Use the moviepy module to output a video clip with the swapped face (see the first sketch after this list).
  • dlib_video_face_detection.ipynb: This Jupyter notebook does the following:

    1. Detect and crop faces in a video using dlib's CNN model (see the second sketch after this list).
    2. Pack the cropped face images into a zip file.
  • Training data: Training images should be placed in the ./TE/ and ./SH/ folders, one for each target identity. Face images can be of any size.
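
The video-swapping step can be wired with moviepy roughly as follows. This is a minimal sketch in which swap_face is a hypothetical helper wrapping face detection and the trained generator, not a function from this repository:

```python
from moviepy.editor import VideoFileClip

def process_frame(frame):
    # `swap_face` is a hypothetical helper wrapping detection + the trained
    # generator; it takes and returns an RGB numpy array.
    return swap_face(frame)

clip = VideoFileClip("input_video.mp4")     # illustrative input path
swapped = clip.fl_image(process_frame)      # apply the transform per frame
swapped.write_videofile("output_video.mp4", audio=False)
```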
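
The face detection and cropping steps can be sketched with dlib's CNN detector as below; the weight file path and output naming are illustrative, and the pretrained mmod_human_face_detector.dat must be downloaded separately:

```python
import cv2
import dlib

# Pretrained CNN detector weights, downloaded separately from dlib's site.
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

img = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)
detections = detector(img, 1)  # upsample once to catch smaller faces

for i, det in enumerate(detections):
    r = det.rect  # the CNN detector wraps its rectangle in an mmod_rectangle
    face = img[max(r.top(), 0):r.bottom(), max(r.left(), 0):r.right()]
    cv2.imwrite("face_%d.jpg" % i, cv2.cvtColor(face, cv2.COLOR_RGB2BGR))
```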

Results

Below are results of trained models transforming Hinako Sano (佐野ひなこ, left) into Emi Takei (武井咲, right).

1. Autoencoder

Autoencoder based on deepfakes' script. It should be mentioned that the autoencoder (AE) results could be much better if it were trained longer.

[GIF: AE results]

2. Generative Adversarial Network, GAN (adding VGGFace perceptual loss)

Adversarial loss improves the resolution of generated images. In addition, when perceptual loss is applied, the movement of the eyeballs becomes more realistic and consistent with the input face.

[GIF: GAN + PL results]

Perceptual loss (PL): The following figure shows the nuanced eyeball direction in the output of models trained with and without PL (a hedged sketch of such a loss follows the figure below).

[Figure: PL comparison]
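
A minimal sketch of such a VGGFace perceptual loss is given below, assuming the keras_vggface package; the chosen feature layers are illustrative, and the [.01, .1, .1] weights echo the Notes section rather than the exact training code:

```python
from keras_vggface.vggface import VGGFace
from keras.models import Model
import keras.backend as K

# Frozen VGGFace as a feature extractor; inputs are assumed to be
# face crops resized to 224x224 and preprocessed accordingly.
vggface = VGGFace(include_top=False, input_shape=(224, 224, 3))
vggface.trainable = False

# Three intermediate activations (layer indices are illustrative choices).
feat_model = Model(vggface.input,
                   [vggface.layers[i].output for i in (6, 10, 14)])

def perceptual_loss(y_real, y_fake, weights=(0.01, 0.1, 0.1)):
    """Weighted L1 distance between VGGFace features of real and fake faces."""
    loss = 0.0
    for w, fr, ff in zip(weights, feat_model(y_real), feat_model(y_fake)):
        loss += w * K.mean(K.abs(fr - ff))
    return loss
```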

Smoothed bounding box: An exponential moving average of the bounding box position over frames is introduced to eliminate jitter on the swapped face. See the GIF below for a comparison; a sketch of the smoothing follows the panel list. (Updated Dec. 29, 2017)

[GIF: bounding box comparison]

  • A. Source face
  • B. Swapped face, using smoothing mask
  • C. Swapped face, using smoothing mask and face alignment
  • D. Swapped face, using smoothing mask and smoothed bounding box
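
A minimal sketch of this exponential smoothing; the alpha value and all variable names here are hypothetical, not taken from the repository:

```python
def smooth_bbox(prev_bbox, new_bbox, alpha=0.65):
    """Blend the previous (x0, y0, x1, y1) box with the newly detected one."""
    if prev_bbox is None:
        return new_bbox
    return tuple(alpha * p + (1.0 - alpha) * n
                 for p, n in zip(prev_bbox, new_bbox))

smoothed = None
for bbox in detections_per_frame:  # hypothetical per-frame detections
    smoothed = smooth_bbox(smoothed, bbox)
    # paste the swapped face using `smoothed` instead of the raw `bbox`
```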

WIP

Mask generation: The model learns a mask that helps handle occlusion (a sketch of the compositing follows the list below).

[Figures: mask examples]

  • Left: Source face
  • Middle: Swapped face, before masking
  • Right: Swapped face, after masking
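
Conceptually, the learned mask acts as a per-pixel alpha channel. A minimal sketch, assuming the generator emits an RGB face plus a single-channel mask in [0, 1]:

```python
import numpy as np

def composite(src_face, gen_face, mask):
    """mask == 1 keeps the generated pixel; mask == 0 keeps the source pixel."""
    mask3 = np.repeat(mask, 3, axis=-1)  # (H, W, 1) -> (H, W, 3)
    return mask3 * gen_face + (1.0 - mask3) * src_face
```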

Mask visualization: Make video clips that show the mask heatmap and face bounding box (a sketch of the overlay follows the list below).

[GIF: mask visualization]

  • Left: Source face
  • Middle: Swapped face, after masking
  • Right: Mask heatmap & face bounding box
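
A minimal sketch of such an overlay using OpenCV; the blending weights and color map are illustrative choices, not the repository's:

```python
import cv2
import numpy as np

def visualize(frame, mask, bbox):
    """Overlay a mask heatmap and a bounding box on a BGR frame."""
    heat = cv2.applyColorMap((mask * 255).astype(np.uint8), cv2.COLORMAP_JET)
    heat = cv2.resize(heat, (frame.shape[1], frame.shape[0]))
    vis = cv2.addWeighted(frame, 0.6, heat, 0.4, 0)  # blend heatmap over frame
    x0, y0, x1, y1 = [int(v) for v in bbox]
    cv2.rectangle(vis, (x0, y0), (x1, y1), (0, 255, 0), 2)
    return vis
```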

Requirements

Notes:

  1. BatchNorm/InstanceNorm: Caused input/output skin color inconsistency when the two training datasets had different skin color distributions (lighting conditions, shadows, etc.).
  2. Increasing the perceptual loss weighting factor (to 1) destabilized training, but the weighting [.01, .1, .1] I used is not optimal either.
  3. In the encoder architecture, flattening the Conv2D output and shrinking it to Dense(1024) is crucial for the model to learn semantic features, i.e., a face representation (a sketch follows these notes). If we used Conv layers only (which means a larger dimension), would it learn features more like visual descriptors? (source paper, last paragraph of sec. 3.1)
  4. Transforming Emi Takei into Hinako Sano gave suboptimal results, due to imbalanced training data: over 65% of the images of Hinako Sano came from the same video series.
  5. The mixup technique (arXiv) and a least-squares loss function (arXiv) are adopted for training the GAN; however, I did not run any ablation experiments on them, so I don't know how much impact they had on the outputs (a sketch of both follows these notes).
  6. Since human faces are not perfectly symmetric, should we remove random flipping from data augmentation so that the model learns better features? Maybe the generated faces would then look more like the target.
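
As a companion to note 3, here is a minimal Keras sketch of the flatten-to-Dense(1024) bottleneck; the input size and filter counts are illustrative, not the repository's exact values:

```python
from keras.layers import Input, Conv2D, Flatten, Dense, Reshape
from keras.models import Model

inp = Input(shape=(64, 64, 3))
x = Conv2D(128, 5, strides=2, padding='same', activation='relu')(inp)
x = Conv2D(256, 5, strides=2, padding='same', activation='relu')(x)
x = Conv2D(512, 5, strides=2, padding='same', activation='relu')(x)
x = Flatten()(x)               # collapse spatial structure entirely
x = Dense(1024)(x)             # the semantic bottleneck discussed in note 3
x = Dense(4 * 4 * 1024)(x)     # expand back toward a spatial tensor
x = Reshape((4, 4, 1024))(x)
encoder = Model(inp, x)
```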
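
And for note 5, a minimal sketch of mixup and the least-squares loss; the alpha value and label convention are illustrative:

```python
import numpy as np
import keras.backend as K

def mixup(real_batch, fake_batch, alpha=0.2):
    """Convex combination of real/fake images with a matching soft label."""
    lam = np.random.beta(alpha, alpha)           # mixing coefficient
    mixed = lam * real_batch + (1.0 - lam) * fake_batch
    labels = np.full((len(real_batch), 1), lam)  # 1 = real, 0 = fake
    return mixed, labels

def ls_loss(y_true, y_pred):
    # Least-squares GAN loss: squared distance from the (soft) target.
    return K.mean(K.square(y_pred - y_true))
```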

TODO

  1. Use a Kalman filter to track the bounding box.

Acknowledgments

Code borrows from tjwei and deepfakes. The generative network is adapted from CycleGAN.
