faceswap-GAN

Adding Adversarial loss and perceptual loss (VGGface) to deepfakes' auto-encoder architecture.

News

Date        Update
2018-02-07  Video-making: Automatically downscale image resolution for face detection to prevent OOM errors. This does not affect the output video resolution. (Target notebooks: v2_sz128_train, v2_train, and v2_test_video)

Descriptions

GAN-v1

GAN-v2

  • FaceSwap_GAN_v2_train.ipynb: Detailed training procedures can be found in this notebook.

    1. Build and train a GAN model.
    2. Use moviepy module to output a video clip with swapped face.
  • FaceSwap_GAN_v2_test_img.ipynb: Provides a swap_face() function that requires less VRAM.

    1. Load trained model.
    2. Do single image face swapping.
  • FaceSwap_GAN_v2_test_video.ipynb

    1. Load trained model.
    2. Use moviepy module to output a video clip with swapped face.
  • faceswap_WGAN-GP_keras_github.ipynb

    • This notebook contains a GAN model class using WGAN-GP.
    • Perceptual loss is discarded for simplicity.
    • The WGAN-GP model gave results similar to the LSGAN model after a comparable number (~18k) of generator updates. A sketch of the gradient penalty term is given after this list.
    gan = FaceSwapGAN() # instantiate the class
    gan.train(max_iters=10e4, save_interval=500) # start training
  • FaceSwap_GAN_v2_sz128_train.ipynb

    • Input and output images have shape (128, 128, 3).
    • Minor updates on the architectures:
      1. Add instance normalization to the generators and discriminators (see the sketch after this list).
      2. Add an additional regression loss (MAE loss) on the 64x64 branch output.
    • Not compatible with the _test_video and _test_img notebooks above.
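
For reference, below is a minimal sketch of the WGAN-GP gradient penalty term in Keras. The function name, the penalty weight, and the way the interpolated samples are produced are assumptions for illustration, not the notebook's exact implementation.

    from keras import backend as K

    def gradient_penalty_loss(y_true, y_pred, averaged_samples, weight=10.0):
        # y_pred: critic scores on images interpolated between real and fake samples.
        # The penalty pushes the critic's gradient norm w.r.t. those interpolates towards 1.
        gradients = K.gradients(y_pred, averaged_samples)[0]
        grad_norm = K.sqrt(K.sum(K.square(gradients), axis=[1, 2, 3]) + K.epsilon())
        return weight * K.mean(K.square(grad_norm - 1.0))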
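For the sz128 update, one way to add instance normalization is the InstanceNormalization layer from the keras-contrib package (credited in the Acknowledgments below). The conv-block layout here is only an illustrative assumption about where the layer sits, not the notebook's exact architecture.

    from keras.layers import Conv2D, Input, LeakyReLU
    from keras.models import Model
    from keras_contrib.layers import InstanceNormalization

    def conv_block(x, filters):
        # Strided convolution followed by instance normalization, then LeakyReLU.
        x = Conv2D(filters, kernel_size=3, strides=2, padding='same')(x)
        x = InstanceNormalization()(x)   # normalizes each sample and channel independently
        return LeakyReLU(alpha=0.2)(x)

    inp = Input(shape=(128, 128, 3))
    model = Model(inp, conv_block(inp, 64))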

Others

  • dlib_video_face_detection.ipynb

    1. Detect/crop faces in a video using dlib's CNN model (a minimal detection sketch follows this list).
    2. Pack the cropped face images into a zip file.
  • Training data: Face images for each target should be placed in the ./faceA/ and ./faceB/ folders respectively. Face images can be of any size. (Updated 3 Jan. 2018)
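
A minimal sketch of the dlib-based workflow is given below. The detector weights path (mmod_human_face_detector.dat), the input/output file names, and the frame loop are assumptions for illustration, not the notebook's exact code.

    import cv2
    import dlib
    import zipfile

    # dlib's CNN face detector; requires the mmod_human_face_detector.dat weights file (assumed path).
    detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

    def crop_faces(frame, upsample=0):
        # dlib expects RGB images; OpenCV reads frames as BGR.
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        crops = []
        for d in detector(rgb, upsample):
            top, left = max(d.rect.top(), 0), max(d.rect.left(), 0)
            crops.append(frame[top:d.rect.bottom(), left:d.rect.right()])
        return crops

    # Read the video frame by frame and pack the cropped faces into a zip archive.
    cap = cv2.VideoCapture("INPUT_VIDEO.mp4")
    idx = 0
    with zipfile.ZipFile("faces.zip", "w") as zf:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            for face in crop_faces(frame):
                fname = "face_%06d.jpg" % idx
                cv2.imwrite(fname, face)
                zf.write(fname)
                idx += 1
    cap.release()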

Results

Below are results showing trained models transforming Hinako Sano (佐野ひなこ) into Emi Takei (武井咲).

Source video: 佐野ひなことすごくどうでもいい話?(遊戯王)

1. Autoencoder baseline

Autoencoder based on deepfakes' script. It should be mentioned that the autoencoder (AE) result could be much better if it were trained longer.

AE_results

2. Generative Adversarial Network, GAN (version 1)

Improved output quality: Adversarial loss improves the reconstruction quality of generated images. In addition, when perceptual loss is applied, the direction of the eyeballs becomes more realistic and consistent with the input face.

GAN_PL_results

VGGFace perceptual loss (PL): The following figure shows the nuanced eyeball direction of output faces trained with/without PL.

Comp PL

Smoothed bounding box (Smoothed bbox): An exponential moving average of the bounding box position over frames is introduced to eliminate jittering of the swapped face. See the gif below for comparison; a minimal smoothing sketch follows the list below.

bbox

  • A. Source face.
  • B. Swapped face, using a smoothing mask (smooths the edges of the output image when pasting it back onto the input image).
  • C. Swapped face, using smoothing mask and face alignment.
  • D. Swapped face, using smoothing mask and smoothed bounding box.
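
A minimal sketch of the bounding box smoothing, assuming the box is represented as (x0, y0, x1, y1); the smoothing factor is an assumed value, not the one used in the notebooks.

    import numpy as np

    class BBoxSmoother:
        """Exponential moving average over bbox coordinates to reduce frame-to-frame jitter."""
        def __init__(self, alpha=0.65):
            self.alpha = alpha   # higher alpha follows the raw detections more closely
            self.prev = None

        def smooth(self, bbox):
            bbox = np.asarray(bbox, dtype=np.float32)
            if self.prev is None:
                self.prev = bbox
            else:
                self.prev = self.alpha * bbox + (1.0 - self.alpha) * self.prev
            return self.prev.astype(int)

    # Usage inside the per-frame loop:
    # smoother = BBoxSmoother()
    # x0, y0, x1, y1 = smoother.smooth(detected_bbox)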

3. Generative Adversarial Network, GAN (version 2)

Version 1 features: Most of the features in version 1 are inherited, including perceptual loss and the smoothed bbox.

Segmentation mask prediction: The model learns a proper mask that helps with handling occlusion, eliminating artifacts at bbox edges, and producing a natural skin tone. A minimal compositing sketch follows the figures below.

mask0

mask1  mask2

  • Left: Source face.
  • Middle: Swapped face, before masking.
  • Right: Swapped face, after masking.
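
A minimal sketch of the compositing step, assuming the generator outputs an alpha mask in [0, 1] alongside the RGB face (the exact output format in the notebooks may differ).

    import numpy as np

    def apply_mask(source_face, swapped_face, alpha_mask):
        # alpha_mask: HxWx1 array in [0, 1] predicted by the generator (assumed format).
        alpha = np.repeat(alpha_mask, 3, axis=-1)
        blended = alpha * swapped_face + (1.0 - alpha) * source_face
        return blended.astype(source_face.dtype)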

Mask visualization: The following gif shows output mask & face bounding box.

mask_vis

  • Left: Source face.
  • Middle: Swapped face, after masking.
  • Right: Mask heatmap & face bounding box.

Optional 128x128 input/output resolution: Increase input and output size to 128x128.

Mask refinement: Tips for mask refinement are provided in the Jupyter notebooks (VGGFace ResNet50 is required; a loading sketch is shown below the figure). The following figure shows generated masks before/after refinement. Input faces are from the CelebA dataset.

mask_refinement
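
One way to obtain VGGFace ResNet50 is through the keras-vggface package; this is an assumption for illustration, and the notebooks may load the weights differently.

    from keras_vggface.vggface import VGGFace

    # ResNet50 backbone with pretrained VGGFace weights, without the classification head.
    vggface = VGGFace(model='resnet50', include_top=False, input_shape=(224, 224, 3))
    vggface.summary()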

Frequently asked questions

1. Video making is slow / OOM error?

  • This is most likely caused by the input video's resolution being too high. Try one of the following:
    Increase video_scaling_offset from 0 to 1 or higher (update 2018-02-07),

    or disable the CNN model for face detection (update 2018-02-07)

    def process_video(...):
      ...
      #faces = get_faces_bbox(image, model="cnn") # Use the CNN model
      faces = get_faces_bbox(image, model='hog') # Use the default HOG detector

    or reduce input size

    def process_video(input_img):
      # Resize to 1/2 the original width and height.
      input_img = cv2.resize(input_img, (input_img.shape[1]//2, input_img.shape[0]//2))
      image = input_img
      ...

2. How does it work?

  • This illustration shows a very high-level and abstract (not exactly the same as the actual implementation) flowchart of the denoising autoencoder algorithm. The objective functions look like this.
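
    As a rough sketch only (the exact terms, helper tensors, and weights in the notebooks differ), the generator objective combines a reconstruction term, an adversarial term, and a VGGFace perceptual term:

    from keras import backend as K

    # Illustrative generator objective; weights and tensor names are assumptions.
    def generator_loss(real, fake, d_score_fake, vgg_real, vgg_fake,
                       w_recon=1.0, w_adv=0.1, w_pl=0.01):
        loss_recon = K.mean(K.abs(fake - real))           # L1 reconstruction
        loss_adv = K.mean(K.square(d_score_fake - 1.0))   # LSGAN-style adversarial term
        loss_pl = K.mean(K.abs(vgg_fake - vgg_real))      # VGGFace perceptual loss
        return w_recon * loss_recon + w_adv * loss_adv + w_pl * loss_pl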

3. No audio in output clips?

  • Set audio=True in the video making cell.
output = 'OUTPUT_VIDEO.mp4'
clip1 = VideoFileClip("INPUT_VIDEO.mp4")
clip = clip1.fl_image(process_video)
%time clip.write_videofile(output, audio=True) # Set audio=True

Requirements

Acknowledgments

Code borrows from tjwei, eriklindernoren, fchollet, keras-contrib, and deepfakes. The generative network is adapted from CycleGAN. Part of the illustrations are from irasutoya.

About

A denoising autoencoder + adversarial loss for face swapping.

