sarthak268 / EmoGen

Emotionally Enhanced Talking Face Generation (an extension to Wav2Lip)


Emotionally Enhanced Talking Face Generation

This repository is the official PyTorch implementation of our paper: Emotionally Enhanced Talking Face Generation. We introduce a multimodal framework that generates lip-synced videos for arbitrary identities, languages, and emotions. The framework comes with a user-friendly web interface that offers a real-time talking face generation experience with emotions.

Model

📑 Original Paper: Paper
📰 Project Page: Project Page
🌀 Demo: Demo Video
Live Testing: Interactive Demo

Disclaimer

All results produced by this open-source code or our demo website should be used for research, academic, or personal purposes only.

Prerequisites

  • ffmpeg: sudo apt-get install ffmpeg
  • Install the necessary packages with pip install -r requirements.txt.
  • Download the pre-trained face-detection model to face_detection/detection/sfd/s3fd.pth. An alternative link is available if the first one does not work.
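
Putting these steps together, a typical setup looks like this (the s3fd.pth download itself is manual; the target path is the one given above):

sudo apt-get install ffmpeg
pip install -r requirements.txt
# after downloading, place the face-detection weights at:
#   face_detection/detection/sfd/s3fd.pth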

Preparing CREMA-D for training

Download data

Download the CREMA-D dataset from this repo.

Convert videos to 25 fps

python convertFPS.py -i <raw_video_folder> -o <folder_to_save_25fps_videos>

Preprocess dataset

python preprocess_crema-d.py --data_root <folder_of_25fps_videos> --preprocessed_root preprocessed_dataset/
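
If preprocessing follows the Wav2Lip convention that this project extends (an assumption; check the script's actual output), preprocessed_dataset/ will contain one sub-folder per video with the cropped face frames and the extracted audio:

preprocessed_dataset/
  <video_id>/
    0.jpg, 1.jpg, ...   (cropped face frames)
    audio.wav           (audio track extracted from the video)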

Train!

There are three major steps: (i) train the expert lip-sync discriminator, (ii) train the emotion discriminator, and (iii) train the EmoGen model.

Training the expert discriminator

python color_syncnet_train.py --data_root preprocessed_dataset/ --checkpoint_dir <folder_to_save_checkpoints>

Training the emotion discriminator

python emotion_disc_train.py -i preprocessed_dataset/ -o <folder_to_save_checkpoints>

Training the main model

python train.py --data_root preprocessed_dataset/ --checkpoint_dir <folder_to_save_checkpoints> --syncnet_checkpoint_path <path_to_expert_disc_checkpoint> --emotion_disc_path <path_to_emotion_disc_checkpoint>

You can also set additional, less commonly used hyperparameters at the bottom of the hparams.py file.
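
As a rough sketch (an assumption based on the Wav2Lip codebase this project extends; the actual field names and defaults are in hparams.py), the file collects all settings in a single HParams object whose values you edit directly:

# Illustrative sketch only: HParams is the small helper class defined at
# the top of hparams.py; the fields and values below are examples, not
# the repository's actual defaults.
hparams = HParams(
    num_mels=80,                 # mel-spectrogram channels for the audio input
    fps=25,                      # must match the 25 fps preprocessed videos
    img_size=96,                 # resolution of the cropped face crops
    batch_size=16,               # training batch size
    initial_learning_rate=1e-4,  # optimizer learning rate
)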

Inference

Model Weights

Model description    Link to the model
EmoGen (PL+DA)       Link
EmoGen (PRE)         Link

Before running inference, comment out these code lines: line1 and line2.

python inference.py --checkpoint_path <ckpt> --face <video.mp4> --audio <an-audio-source> --emotion <categorical emotion>

The result is saved (by default) to results/{emotion}.mp4; the output location, like several other options, can be set with a command-line argument. The audio source can be any file containing audio data that FFmpeg supports: *.wav, *.mp3, or even a video file, from which the code automatically extracts the audio. Choose the categorical emotion from this list: [HAP, SAD, FEA, ANG, DIS, NEU].
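
For example, to generate a happy result (the file names here are placeholders; use one of the checkpoints linked above):

python inference.py --checkpoint_path checkpoints/emogen.pth --face input_video.mp4 --audio speech.wav --emotion HAP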

Tips for better results:

  • Experiment with the --pads argument to adjust the detected face bounding box; this often improves results. You might need to increase the bottom padding to include the chin region, e.g. --pads 0 20 0 0.
  • If the mouth position looks dislocated or you see weird artifacts such as two mouths, the cause may be over-smoothing of the face detections. Try again with the --nosmooth argument.
  • Experiment with the --resize_factor argument to get a lower-resolution video. Why? The models are trained on faces at a lower resolution, so you might get better, more visually pleasing results for 720p videos than for 1080p videos (in many cases, the latter works well too). These options can be combined, as in the example after this list.
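
For instance, a run combining these tips (placeholder file names again, and assuming the flags behave as in Wav2Lip) could look like:

python inference.py --checkpoint_path checkpoints/emogen.pth --face input_video.mp4 --audio speech.wav --emotion SAD --pads 0 20 0 0 --nosmooth --resize_factor 2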

Evaluation

Please check the evaluation/ folder for the instructions.

License and Citation

This repository may be used for personal, research, and non-commercial purposes only. Please cite the following paper if you use this repository:

@misc{goyal2023emotionally,
      title={Emotionally Enhanced Talking Face Generation}, 
      author={Sahil Goyal and Shagun Uppal and Sarthak Bhagat and Yi Yu and Yifang Yin and Rajiv Ratn Shah},
      year={2023},
      eprint={2303.11548},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgements

The code structure is inspired by Wav2Lip; we thank the authors for their wonderful code. The face-detection code is taken from the face_alignment repository; we thank the authors for releasing their code and models. The demo website was developed by @ddhroov10 and @SakshatMali.
