peterwisu / lip-synthesis

Audio-Visual Lip Synthesis via Intermediate Landmark Representation

Audio-Visual Lip Synthesis via intermediate landmark representation | Final Year Project (Dissertation) of Wish Suharitdamrong

This is the code implementation for Wish Suharitdamrong's Year 3 BSc Computer Science Final Year Project at the University of Surrey on the topic of Audio-Visual Lip Synthesis via intermediate landmark representation.


Demo

An online demonstration is available on 🤗 HuggingFace

Installation

There are two ways to install the required packages: using conda or using pip.

1. Create a virtual conda environment from environment.yml.

2. Use pip to install the packages (make sure you use Python 3.7 or above, since older versions might not support some of the libraries).

Use Conda

# Create virtual environment from .yml file
conda env create -f environment.yml

# activate virtual environment
conda activate fyp

Use pip

# Use pip to install required packages
pip install -r requirement.txt
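
To verify the installation, a quick sanity check is to confirm the Python version and that the deep learning framework imports (a minimal sketch, assuming PyTorch is among the pinned dependencies, as it is in the Wav2Lip and MakeItTalk code bases this project builds on):

# Optional sanity check: print the Python and PyTorch versions (assumes PyTorch is a dependency)
python -c "import sys, torch; print(sys.version); print(torch.__version__)"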

Dataset

The audio-visual datasets used in this project are LRS2 and LRS3. LRS2 data was used for both model training and evaluation, while LRS3 data was used only for model evaluation.

| Dataset | Page |
| ------- | ---- |
| LRS2    | Link |
| LRS3    | Link |

Pre-trained weights

Generator model

Download the weights for the Generator model:

| Model | Download Link |
| ----- | ------------- |
| Generator | Link |
| Generator + SyncLoss | Link |
| Attention Generator + SyncLoss | Link |

Landmark SyncNet discriminator

Download the weights for the landmark-based SyncNet model: Download Link

Image-to-Image Translation

Pre-trained weights for the Image2Image Translation model can be downloaded from the MakeItTalk repository, in their pre-trained models section: Repo Link.

Directory

├── checkpoint      # directory for model checkpoints
│   ├── generator   # put Generator model weights here
│   ├── syncnet     # put Landmark SyncNet model weights here
│   └── image2image # put Image2Image Translation model weights here
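
A minimal sketch for putting the downloaded weights in place (the checkpoint file names below are hypothetical; keep whatever names the downloaded files actually have):

# Create the checkpoint directories
mkdir -p checkpoint/generator checkpoint/syncnet checkpoint/image2image

# Move the downloaded weights into place (file names are hypothetical examples)
mv ~/Downloads/generator.pth   checkpoint/generator/
mv ~/Downloads/syncnet.pth     checkpoint/syncnet/
mv ~/Downloads/image2image.pth checkpoint/image2image/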

Run Inference

python run_inference.py --generator_checkpoint <checkpoint_path> --image2image_checkpoint <checkpoint_path> --input_face <image/video_path> --input_audio <audio_source_path>
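
For example, a hypothetical invocation that animates a still image with a speech clip, using the checkpoint layout above (all file names are placeholders):

# Example run with placeholder paths
python run_inference.py \
    --generator_checkpoint checkpoint/generator/generator.pth \
    --image2image_checkpoint checkpoint/image2image/image2image.pth \
    --input_face input.png \
    --input_audio speech.wav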

Data Preprocessing

I used the same data preprocessing approach as Wav2Lip; more details on the folder structure can be found in their repository Here.

python preprocess_data.py --data_root data_root/main --preprocessed_root preprocessed_lrs2_landmark/
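
The command assumes the raw LRS2 layout described in the Wav2Lip repository, i.e. a main folder containing one sub-folder per video with numbered .mp4 clips inside (a sketch; folder names are examples):

data_root
├── main
│   ├── <video id folder>
│   │   ├── 00001.mp4
│   │   └── 00002.mp4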

Train Model

Generator

# CLI for training the attention generator with a pretrained landmark SyncNet discriminator
python run_train_generator.py --model_type attnlstm --train_type pretrain --data_root preprocessed_lrs2_landmark/ --checkpoint_dir <folder_to_save_checkpoints>

Landmark SyncNet

# CLI for pretraining the landmark SyncNet discriminator
python run_train_syncnet.py --data_root preprocessed_lrs2_landmark/ --checkpoint_dir <folder_to_save_checkpoints>

Generate videos for evaluation & benchmark from LRS2 and LRS3

This project used data from the LRS2 and LRS3 datasets for quantitative evaluation; the list of evaluation data is provided by Wav2Lip. The filelist (video and audio data used for evaluation) and details about the lip-sync benchmark are available in their repository Here.

Generate evaluation from filelist

cd evaluation
# generate evaluation videos
python gen_eval_vdo.py --filelist <path> --data_root <path>  --model_type <type_of_model> --result_dir <save_path> --generator_checkpoint <gen_ckpt> --image2image_checkpoint <image2image_checkpoint>
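
A hypothetical end-to-end example (every path below is a placeholder; the filelist comes from the Wav2Lip repository, and --model_type should match the generator checkpoint being evaluated, e.g. attnlstm for the attention generator as in the training command above):

cd evaluation
# Example with placeholder paths
python gen_eval_vdo.py \
    --filelist lrs2_filelist.txt \
    --data_root ../preprocessed_lrs2_landmark/ \
    --model_type attnlstm \
    --result_dir results_lrs2/ \
    --generator_checkpoint ../checkpoint/generator/generator.pth \
    --image2image_checkpoint ../checkpoint/image2image/image2image.pth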

Acknowledgement

The code base of this project was inspired by Wav2Lip and MakeItTalk. I would like to thank the authors of both projects for making the code implementations of their amazing work available online.
