3DTalkingFaceMesh is a deep learning model that generates 3D facial meshes lip-synchronized to speech input. Given a raw audio file, the model outputs a sequence of facial landmarks that follow the MediaPipe FaceMesh standard. The model is built on a Transformer encoder-decoder architecture, with OpenAI Whisper serving as the speech encoder.
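As a rough illustration of the architecture described above, here is a minimal PyTorch sketch: a Transformer decoder that attends to precomputed Whisper speech embeddings and regresses MediaPipe FaceMesh landmarks. All names and dimensions (e.g. `whisper_dim=384`) are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

NUM_LANDMARKS = 478  # MediaPipe FaceMesh vertex count

class TalkingFaceMeshDecoder(nn.Module):
    """Hypothetical sketch: decode Whisper embeddings into landmark sequences."""

    def __init__(self, feature_dim=256, whisper_dim=384, num_layers=4, num_heads=4):
        super().__init__()
        # project Whisper embeddings into the decoder's hidden dimension
        self.audio_proj = nn.Linear(whisper_dim, feature_dim)
        layer = nn.TransformerDecoderLayer(
            d_model=feature_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # regress x, y, z for every FaceMesh landmark
        self.landmark_head = nn.Linear(feature_dim, NUM_LANDMARKS * 3)
        self.query_embed = nn.Parameter(torch.randn(1, 1, feature_dim))

    def forward(self, whisper_embeddings):
        # whisper_embeddings: (batch, audio_frames, whisper_dim)
        memory = self.audio_proj(whisper_embeddings)
        # one learned query per audio frame, cross-attending to the audio
        tgt = self.query_embed.expand(memory.size(0), memory.size(1), -1)
        out = self.decoder(tgt, memory)
        return self.landmark_head(out).view(
            out.size(0), out.size(1), NUM_LANDMARKS, 3)

model = TalkingFaceMeshDecoder()
dummy = torch.randn(2, 50, 384)  # batch of 2 clips, 50 audio frames each
print(model(dummy).shape)  # torch.Size([2, 50, 478, 3])
```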
To set up the necessary environment, install the dependencies listed in the `requirements.txt` file. Additionally, you will need `ffmpeg` for video and audio processing.
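For example, on a pip-based setup (the `ffmpeg` install command varies by platform; the apt-get line below is one assumption):

```shell
# install Python dependencies
pip install -r requirements.txt

# install ffmpeg (Debian/Ubuntu example; use brew, choco, etc. elsewhere)
sudo apt-get install ffmpeg
```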
The Obama Weekly Address dataset consists of high-quality, fixed-pose speech videos featuring former President Barack Obama. You can find the list of YouTube video IDs in `data/video_links.txt`.
To download the raw videos, execute the following command:
python download_video.py
This command creates a `data/video` folder that contains all the downloaded video files.
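A minimal sketch of what such a download script might do, assuming `yt-dlp` is available on the system (the repository's `download_video.py` may use a different tool or options):

```python
import subprocess
from pathlib import Path

def download_videos(link_file="data/video_links.txt", out_dir="data/video"):
    """Hypothetical sketch: download each listed YouTube ID as an mp4."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for video_id in Path(link_file).read_text().split():
        # assumes yt-dlp is installed; one file per video ID
        subprocess.run(
            ["yt-dlp", "-f", "mp4",
             "-o", f"{out_dir}/{video_id}.mp4",
             f"https://www.youtube.com/watch?v={video_id}"],
            check=True)
```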
The VoxCeleb2 dataset comprises over 1 million in-the-wild YouTube clips of more than 6,000 celebrities speaking. However, we found that training the model solely on Obama speech videos was sufficient for zero-shot adaptation to other voices and languages, thanks to the Whisper speech embeddings.
Extract Landmarks: To extract facial landmarks from the videos, run:
python utils/extract_landmarks.py
This command creates a `data/landmarks` folder containing the extracted landmarks for each video, and a `data/validity` folder containing frame-specific labels that indicate whether the corresponding landmarks are valid. A frame is considered invalid if no faces are detected or if multiple faces are present.
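The validity rule described above (a frame is valid only when exactly one face is detected) can be sketched as a small helper; the function name and use of NumPy here are illustrative, not the repository's actual code:

```python
import numpy as np

def frame_validity(face_counts):
    """Label each frame valid (1) only when exactly one face was detected.

    face_counts: per-frame number of faces returned by the landmark
    detector (e.g. MediaPipe FaceMesh). Zero faces or multiple faces
    both mark the frame invalid (0).
    """
    counts = np.asarray(face_counts)
    return (counts == 1).astype(np.uint8)

print(frame_validity([1, 0, 2, 1]))  # [1 0 0 1]
```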
Extract Audio: To extract audio from the raw videos and save it in NumPy `.npy` format, run:
python utils/load_audio.py
This command creates a `data/audio` folder containing the audio extracted from the raw videos as `.npy` files.
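As a sketch of what this step might look like, here is a minimal wav-to-npy converter using only the standard library and NumPy; the function name and the float normalization are assumptions, not the actual `utils/load_audio.py`:

```python
import wave
from pathlib import Path

import numpy as np

def wav_to_npy(wav_path, out_dir="data/audio"):
    """Hypothetical sketch: read a 16-bit PCM wav and save samples as .npy."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with wave.open(str(wav_path), "rb") as wf:
        frames = wf.readframes(wf.getnframes())
        audio = np.frombuffer(frames, dtype=np.int16)
        # normalize to [-1, 1] float32, a common model input format
        audio = audio.astype(np.float32) / 32768.0
    out_path = Path(out_dir) / (Path(wav_path).stem + ".npy")
    np.save(out_path, audio)
    return out_path
```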
To train the 3DTalkingFaceMesh model, execute the following command:
python main.py
You can find a list of available arguments by checking the main function. For example:
python main.py --seed 42 --lr 0.003 --feature_dim 256 --optimizer lion --audio_encoder whisper
This command trains the model with a fixed seed of 42, a learning rate of 0.003, a Transformer decoder hidden dimension of 256, the Lion optimizer, and Whisper speech embeddings.
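For illustration, the command-line interface above might be wired up with `argparse` roughly like this; the actual option list lives in the repository's main function and may differ:

```python
import argparse

def build_parser():
    """Hypothetical sketch of main.py's argument parser."""
    p = argparse.ArgumentParser(description="Train 3DTalkingFaceMesh")
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--lr", type=float, default=3e-3)
    p.add_argument("--feature_dim", type=int, default=256,
                   help="hidden dimension of the Transformer decoder")
    p.add_argument("--optimizer", choices=["adam", "lion"], default="lion")
    p.add_argument("--audio_encoder", choices=["whisper"], default="whisper")
    return p

args = build_parser().parse_args(
    "--seed 42 --lr 0.003 --feature_dim 256 "
    "--optimizer lion --audio_encoder whisper".split())
print(args.lr, args.optimizer)  # 0.003 lion
```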