Senthi1Kumar / Wav2Lip

Lip-sync any video and audio with the Wav2Lip model

Lip-sync video and audio using Wav2Lip

The goal is to create a lip-synced video from the target video and audio, using the links provided in the given PDF. The following is the link to the pre-trained model required for the first assignment:

Open in [Colab Notebook] or see here


Wav2Lip is a lip-syncing algorithm that can be used to create videos of people talking, even if the audio was not originally recorded with the video. It works by first detecting the face in each video frame and then generating frames whose lip movements match the audio. The generated frames are then stitched together to create a new video.

Use the pre-trained Wav2Lip model to lip-sync any video to any audio:

  • Clone the repo (the repo below works well)
git clone
  • Install the additional required libraries and those listed in the repo's requirements.txt
!pip install -q pytube
!pip install ffmpeg-python mediapipe==0.8.11
!pip install -r requirements.txt
  • Download the pre-trained models' weights from the links below, or use the shell commands below

| Model | Description | Link to the model |
| --- | --- | --- |
| Wav2Lip | Highly accurate lip-sync | Link |
| Wav2Lip + GAN | Slightly inferior lip-sync, but better visual quality | Link |

!wget '' -O 'checkpoints/wav2lip.pth'
!wget '' -O 'checkpoints/wav2lip_gan.pth'
!wget '' -O 'checkpoints/resnet50.pth'
!wget '' -O 'checkpoints/mobilenet.pth'
a = !pip install
!pip install git+
  • Download the target video using pytube
# Downloading the target YouTube video using pytube
# URL -
import pytube
yt_url = input("Enter the YouTube URL: ")
video = pytube.YouTube(yt_url)
streams = video.streams
# Pick the first 720p stream and save it to the input_videos folder on Drive
stream = streams.filter(res='720p').first()
stream.download('/content/drive/MyDrive/Wav2Lip/input_videos/')
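
Before moving on, it can help to confirm that the weight downloads above actually produced usable files, since a failed `wget` can leave an empty file behind. The following is a minimal sketch, assuming the four file names from the `wget` commands and the `checkpoints/` folder; the helper name `missing_checkpoints` is hypothetical and not part of the repo.

```python
from pathlib import Path

# File names taken from the wget commands above.
EXPECTED = ["wav2lip.pth", "wav2lip_gan.pth", "resnet50.pth", "mobilenet.pth"]


def missing_checkpoints(checkpoint_dir="checkpoints", expected=EXPECTED):
    """Return the expected weight files that are absent or empty.

    Hypothetical helper for illustration; it only checks that each file
    exists and has a non-zero size, not that the weights are valid.
    """
    root = Path(checkpoint_dir)
    problems = []
    for name in expected:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(name)
    return problems
```

Calling `missing_checkpoints()` from the repo root returns an empty list when all four files are present and non-empty; any name it returns should be downloaded again.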

After downloading the target video from YouTube at 720p, we have to trim it with any video-editing software, because the audio file is only 1 minute 7 seconds long. We also have to trim the video because the model needs a face in every video frame. The figure below marks all the non-face segments in the video along with the duration of each. After trimming, the video is 1 minute 2 seconds long.
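
The non-face segments were marked by hand here, but the same bookkeeping can be sketched in code. The example below assumes a per-frame list of booleans saying whether a face was found (produced by any face detector, which is not shown) and the video's frame rate. The function name `non_face_segments` is hypothetical.

```python
def non_face_segments(has_face, fps):
    """Return (start_s, end_s) for each run of consecutive frames without a face.

    has_face: one boolean per frame, True when a face was detected
              (assumed to come from a face detector of your choice).
    fps:      frames per second of the video.
    """
    segments = []
    start = None  # index of the first frame in the current non-face run
    for i, flag in enumerate(has_face):
        if not flag and start is None:
            start = i
        elif flag and start is not None:
            segments.append((start / fps, i / fps))
            start = None
    if start is not None:  # the video ends inside a non-face run
        segments.append((start / fps, len(has_face) / fps))
    return segments


# Toy input: 10 frames at 2 fps, with two short non-face runs.
flags = [True] * 4 + [False] * 2 + [True] * 2 + [False] * 2
print(non_face_segments(flags, fps=2))  # [(2.0, 3.0), (4.0, 5.0)]
```

Each returned pair is a span to cut in the video editor.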

duration of the video

We also have to trim the audio at the points that correspond to the non-face segments removed from the video. (This approach loses some information and may cause a mismatch in context.)
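
To see which parts of the audio survive such a trim, the removed segments can be turned into a list of intervals to keep. This is a minimal sketch: the function names are hypothetical, and the segment positions in the example are invented for illustration. Only the 67-second audio length comes from the text above.

```python
def keep_intervals(total_s, removed):
    """Return the parts of [0, total_s] not covered by the removed segments.

    removed: iterable of (start_s, end_s) pairs; they may overlap or be unsorted.
    """
    kept = []
    cursor = 0.0  # end of the region processed so far
    for start, end in sorted(removed):
        # Clamp each segment to the clip boundaries.
        start = min(max(start, 0.0), total_s)
        end = min(max(end, 0.0), total_s)
        if start > cursor:
            kept.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total_s:
        kept.append((cursor, total_s))
    return kept


def total_duration(intervals):
    """Sum the lengths of (start_s, end_s) intervals, in seconds."""
    return sum(end - start for start, end in intervals)


# Illustrative only: a 67 s clip with two made-up non-face segments (5 s in total).
kept = keep_intervals(67.0, [(10.0, 12.0), (30.0, 33.0)])
print(kept)                  # [(0.0, 10.0), (12.0, 30.0), (33.0, 67.0)]
print(total_duration(kept))  # 62.0
```

The kept intervals are the spans to export and join in the audio editor. Cutting the same spans from both the video and the audio keeps the two the same length.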


python --checkpoint_path <ckpt> --face <video.mp4> --audio <an-audio-source> 

This Drive link has the HD (720p) output video generated by the pre-trained Wav2Lip + GAN model, run on a T4 GPU in Google Colab.


  • The generated video has fairly good lip-sync quality. Because the author's face moves around in the video, the model's face detection produces some jitter. Resolving this requires a better face-detection algorithm such as YOLO or RetinaFace.
  • Because both were trimmed, the context of the trimmed audio may not match that of the trimmed video.


