Lip Reading using Computer Vision

This repository contains the code for the W251 final project, Watch The Whisper, which sits at the intersection of speech, computer vision, and natural language processing. The code is based on Deep Audio-Visual Speech Recognition, a PyTorch reproduction of the TM-CTC model from the Deep Audio-Visual Speech Recognition paper.

Abstract

Approximately 17.9 million people in the United States have trouble using their voices. In some cases, the condition affects the vocal folds in the larynx, which can result in complete loss of voice and impair basic communication in daily life. We demonstrate visual speech recognition using computer vision on edge devices. The advantage of the proposed method is that it is a natural extension of users' current lifestyle and is simple to operate. It is a promising alternative to the currently available external wearable solutions.

The final paper is here.

Details

The model was trained on the LRS2 dataset for the visual speech-to-text transcription task.

Requirements

The recommended way to install the dependencies is to create a new virtual environment and then install from the requirements.txt file under server/src:

pip install -r requirements.txt

Project Folder Structure

Directories

/client: Directory of client-side code and the corresponding Docker setup. This is used to capture or stream video.

/server/src: Directory of server-side code. The structure of the server-side code is as follows:

/checkpoints: Temporary directory to store intermediate model weights and plots while training. It is created automatically.

/data: Directory containing the LRS2 Main and Pretrain dataset class definitions and other required data-related utility functions.

/final: Directory to store the final trained model weights and plots. If available, place the pre-trained model weights in the models subdirectory.

/models: Directory containing the class definitions for the models.

/utils: Directory containing function definitions for calculating CER/WER, greedy search/beam search decoders and preprocessing of data samples. Also contains functions to train and evaluate the model.
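The greedy search decoder mentioned under /utils can be illustrated with a minimal sketch of the standard CTC greedy rule: take the best class per frame, collapse repeated classes, then drop blanks. This is not the project's implementation. The blank index, the tiny vocabulary, and the names `ctc_greedy_decode` and `one_hot` are assumptions made for the example.

```python
from itertools import groupby
from typing import List, Sequence

BLANK_INDEX = 0  # assumption: index 0 is the CTC blank symbol
# Hypothetical, tiny vocabulary used only for this example.
INDEX_TO_CHAR = {1: " ", 2: "H", 3: "I", 4: "O"}


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]]) -> str:
    """Decode per-frame class scores with the CTC greedy rule."""
    # 1. Pick the highest-scoring class index for every frame.
    best_path = [
        max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores
    ]
    # 2. Collapse consecutive repeats of the same class into one symbol.
    collapsed = [index for index, _ in groupby(best_path)]
    # 3. Remove blanks and map the remaining indices to characters.
    return "".join(INDEX_TO_CHAR[i] for i in collapsed if i != BLANK_INDEX)


def one_hot(index: int, size: int = 5) -> List[float]:
    """Build a score vector that puts all mass on one class (example helper)."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Frame-level best path: H H blank I I blank blank
frames = [one_hot(i) for i in [2, 2, 0, 3, 3, 0, 0]]
print(ctc_greedy_decode(frames))  # HI
```

A blank between two identical symbols keeps them separate, so the path O, blank, O decodes to "OO" while O, O decodes to "O". Beam search replaces step 1 with a search over several candidate paths.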

Files

checker.py: File containing checker/debug functions for testing all the modules and the functions in the project as well as any other checks to be performed.

config.py: File to set the configuration options and hyperparameter values.

preprocess.py: Python script for preprocessing all the data samples in the dataset.

pretrain.py: Python script for pretraining the model on the pretrain set of the LRS2 dataset using curriculum learning.

test.py: Python script to test the trained model on the test set of the LRS2 dataset.

train.py: Python script to train the model on the train set of the LRS2 dataset.

inference.py: Python script for generating predictions with the specified trained model for incoming videos.
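The curriculum learning mentioned for pretrain.py can be sketched as follows, under the assumption that the curriculum controls how many words each pretraining sample contains and that this count grows in stages. The schedule values and the names `CURRICULUM_WORD_COUNTS` and `sample_word_span` are hypothetical and are not taken from the project's code or config.py. A real pipeline would also need to crop the video frames to match the chosen words, which this sketch omits.

```python
import random
from typing import Optional

# Hypothetical schedule: number of words per sample at each curriculum stage.
CURRICULUM_WORD_COUNTS = [1, 2, 3, 5]


def sample_word_span(
    transcript: str, num_words: int, rng: Optional[random.Random] = None
) -> str:
    """Return a random contiguous span of at most `num_words` words."""
    rng = rng or random.Random()
    words = transcript.split()
    if len(words) <= num_words:
        # Short transcripts are used whole.
        return " ".join(words)
    # Choose a start position so the span fits inside the transcript.
    start = rng.randint(0, len(words) - num_words)
    return " ".join(words[start:start + num_words])


rng = random.Random(0)  # fixed seed so the example is repeatable
for num_words in CURRICULUM_WORD_COUNTS:
    print(num_words, "->", sample_word_span("WHEN I MAKE MY PASTRY", num_words, rng))
```

Starting with short spans gives the model easier alignment problems first; each later stage would typically resume from the weights of the previous one.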

Results

Speaker	Professional	Author	Author	Author	Author	Author
Ground Truth	When I make my pastry	Its that simple	Its a wonderful day	Morning	How are you	Morning
Prediction	When I make my pantrick	Its that simple	Its a work of art the	More ing	On and are you	Its so boring
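To put numbers on the examples above, the following is a minimal sketch of word error rate (WER) and character error rate (CER) based on Levenshtein edit distance. It is not the project's own CER/WER code. The function names and the choice to lowercase both strings before comparing are assumptions made for this example.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.lower().split()
    hyp_words = prediction.lower().split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def char_error_rate(reference: str, prediction: str) -> float:
    """Character-level edit distance divided by the reference length."""
    ref_chars, hyp_chars = reference.lower(), prediction.lower()
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# Ground truth and prediction pairs copied from the table above.
pairs = [
    ("When I make my pastry", "When I make my pantrick"),
    ("Its that simple", "Its that simple"),
    ("Its a wonderful day", "Its a work of art the"),
    ("Morning", "More ing"),
    ("How are you", "On and are you"),
    ("Morning", "Its so boring"),
]
for reference, prediction in pairs:
    wer = word_error_rate(reference, prediction)
    cer = char_error_rate(reference, prediction)
    print(f"{reference!r}: WER = {wer:.2f}, CER = {cer:.2f}")
```

Because insertions count as errors, WER can exceed 1.0 when the prediction contains more words than the reference, as in the "Morning" examples.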
