
Visual Speech Recognition For Low-Resource Languages with Automatic Labels


Visual Speech Recognition for Languages with Limited Labeled Data Using Automatic Labels from Whisper

This repository contains the Official PyTorch implementation code of the following paper:

Visual Speech Recognition for Languages with Limited Labeled Data Using Automatic Labels from Whisper

*Jeong Hun Yeo, *Minsu Kim, Shinji Watanabe, and Yong Man Ro
[Paper]

We release the automatic labels of four low-resource languages (French, Italian, Portuguese, and Spanish).

To generate the automatic labels, we first identify the language of every video in VoxCeleb2 and AVSpeech, and then produce the transcriptions (automatic labels) with a pretrained ASR model. In this project, we use the Whisper "large-v2" model for both steps.
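
This pipeline can be sketched with the openai-whisper package as below. This is a minimal illustration only; the exact scripts used to produce the released labels may differ.

```python
# Minimal sketch of the labeling pipeline, using the openai-whisper
# package (pip install openai-whisper). Illustration only; the exact
# scripts used to produce the released labels may differ.
import whisper

model = whisper.load_model("large-v2")

def label_video(path, target_langs=("fr", "it", "pt", "es")):
    # Step 1: identify the spoken language from the first 30 s of audio.
    audio = whisper.pad_or_trim(whisper.load_audio(path))
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, probs = model.detect_language(mel)
    lang = max(probs, key=probs.get)
    if lang not in target_langs:
        return None  # skip videos outside the four target languages
    # Step 2: transcribe the full audio in the detected language to
    # obtain the automatic label.
    result = model.transcribe(path, language=lang)
    return lang, result["text"]
```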

Environment Setup

```bash
conda create -n vsr-low python=3.9 -y
conda activate vsr-low
git clone https://github.com/JeongHun0716/vsr-low
cd vsr-low
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
pip install hydra-core==1.3.0
pip install omegaconf==2.3.0
pip install pytorch-lightning==1.5.10
pip install sentencepiece
pip install av
```
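
A quick sanity check (ours, not part of the repository) can confirm that the pinned packages installed correctly:

```python
# Quick sanity check (not part of the repository) that the pinned
# dependencies import cleanly and PyTorch sees the GPU.
import torch, torchvision, torchaudio
import fairseq, pytorch_lightning, sentencepiece, av

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("lightning:", pytorch_lightning.__version__)
```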

Dataset preparation

Multilingual TEDx (mTEDx), VoxCeleb2, and AVSpeech datasets.

  1. Download the mTEDx dataset from the official mTEDx website.
  2. Download the VoxCeleb2 dataset from the official VoxCeleb2 website.
  3. Download the AVSpeech dataset from the official AVSpeech website.

If you want to train a VSR model for a specific target language, we recommend using the language-detected file lists (e.g., the link provided in this project) instead of the video lists from the official AVSpeech website: AVSpeech is so large that preparing all of it takes a very long time. A rough clip-cutting sketch is given below.
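
The sketch assumes the standard AVSpeech CSV layout (YouTube ID, start time, end time, face x, face y) and that yt-dlp and ffmpeg are installed; the language-filtered lists released with this project may use a different format, so adapt the parsing accordingly.

```python
# Rough sketch for cutting AVSpeech clips. Assumes the standard CSV
# layout (youtube_id, start_sec, end_sec, face_x, face_y) and that
# yt-dlp and ffmpeg are on PATH; adjust for the filtered lists used here.
import csv
import subprocess
from pathlib import Path

def cut_clips(csv_path, out_dir):
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(csv_path, newline="") as f:
        for ytid, start, end, *_ in csv.reader(f):
            full = out_dir / f"{ytid}.mp4"
            clip = out_dir / f"{ytid}_{start}_{end}.mp4"
            if not full.exists():
                subprocess.run(["yt-dlp", "-f", "mp4", "-o", str(full),
                                f"https://www.youtube.com/watch?v={ytid}"], check=True)
            # Re-encode while trimming so the cut is frame-accurate.
            subprocess.run(["ffmpeg", "-y", "-i", str(full),
                            "-ss", start, "-to", end, str(clip)], check=True)
```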

Preprocessing

After downloading the datasets, you should detect the facial landmarks in all videos and crop the mouth region using these landmarks. We recommend preprocessing the videos by following Visual Speech Recognition for Multiple Languages.
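
As an illustration, assuming 68-point facial landmarks in the 300-W convention (mouth = indices 48-67) have already been detected for each frame, a fixed-size mouth crop can be taken as follows; the reference pipeline above uses its own detector and crop settings.

```python
# Minimal mouth-cropping sketch. Assumes (68, 2) landmarks per frame in
# the 300-W convention (mouth = indices 48-67); the reference pipeline
# uses its own detector and crop settings (e.g., 96x96).
import cv2
import numpy as np

def crop_mouth(frame, landmarks, size=96):
    """frame: HxWx3 uint8 image; landmarks: (68, 2) array of (x, y)."""
    cx, cy = landmarks[48:68].mean(axis=0)  # center of the mouth points
    half = size // 2
    # Pad the frame so crops near the image border stay valid.
    padded = cv2.copyMakeBorder(frame, half, half, half, half,
                                cv2.BORDER_CONSTANT, value=0)
    x0, y0 = int(cx), int(cy)  # padding shifts coordinates by +half
    return padded[y0:y0 + size, x0:x0 + size]
```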

Training the Model

The training code will be released soon.

Inference

Download the checkpoints from the links below and move them to the pretrained_models directory. You can evaluate the performance of each model using the scripts available in the scripts directory.
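
The exact evaluation command depends on the scripts in the scripts directory. As a quick check that a downloaded checkpoint is intact, something like the following can be used; we make no assumptions about the checkpoint's internal structure and only print its top-level keys.

```python
# Quick check that a downloaded checkpoint loads; the structure of the
# checkpoint dict is not documented here, so we only print its keys.
import torch

ckpt = torch.load("pretrained_models/ckpt.pt", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))
```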

Pretrained Models

mTEDx Fr

| Model | Training Datasets | Training data (h) | WER [%] | Target Languages |
| --- | --- | --- | --- | --- |
| ckpt.pt | mTEDx | 85 | 65.25 | Fr |
| ckpt.pt | mTEDx + VoxCeleb2 | 209 | 60.61 | Fr |
| ckpt.pt | mTEDx + VoxCeleb2 + AVSpeech | 331 | 58.30 | Fr |

mTEDx It

| Model | Training Datasets | Training data (h) | WER [%] | Target Languages |
| --- | --- | --- | --- | --- |
| ckpt.pt | mTEDx | 46 | 60.40 | It |
| ckpt.pt | mTEDx + VoxCeleb2 | 84 | 56.48 | It |
| ckpt.pt | mTEDx + VoxCeleb2 + AVSpeech | 152 | 51.79 | It |

mTEDx Es

| Model | Training Datasets | Training data (h) | WER [%] | Target Languages |
| --- | --- | --- | --- | --- |
| ckpt.pt | mTEDx | 72 | 59.91 | Es |
| ckpt.pt | mTEDx + VoxCeleb2 | 114 | 54.05 | Es |
| ckpt.pt | mTEDx + VoxCeleb2 + AVSpeech | 384 | 45.71 | Es |

mTEDx Pt

| Model | Training Datasets | Training data (h) | WER [%] | Target Languages |
| --- | --- | --- | --- | --- |
| ckpt.pt | mTEDx | 82 | 59.45 | Pt |
| ckpt.pt | mTEDx + VoxCeleb2 | 91 | 58.82 | Pt |
| ckpt.pt | mTEDx + VoxCeleb2 + AVSpeech | 420 | 47.89 | Pt |

Citation

If you find this work useful in your research, please cite the paper:

@inproceedings{yeo2024visual,
  title={Visual Speech Recognition for Languages with Limited Labeled Data Using Automatic Labels from Whisper},
  author={Yeo, Jeong Hun and Kim, Minsu and Watanabe, Shinji and Ro, Yong Man},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={10471--10475},
  year={2024},
  organization={IEEE}
}

Acknowledgement

This project is based on the auto-avsr code. We would like to acknowledge and thank the original developers of auto-avsr for their contributions and the open-source community for making this work possible.

auto-avsr Repository: auto-avsr GitHub Repository
