ViSpeR: Multilingual Audio-Visual Speech Recognition

This repository contains ViSpeR, a large-scale dataset and models for Visual Speech Recognition (VSR) in English, Arabic, Chinese, French and Spanish.

Dataset Summary:

Given the scarcity of publicly available VSR data for non-English languages, we collected VSR data at scale for four of the most widely spoken languages: Arabic, Chinese, French and Spanish.

Comparison of VSR datasets. Our proposed ViSpeR dataset is larger than other datasets covering non-English languages for the VSR task. Sizes are given in hours; for our dataset, the numbers in parentheses denote the number of clips. We also report clip coverage separately for the TedX and Wild subsets of ViSpeR.

| Dataset | French (fr) | Spanish (es) | Arabic (ar) | Chinese (zh) |
|---|---|---|---|---|
| MuAVIC | 176 | 178 | 16 | -- |
| VoxCeleb2 | 124 | 42 | -- | -- |
| AVSpeech | 122 | 270 | -- | -- |
| ViSpeR (TedX) | 192 (160k) | 207 (151k) | 49 (48k) | 129 (143k) |
| ViSpeR (Wild) | 680 (481k) | 587 (383k) | 1152 (1.01M) | 658 (593k) |
| ViSpeR (full) | 872 (641k) | 794 (534k) | 1200 (1.06M) | 787 (736k) |

Downloading the data:

First, use the provided video lists to download the videos and put them into separate folders, one per language (a minimal download sketch is given after the directory layout below). The available splits per language are:

| Language | Splits |
|---|---|
| French | train, test_tedx, test_wild |
| Spanish | train, test_tedx, test_wild |
| Chinese | train, test_tedx, test_wild |
| Arabic | train (coming soon), test_tedx, test_wild |

The raw data should be structured as follows:

Data/
├── Chinese/
│ ├── video_id.mp4
│ └── ...
├── Arabic/
│ ├── video_id.mp4
│ └── ...
├── French/
│ ├── video_id.mp4
│ └── ...
├── Spanish/
│ ├── video_id.mp4
│ └── ...
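
For illustration only, the sketch below downloads the raw videos for one language. It assumes the provided video lists are plain-text files with one YouTube video ID per line and uses yt-dlp as the downloader; the list file name and output folder are hypothetical, so adapt them to however the released lists are actually formatted.

```python
# Hedged sketch: download raw videos for one language with yt-dlp.
# Assumes a plain-text list with one YouTube video ID per line; the
# list file name and output folder below are hypothetical examples.
import subprocess
from pathlib import Path

def download_language(id_list: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for video_id in Path(id_list).read_text().split():
        subprocess.run(
            [
                "yt-dlp",
                f"https://www.youtube.com/watch?v={video_id}",
                "-o", str(out / f"{video_id}.%(ext)s"),
                "--merge-output-format", "mp4",
            ],
            check=False,  # skip videos that are no longer available
        )

if __name__ == "__main__":
    download_language("video_lists/french.txt", "Data/French")
```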

Setup:

  1. Set up the environment and repo:
conda create --name visper python=3.10
conda activate visper
git clone https://github.com/YasserdahouML/visper
cd visper
  2. Install fairseq within the repository:
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
cd ..
  3. Install PyTorch (tested PyTorch version: v2.2.2) and other packages:
pip install torch torchvision torchaudio
pip install pytorch-lightning
pip install sentencepiece
pip install av
pip install hydra-core --upgrade
  4. Install ffmpeg:
conda install "ffmpeg<5" -c conda-forge
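
After installation, a quick sanity check such as the following (a minimal sketch; the module list simply mirrors the packages installed above) confirms that everything imports cleanly and that the tested PyTorch version is in place:

```python
# Quick environment sanity check (minimal sketch).
import torch
import torchvision
import torchaudio
import pytorch_lightning
import sentencepiece
import av
import hydra

print("torch:", torch.__version__)            # tested with v2.2.2
print("CUDA available:", torch.cuda.is_available())
```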

Processing the data:

You need to download the metadata from Huggingface🤗; this includes train.tar.gz and test.tar.gz. Then use the provided metadata to process the raw data and create the ViSpeR dataset. You can use crop_videos.py to process the data; note that all clips are cropped and transformed in this step. The available metadata splits per language are:

| Language | Splits |
|---|---|
| French | train, test |
| Spanish | train, test |
| Chinese | train, test |
| Arabic | train (coming soon), test |

python data_prepare/crop_videos.py --video_dir [path_to_data_language] --save_path [save_path_language] --json_path [language_metadata_path] --use_ffmpeg True

The processed ViSpeR data will then be structured as follows:

ViSpeR/
├── Chinese/
│ ├── video_id/
│ │  │── 00001.mp4
│ │  │── 00001.json
│ └── ...
├── Arabic/
│ ├── video_id/
│ │  │── 00001.mp4
│ │  │── 00001.json
│ └── ...
├── French/
│ ├── video_id/
│ │  │── 00001.mp4
│ │  │── 00001.json
│ └── ...
├── Spanish/
│ ├── video_id/
│ │  │── 00001.mp4
│ │  │── 00001.json
│ └── ...

Each file video_id/xxxx.json contains the 'label' (transcription) of the corresponding clip video_id/xxxx.mp4.
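
For example, a minimal sketch that collects (clip, label) pairs from the processed layout above, assuming each xxxx.json carries the transcription under the 'label' key as described:

```python
# Hedged sketch: collect (clip path, label) pairs from the processed layout.
# Assumes each xxxx.json holds the transcription under the 'label' key.
import json
from pathlib import Path

def collect_pairs(visper_root: str, language: str):
    pairs = []
    for json_file in sorted(Path(visper_root, language).rglob("*.json")):
        clip = json_file.with_suffix(".mp4")
        label = json.loads(json_file.read_text(encoding="utf-8"))["label"]
        pairs.append((clip, label))
    return pairs

print(len(collect_pairs("ViSpeR", "French")))
```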

For English, refer to LRS3 and VoxCeleb-en.

Multilingual ViSpeR

The processed multilingual VSR video-text pairs are used to train a multilingual encoder-decoder model in a fully supervised manner. The supported languages are English, Arabic, French, Spanish and Chinese. For English, we leverage the combined 1,759 hours from LRS3 and VoxCeleb-en. The encoder has 12 layers and the decoder has 6 layers; the hidden size, MLP dimension and number of attention heads are set to 768, 3072 and 12, respectively. A unigram tokenizer is learned on all languages with a vocabulary size of 21k.
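
As a rough illustration of these dimensions only (not the repository's actual model definition), a vanilla PyTorch Transformer with the stated sizes would look like:

```python
# Illustrative sketch of the stated dimensions using a vanilla PyTorch
# Transformer; this is not the repository's actual model code.
import torch.nn as nn

VOCAB_SIZE = 21_000  # unigram tokenizer vocabulary shared across languages

model = nn.Transformer(
    d_model=768,            # hidden size
    nhead=12,               # attention heads
    num_encoder_layers=12,  # encoder depth
    num_decoder_layers=6,   # decoder depth
    dim_feedforward=3072,   # MLP dimension
    batch_first=True,
)
output_head = nn.Linear(768, VOCAB_SIZE)  # projects decoder states to the vocabulary
```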

Results (WER for French, Spanish, Arabic and English; CER for Chinese):

| Language | VSR (WER/CER) | AVSR (WER/CER) |
|---|---|---|
| French | 29.8 | 5.7 |
| Spanish | 39.4 | 4.4 |
| Arabic | 47.8 | 8.4 |
| Chinese | 51.3 (CER) | 15.4 (CER) |
| English | 49.1 | 8.1 |

Model weights can be found on Huggingface🤗:

| Languages | Task | Size | Checkpoint |
|---|---|---|---|
| en, fr, es, ar, zh | AVSR | Base | visper_avsr_base.pth |
| en, fr, es, ar, zh | VSR | Base | visper_vsr_base.pth |

Evaluation

Run evaluation on the videos using

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python infer.py \
ckpt_path=visper_vsr_base.pth \
data.modality=video infer_path=/path/to/files.npy \
infer_lang=[LANG]

For evaluation with the AVSR model, set data.modality=audiovisual and ckpt_path=visper_avsr_base.pth in the command above. [LANG] should be set to one of the five languages (arabic, chinese, french, spanish or english).
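
The infer_path argument points to a .npy file listing the inputs to evaluate. Assuming it is simply a NumPy array of clip paths (an assumption; check the repository's infer.py for the exact expected format), such a file could be built as follows:

```python
# Hedged sketch: build a files.npy list for infer_path.
# Assumption: infer.py expects a NumPy array of clip paths; check infer.py
# for the exact format it actually reads.
from pathlib import Path
import numpy as np

clips = sorted(str(p) for p in Path("ViSpeR/French").rglob("*.mp4"))
np.save("files.npy", np.array(clips))
```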

To test on English, please get the data from WildVSR-en.

Intended Use

This dataset can be used to train models for visual speech recognition. It's particularly useful for research and development purposes in the field of audio-visual content processing. The data can be used to assess the performance of current and future models.

Limitations and Biases

Due to the data collection process focusing on YouTube, biases inherent to the platform may be present in the dataset. Also, while measures are taken to ensure diversity in content, the dataset might still be skewed towards certain types of content due to the filtering process.

Acknowledgement

This repository is built using the espnet, fairseq, auto_avsr and avhubert repositories.

Citation

@article{narayan2024visper,
  title={ViSpeR: Multilingual Audio-Visual Speech Recognition},
  author={Narayan, Sanath and Djilali, Yasser Abdelaziz Dahou and Singh, Ankit and Bihan, Eustache Le and Hacid, Hakim},
  journal={arXiv preprint arXiv:2406.00038},
  year={2024}
}

Check our VSR related works

@inproceedings{djilali2023lip2vec,
  title={Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping},
  author={Djilali, Yasser Abdelaziz Dahou and Narayan, Sanath and Boussaid, Haithem and Almazrouei, Ebtessam and Debbah, Merouane},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={13790--13801},
  year={2023}
}

@inproceedings{djilali2024vsr,
  title={Do VSR Models Generalize Beyond LRS3?},
  author={Djilali, Yasser Abdelaziz Dahou and Narayan, Sanath and LeBihan, Eustache and Boussaid, Haithem and Almazrouei, Ebtesam and Debbah, Merouane},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={6635--6644},
  year={2024}
}
