ViSpeR: Multilingual Audio-Visual Speech Recognition

This repository contains ViSpeR, a large-scale dataset and models for Visual Speech Recognition (VSR) in English, Arabic, Chinese, French and Spanish.

Dataset Summary:

Given the scarcity of publicly available VSR data for non-English languages, we collected VSR data at scale for four of the most widely spoken languages: Arabic, Chinese, French and Spanish.

Comparison of VSR datasets. Our proposed ViSpeR dataset is larger than other datasets covering non-English languages for the VSR task. Sizes are given in hours; for our dataset, the numbers in parentheses denote the number of clips. We also report clip coverage separately for the TedX and Wild subsets of ViSpeR.

| Dataset | French (fr) | Spanish (es) | Arabic (ar) | Chinese (zh) |
|---|---|---|---|---|
| MuAVIC | 176 | 178 | 16 | -- |
| VoxCeleb2 | 124 | 42 | -- | -- |
| AVSpeech | 122 | 270 | -- | -- |
| ViSpeR (TedX) | 192 (160k) | 207 (151k) | 49 (48k) | 129 (143k) |
| ViSpeR (Wild) | 680 (481k) | 587 (383k) | 1152 (1.01M) | 658 (593k) |
| ViSpeR (full) | 872 (641k) | 794 (534k) | 1200 (1.06M) | 787 (736k) |

Downloading the data:

First, use the provided video lists to download the videos and put them into separate folders, one per language (a minimal download sketch is given after the directory layout below). The available splits per language are:

| Language | Splits |
|---|---|
| French | train, test_tedx, test_wild |
| Spanish | train, test_tedx, test_wild |
| Chinese | train, test_tedx, test_wild |
| Arabic | train (coming soon), test_tedx, test_wild |

The raw data should be structured as follows:

Data/
├── Chinese/
│ ├── video_id.mp4
│ └── ...
├── Arabic/
│ ├── video_id.mp4
│ └── ...
├── French/
│ ├── video_id.mp4
│ └── ...
├── Spanish/
│ ├── video_id.mp4
│ └── ...
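
For illustration only, the sketch below downloads the raw videos for one language. It assumes the provided video lists are plain-text files with one YouTube video ID per line and uses yt-dlp as the downloader; the list file name and output folder are hypothetical, so adapt them to however the released lists are actually formatted.

```python
# Hedged sketch: download raw videos for one language with yt-dlp.
# Assumes a plain-text list with one YouTube video ID per line; the
# list file name and output folder below are hypothetical examples.
import subprocess
from pathlib import Path

def download_language(id_list: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for video_id in Path(id_list).read_text().split():
        subprocess.run(
            [
                "yt-dlp",
                f"https://www.youtube.com/watch?v={video_id}",
                "-o", str(out / f"{video_id}.%(ext)s"),
                "--merge-output-format", "mp4",
            ],
            check=False,  # skip videos that are no longer available
        )

if __name__ == "__main__":
    download_language("video_lists/french.txt", "Data/French")
```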

Setup:

  1. Set up the environment and repo:
conda create --name visper python=3.10
conda activate visper
git clone https://github.com/YasserdahouML/visper
cd visper
  2. Install fairseq within the repository:
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
cd ..
  3. Install PyTorch (tested PyTorch version: v2.2.2) and other packages:
pip install torch torchvision torchaudio
pip install pytorch-lightning
pip install sentencepiece
pip install av
pip install hydra-core --upgrade
  4. Install ffmpeg:
conda install "ffmpeg<5" -c conda-forge
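
After installation, a quick sanity check such as the following (a minimal sketch; the module list simply mirrors the packages installed above) confirms that everything imports cleanly and that the tested PyTorch version is in place:

```python
# Quick environment sanity check (minimal sketch).
import torch
import torchvision
import torchaudio
import pytorch_lightning
import sentencepiece
import av
import hydra

print("torch:", torch.__version__)            # tested with v2.2.2
print("CUDA available:", torch.cuda.is_available())
```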

Processing the data:

You need to download the metadata from Huggingface🤗; this includes train.tar.gz and test.tar.gz. Then use the provided metadata to process the raw data and create the ViSpeR dataset. You can use crop_videos.py to process the data; note that all clips are cropped and transformed in this step. The available metadata splits per language are:

| Language | Splits |
|---|---|
| French | train, test |
| Spanish | train, test |
| Chinese | train, test |
| Arabic | train (coming soon), test |

python data_prepare/crop_videos.py --video_dir [path_to_data_language] --save_path [save_path_language] --json_path [language_metadata_path] --use_ffmpeg True

The processed ViSpeR data will then be structured as follows:

ViSpeR/
├── Chinese/
│ ├── video_id/
│ │  │── 00001.mp4
│ │  │── 00001.json
│ └── ...
├── Arabic/
│ ├── video_id/
│ │  │── 00001.mp4
│ │  │── 00001.json
│ └── ...
├── French/
│ ├── video_id/
│ │  │── 00001.mp4
│ │  │── 00001.json
│ └── ...
├── Spanish/
│ ├── video_id/
│ │  │── 00001.mp4
│ │  │── 00001.json
│ └── ...

Each file video_id/xxxx.json contains the 'label' (transcription) of the corresponding clip video_id/xxxx.mp4.
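
For example, a minimal sketch that collects (clip, label) pairs from the processed layout above, assuming each xxxx.json carries the transcription under the 'label' key as described:

```python
# Hedged sketch: collect (clip path, label) pairs from the processed layout.
# Assumes each xxxx.json holds the transcription under the 'label' key.
import json
from pathlib import Path

def collect_pairs(visper_root: str, language: str):
    pairs = []
    for json_file in sorted(Path(visper_root, language).rglob("*.json")):
        clip = json_file.with_suffix(".mp4")
        label = json.loads(json_file.read_text(encoding="utf-8"))["label"]
        pairs.append((clip, label))
    return pairs

print(len(collect_pairs("ViSpeR", "French")))
```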

For English, refer to LRS3 and VoxCeleb-en.

Multilingual ViSpeR

The processed multilingual VSR video-text pairs are used to train a multilingual encoder-decoder model in a fully supervised manner. The supported languages are English, Arabic, French, Spanish and Chinese. For English, we leverage the combined 1,759 hours from LRS3 and VoxCeleb-en. The encoder has 12 layers and the decoder has 6 layers; the hidden size, MLP dimension and number of attention heads are set to 768, 3072 and 12, respectively. A unigram tokenizer is learned on all languages with a vocabulary size of 21k.
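
As a rough illustration of these dimensions only (not the repository's actual model definition), a vanilla PyTorch Transformer with the stated sizes would look like:

```python
# Illustrative sketch of the stated dimensions using a vanilla PyTorch
# Transformer; this is not the repository's actual model code.
import torch.nn as nn

VOCAB_SIZE = 21_000  # unigram tokenizer vocabulary shared across languages

model = nn.Transformer(
    d_model=768,            # hidden size
    nhead=12,               # attention heads
    num_encoder_layers=12,  # encoder depth
    num_decoder_layers=6,   # decoder depth
    dim_feedforward=3072,   # MLP dimension
    batch_first=True,
)
output_head = nn.Linear(768, VOCAB_SIZE)  # projects decoder states to the vocabulary
```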

Results (WER for French, Spanish, Arabic and English; CER for Chinese):

| Language | VSR (WER/CER) | AVSR (WER/CER) |
|---|---|---|
| French | 29.8 | 5.7 |
| Spanish | 39.4 | 4.4 |
| Arabic | 47.8 | 8.4 |
| Chinese | 51.3 (CER) | 15.4 (CER) |
| English | 49.1 | 8.1 |

Model weights can be found on Huggingface🤗:

| Languages | Task | Size | Checkpoint |
|---|---|---|---|
| en, fr, es, ar, zh | AVSR | Base | visper_avsr_base.pth |
| en, fr, es, ar, zh | VSR | Base | visper_vsr_base.pth |

Evaluation

Run evaluation on the videos using

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python infer.py \
ckpt_path=visper_vsr_base.pth \
data.modality=video infer_path=/path/to/files.npy \
infer_lang=[LANG]

For evaluation with the AVSR model, set data.modality=audiovisual and ckpt_path=visper_avsr_base.pth in the command above. [LANG] should be set to one of the five languages (arabic, chinese, french, spanish or english).
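
The infer_path argument points to a .npy file listing the inputs to evaluate. Assuming it is simply a NumPy array of clip paths (an assumption; check the repository's infer.py for the exact expected format), such a file could be built as follows:

```python
# Hedged sketch: build a files.npy list for infer_path.
# Assumption: infer.py expects a NumPy array of clip paths; check infer.py
# for the exact format it actually reads.
from pathlib import Path
import numpy as np

clips = sorted(str(p) for p in Path("ViSpeR/French").rglob("*.mp4"))
np.save("files.npy", np.array(clips))
```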

To test on English, please get the data from WildVSR-en.

Intended Use

This dataset can be used to train models for visual speech recognition. It's particularly useful for research and development purposes in the field of audio-visual content processing. The data can be used to assess the performance of current and future models.

Limitations and Biases

Due to the data collection process focusing on YouTube, biases inherent to the platform may be present in the dataset. Also, while measures are taken to ensure diversity in content, the dataset might still be skewed towards certain types of content due to the filtering process.

Acknowledgement

This repository is built using the espnet, fairseq, auto_avsr and avhubert repositories.

Citation

@article{narayan2024visper,
  title={ViSpeR: Multilingual Audio-Visual Speech Recognition},
  author={Narayan, Sanath and Djilali, Yasser Abdelaziz Dahou and Singh, Ankit and Bihan, Eustache Le and Hacid, Hakim},
  journal={arXiv preprint arXiv:2406.00038},
  year={2024}
}

Check our VSR related works

@inproceedings{djilali2023lip2vec,
  title={Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping},
  author={Djilali, Yasser Abdelaziz Dahou and Narayan, Sanath and Boussaid, Haithem and Almazrouei, Ebtessam and Debbah, Merouane},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={13790--13801},
  year={2023}
}

@inproceedings{djilali2024vsr,
  title={Do VSR Models Generalize Beyond LRS3?},
  author={Djilali, Yasser Abdelaziz Dahou and Narayan, Sanath and LeBihan, Eustache and Boussaid, Haithem and Almazrouei, Ebtesam and Debbah, Merouane},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={6635--6644},
  year={2024}
}
