wangyang199609 / VoViT

VoViT: Low Latency Graph-based Audio-Visual VoiceSeparation Transformer

Home Page:https://ipcv.github.io/VoViT/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

VoViT: Low Latency Graph-based Audio-Visual VoiceSeparation Transformer

Project Page
Arxiv Paper

Paper Under review. Code will be uploaded upon paper acceptance.

Citation

@inproceedings{montesinos2022vovit,
  title={VoVIT: Low Latency Graph-Based Audio-Visual Voice Sseparation Transformer},
  author={Montesinos, Juan F. and Kadandale, Venkatesh S. and Haro, Gloria},
  booktitle={Arxiv preprint arXiv:2203.04099},
  year={2022}
}

Installation

Download the repo and the face landmark extractor library:

https://github.com/JuanFMontesinos/VoViT
cd VoViT
git clone https://github.com/cleardusk/3DDFA_V2
cd 3DDFA_V2
sh ./build.sh
cd ..

Requirements

The core computations (the model itself) depends on python, pytorch, einops and torchaudio. To run demos and visualizations many other libraries are required.

Note: Currently, only the offline computation is supported in a user-friendly way.

In case of incompatibilities due to future updates, the tested commit is:
https://github.com/cleardusk/3DDFA_V2/tree/1b6c67601abffc1e9f248b291708aef0e43b55ae

Running a demo

Demos are located in the demo_samples folder.
Running on interview.mp4 example:

python preprocessing_interview.py
python inference_interview.py

Latency

Preprocessing Inference Preprocessing + Inference
Graph Network Whole model
VoViT-s1 17.95 4.50 52.21 82.18
VoViT 17.95 4.55 57.45 93.31
VoViT-s1 fp16 10.94 2.88 30.47 52.43
VoViT fp16 10.94 2.86 34.18 46.14

Latency estimation for the different variants of VoViT. Average of 10 runs, batch size 100. Device: Nvidia RTX 3090. GPU utilization >98%, memory on demand. Two forward passed done to warm up. Timing corresponds to ms to process 10s of audio

Note: Pytorch version is no longer supporting complex32 dtype in pytorch 1.11

About

VoViT: Low Latency Graph-based Audio-Visual VoiceSeparation Transformer

https://ipcv.github.io/VoViT/


Languages

Language:Python 91.1%Language:C 8.9%Language:Shell 0.1%