This is the code for the silent speech interface paper "Creating Song from Lip and Tongue Videos with a Convolutional Vocoder".
To get started, create the out/, data/ and log/ directories as follows:
mkdir out/ data/ log/
We have published our preprocessed dataset. Download the data from here and move all .tar.gz files to ./data/
Then extract all archives:
cd SSI_DL/data
tar -xzf songs_audio.tar.gz
tar -xzf resize_lips.part1.tar.gz
tar -xzf resize_lips.part1.tar.gz
mkdir ../out/resize_lips
mv part_*/* ../out/resize_lips
tar -xzf resize_tongue.part1.tar.gz
tar -xzf resize_tongue.part1.tar.gz
mkdir ../out/resize_tongue
mv part_*/* ../out/resize_tongue
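If you prefer to do the extraction from Python (for example on a system without `tar`), the shell steps above can be sketched roughly as follows. This is a minimal sketch, not part of the repository: the helper name `extract_and_merge` and its parameters are hypothetical, and unlike the shell commands it also removes the emptied `part_*` directories.

```python
import glob
import os
import shutil
import tarfile


def extract_and_merge(archives, dest_dir, work_dir="."):
    """Extract each .tar.gz into work_dir, then move part_*/* into dest_dir.

    Hypothetical helper mirroring `tar -xzf ...` followed by
    `mv part_*/* <dest_dir>`; returns the names of the moved entries.
    """
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(work_dir)

    os.makedirs(dest_dir, exist_ok=True)
    moved = []
    for part in sorted(glob.glob(os.path.join(work_dir, "part_*"))):
        if not os.path.isdir(part):
            continue
        for name in sorted(os.listdir(part)):
            shutil.move(os.path.join(part, name), os.path.join(dest_dir, name))
            moved.append(name)
        os.rmdir(part)  # the part directory is empty after the move
    return moved
```

Run from inside the data directory, a call would look like `extract_and_merge(["resize_lips.part1.tar.gz"], "../out/resize_lips")`. Only extract archives you trust, since `extractall` writes whatever paths the archive contains.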
Downsample the audio and EGG signals to 16 kHz and 10.025 kHz, then extract the LSF and F0 features:
python utils/downsample.py
python utils/audio2lsf.py
python extract_f0.py
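To give an idea of what downsampling to these rates involves, the sketch below computes the integer up/down factors a polyphase resampler would need. It is only an illustration and not taken from utils/downsample.py: the 44.1 kHz source rate is an assumption (the original recording rate is not stated here), and the function names are hypothetical.

```python
from fractions import Fraction


def resample_factors(src_rate, dst_rate):
    """Return the smallest integer (up, down) pair with dst = src * up / down."""
    ratio = Fraction(dst_rate, src_rate)
    return ratio.numerator, ratio.denominator


def resampled_length(n_samples, src_rate, dst_rate):
    """Number of output samples, rounded up, after rational resampling."""
    up, down = resample_factors(src_rate, dst_rate)
    return -(-n_samples * up // down)  # ceiling division


# Assumed source rate of 44100 Hz; target rates taken from the step above.
for target in (16000, 10025):
    print(target, resample_factors(44100, target))
```

With factors like these, a routine such as `scipy.signal.resample_poly(x, up, down)` could perform the actual filtering and rate change.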
Train CNN models to predict the LSF coefficients, F0 and U/V flag from lip and tongue images:
python train_lsf.py
python train_f0.py
python train_uv.py
Train the CNN vocoder to generate audio from the LSF, F0 and U/V flag:
python train_cnn_vocoder.py
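As a rough picture of the vocoder input, the sketch below stacks per-frame LSF coefficients, F0 and a U/V (voiced/unvoiced) flag into one feature vector per frame. This is an assumption for illustration, not the layout used by train_cnn_vocoder.py: the function names are hypothetical, and it assumes unvoiced frames are marked with F0 = 0.

```python
def uv_flags(f0_track):
    """Return 1.0 for voiced frames (F0 > 0) and 0.0 for unvoiced ones.

    Assumes the convention that unvoiced frames carry F0 = 0.
    """
    return [1.0 if f0 > 0 else 0.0 for f0 in f0_track]


def stack_features(lsf_frames, f0_track):
    """Concatenate [LSF..., F0, U/V] for every frame (hypothetical layout)."""
    if len(lsf_frames) != len(f0_track):
        raise ValueError("LSF and F0 tracks must have the same number of frames")
    flags = uv_flags(f0_track)
    return [list(lsf) + [f0, uv]
            for lsf, f0, uv in zip(lsf_frames, f0_track, flags)]


# Two toy frames with 2 LSF coefficients each: one voiced, one unvoiced.
features = stack_features([[0.1, 0.2], [0.3, 0.4]], [120.0, 0.0])
```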
Synthesize audio from lip and tongue images:
python test_cnn_vocode.py
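The steps above have to run in order, so it can be convenient to chain them. The wrapper below is a minimal sketch and not part of the repository: `run_pipeline` is a hypothetical name, the script paths are copied from the commands above, and it assumes you run it from the repository root. By default it only prints the commands.

```python
import subprocess
import sys

# Script order copied from the steps above.
PIPELINE = [
    "utils/downsample.py",
    "utils/audio2lsf.py",
    "extract_f0.py",
    "train_lsf.py",
    "train_f0.py",
    "train_uv.py",
    "train_cnn_vocoder.py",
    "test_cnn_vocode.py",
]


def run_pipeline(scripts=PIPELINE, dry_run=True):
    """Run each script with the current interpreter, stopping on the first failure.

    With dry_run=True (the default) the commands are only printed.
    """
    commands = [[sys.executable, script] for script in scripts]
    for cmd in commands:
        print("Running:", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return commands
```

Call `run_pipeline(dry_run=False)` to actually execute the steps. Training can take a long time, so you may prefer to pass a shorter `scripts` list and run the stages separately.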
If you find this code helpful, please cite:
@article{zhang2021creating,
title={Creating song from lip and tongue videos with a convolutional vocoder},
author={Zhang, Jianyu and Roussel, Pierre and Denby, Bruce},
journal={IEEE Access},
volume={9},
pages={13076--13082},
year={2021},
publisher={IEEE}
}