We want to predict a person's voice from their face.
Specifically, we use a 64x64 RGB face image to predict (generate) the mel spectrogram of the corresponding voice.
The mel spectrogram can then be converted back into raw audio (around 0.5 seconds).
We train on the VoxCeleb1 dataset.
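The mel-to-audio inversion is typically done with an off-the-shelf routine (e.g. librosa); purely as an illustration of the idea, here is a minimal numpy Griffin-Lim sketch. It recovers phase from a magnitude-only spectrogram; in the full pipeline the linear magnitude would first be recovered from the mel spectrogram (e.g. via the pseudo-inverse of the mel filterbank). The `n_fft`/`hop` values are illustrative assumptions, not the project's actual settings.

```python
import numpy as np

def stft(x, n_fft, hop, window):
    # Short-time Fourier transform: returns a (1 + n_fft//2, frames) complex array.
    frames = 1 + (len(x) - n_fft) // hop
    spec = np.empty((n_fft // 2 + 1, frames), dtype=complex)
    for t in range(frames):
        spec[:, t] = np.fft.rfft(window * x[t * hop : t * hop + n_fft])
    return spec

def istft(spec, n_fft, hop, window):
    # Overlap-add inverse STFT with window-squared normalization.
    frames = spec.shape[1]
    x = np.zeros(hop * (frames - 1) + n_fft)
    norm = np.zeros_like(x)
    for t in range(frames):
        x[t * hop : t * hop + n_fft] += window * np.fft.irfft(spec[:, t])
        norm[t * hop : t * hop + n_fft] += window ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_fft=512, hop=128, n_iter=32):
    # Iteratively estimate a phase that is consistent with the given magnitude.
    rng = np.random.default_rng(0)
    angles = np.exp(2j * np.pi * rng.random(mag.shape))
    window = np.hanning(n_fft)
    for _ in range(n_iter):
        rebuilt = stft(istft(mag * angles, n_fft, hop, window), n_fft, hop, window)
        angles = np.exp(1j * np.angle(rebuilt))
    return istft(mag * angles, n_fft, hop, window)

# Round-trip demo on a 0.5 s test tone at 16 kHz.
sr = 16000
t = np.arange(sr // 2) / sr
mag = np.abs(stft(np.sin(2 * np.pi * 440 * t), 512, 128, np.hanning(512)))
audio = griffin_lim(mag)
```

In practice a dedicated implementation such as `librosa.feature.inverse.mel_to_audio` handles both the mel-filterbank inversion and the Griffin-Lim step in one call.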
We designed two architectures to achieve our goal, one autoencoder-based and one GAN-based.
However, only the GAN-based model produces usable results.
For the detailed architecture and training hyperparameters, please refer to the code.
This architecture is adapted from the model in Speech2Face [1].
We use a deep convolutional GAN (DCGAN) conditioned on the face image to generate the mel spectrogram.
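A minimal PyTorch sketch of this conditioning scheme, not the actual repo code: a small face encoder compresses the 64x64 image into an embedding, which is concatenated with a noise vector and fed to a DCGAN-style stack of transposed convolutions. All layer sizes, the noise dimension, and the 64x64 mel output shape are illustrative assumptions; the real hyperparameters are in the code.

```python
import torch
import torch.nn as nn

class FaceEncoder(nn.Module):
    # Encodes a 64x64 RGB face image into a conditioning vector.
    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.LeakyReLU(0.2),    # 64 -> 32
            nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),   # 32 -> 16
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),  # 16 -> 8
            nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2), # 8 -> 4
            nn.Flatten(),
            nn.Linear(256 * 4 * 4, embed_dim),
        )

    def forward(self, face):
        return self.net(face)

class MelGenerator(nn.Module):
    # DCGAN-style generator: noise + face embedding -> 1x64x64 mel spectrogram.
    def __init__(self, noise_dim=100, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(noise_dim + embed_dim, 256, 4, 1, 0),  # 1 -> 4
            nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, 2, 1),                    # 4 -> 8
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),                     # 8 -> 16
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),                      # 16 -> 32
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, 2, 1),                       # 32 -> 64
            nn.Tanh(),
        )

    def forward(self, noise, face_embed):
        # Concatenate along channels, then reshape to a 1x1 "image" for deconvs.
        z = torch.cat([noise, face_embed], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(z)

enc = FaceEncoder()
gen = MelGenerator()
faces = torch.randn(2, 3, 64, 64)   # batch of face images
noise = torch.randn(2, 100)
mel = gen(noise, enc(faces))        # shape: (2, 1, 64, 64)
```

The discriminator (not shown) would mirror the encoder, taking the generated mel spectrogram together with the face condition and scoring real versus fake pairs.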
In addition to producing raw audio directly, we also tried combining our output with an existing TTS system, Real-Time-Voice-Cloning [2].
The files in the following folders are the results of our experiments. Files named with a single number are the generated raw audio clips, and files named with two numbers are the corresponding speech audio produced by the TTS system.
[1] Speech2Face: Learning the Face Behind a Voice, CVPR 2019
[2] Real-Time-Voice-Cloning, https://github.com/CorentinJ/Real-Time-Voice-Cloning