alexbrx / emo-clone

Deep generative model for emotional speech


Disentangled Representation Learning and Generative Adversarial Networks for Emotional Voice Cloning

Motivation

  • Given a recorded speech sample, we would like to generate new samples with some qualitative aspects, such as the speaker's voice timbre, prosody or emotion, altered.
  • Naive application of state-of-the-art GANs for image style transfer does not deliver good results, because such models are generally not well suited to sequential data like speech.
  • The goal is to design an adversarially trained network capable of generating high-quality speech samples in this setting.

Intuition

  • SpeechSplit is an autoencoder neural network which decomposes speech into disentangled latent representations corresponding to four main perceptual aspects of speech: pitch, rhythm, lexical content and the speaker's voice timbre.
  • The latents can be synthesized back into speech, hence style transfer may be possible by simply generating and substituting some of the latents and then synthesizing the altered samples.
  • The authors of SpeechSplit confirm that this method works when the latents are swapped between parallel utterances (i.e. actual recordings of different people uttering the same sentence); a minimal sketch of this latent-swapping idea follows this list.
  • We show that some latents can also be successfully generated by a relatively simple GAN, which yields a significant sample quality improvement over baseline end-to-end GANs in our voice style transfer task.
  • In other words, the SpeechSplit autoencoder is used in our proposed model to simplify the structure of the data so that it can be more easily captured by a GAN.
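Below is a minimal sketch of the latent-swapping idea, assuming a hypothetical SpeechSplit wrapper with encode/decode methods and placeholder tensors in place of real features; the module names, checkpoint path and tensor shapes are assumptions, not this repository's exact API.

import torch

# Placeholder tensors standing in for real features; shapes are illustrative only.
mel_a, f0_a = torch.randn(1, 128, 80), torch.randn(1, 128, 257)   # utterance by speaker A
mel_b, f0_b = torch.randn(1, 128, 80), torch.randn(1, 128, 257)   # same sentence by speaker B
spk_emb_b = torch.randn(1, 256)                                   # speaker B embedding (e.g. from Resemblyzer)

# Hypothetical wrapper around the SpeechSplit autoencoder; encode/decode names
# and the checkpoint path are assumptions for illustration.
from speechsplit import SpeechSplit
model = SpeechSplit.from_pretrained("speechsplit.ckpt").eval()

with torch.no_grad():
    content_a, rhythm_a, pitch_a = model.encode(mel_a, f0_a)
    content_b, rhythm_b, pitch_b = model.encode(mel_b, f0_b)
    # Keep A's lexical content and rhythm, borrow B's pitch code and speaker
    # embedding: the decoded mel carries A's words with B's voice and intonation.
    mel_swapped = model.decode(content_a, rhythm_a, pitch_b, spk_emb_b)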

Model Overview

  1. A mel-spectrogram and pitch contour are extracted from the raw waveform.
  2. Resemblyzer (an independent neural network trained on a speaker verification task) computes a speaker embedding (a vector giving a high-level representation of the speaker's voice) from the mel-spectrogram.
  3. Style codes for pitch, rhythm and lexical content are provided by the SpeechSplit encoder.
  4. New latent representations for pitch and rhythm are generated by CodeGAN, whereas new speaker embeddings are sampled from VoiceGAN.
  5. The SpeechSplit decoder synthesizes the output mel-spectrogram from the speaker embedding and the pitch, rhythm and content codes.
  6. Mel-spectrograms are converted to the output waveform by WaveGlow (the whole pipeline is sketched in code below this list).
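The sketch below walks through these six steps end to end. The resemblyzer package and its VoiceEncoder are real; the remaining imports (wav_to_mel, wav_to_f0, SpeechSplit, CodeGAN, VoiceGAN, WaveGlowVocoder) and the checkpoint paths are hypothetical wrappers used only to show the data flow, not this repository's exact API.

import torch
from resemblyzer import VoiceEncoder, preprocess_wav    # real package, used for speaker embeddings

# Hypothetical wrappers used only to illustrate the data flow.
from features import wav_to_mel, wav_to_f0
from speechsplit import SpeechSplit
from gans import CodeGAN, VoiceGAN
from vocoder import WaveGlowVocoder

wav = preprocess_wav("input.wav")                         # raw waveform (step 1)
mel, f0 = wav_to_mel(wav), wav_to_f0(wav)                 # mel-spectrogram + pitch contour (step 1)

spk_emb = VoiceEncoder().embed_utterance(wav)             # speaker embedding (step 2)

speechsplit = SpeechSplit.from_pretrained("speechsplit.ckpt").eval()
content, rhythm, pitch = speechsplit.encode(              # pitch/rhythm/content codes (step 3)
    torch.from_numpy(mel)[None], torch.from_numpy(f0)[None])

pitch_new, rhythm_new = CodeGAN.from_pretrained("codegan.ckpt").sample()   # generated codes (step 4)
spk_emb_new = VoiceGAN.from_pretrained("voicegan.ckpt").sample()           # generated embedding (step 4)

mel_out = speechsplit.decode(content, rhythm_new, pitch_new, spk_emb_new)  # output mel-spectrogram (step 5)
wav_out = WaveGlowVocoder.from_pretrained("waveglow.ckpt").infer(mel_out)  # output waveform (step 6)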

[Model architecture diagram]

Dependencies

See requirements.txt

Run Pretrained Model

git clone https://github.com/alexbrx/emo-clone.git
cd emo-clone
bash setup.sh
pip install -r requirements.txt
python utils/fake_cvoice_samples.py

Datasets

  • IEMOCAP
  • VCTK
  • Common Voice
  • LJSpeech

Training

  • SpeechSplit is trained on VCTK and fine-tuned on IEMOCAP, after which its weights are frozen.
  • Latent codes are computed by SpeechSplit for IEMOCAP samples, on which CodeGAN is subsequently trained (see the training sketch after this list).
  • A dataset of speaker embeddings is created from the Common Voice dataset, on which VoiceGAN is subsequently trained.
  • The WaveGlow vocoder is independently trained on LJSpeech to convert mel-spectrograms into waveforms.
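As a rough illustration of the CodeGAN stage, the sketch below trains a vanilla GAN on precomputed latent codes. It treats each code as a flat vector, and the file name, dimensions, network sizes and losses are assumptions; the repository's actual architecture will differ.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Precomputed SpeechSplit codes for IEMOCAP, assumed saved as an (N, code_dim) tensor.
codes = torch.load("iemocap_codes.pt")
loader = DataLoader(TensorDataset(codes), batch_size=64, shuffle=True)

code_dim, z_dim = codes.size(1), 128
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, code_dim))
D = nn.Sequential(nn.Linear(code_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()

for epoch in range(100):
    for (real,) in loader:
        z = torch.randn(real.size(0), z_dim)
        fake = G(z)

        # Discriminator step: distinguish real codes from generated ones.
        loss_d = bce(D(real), torch.ones(real.size(0), 1)) \
               + bce(D(fake.detach()), torch.zeros(real.size(0), 1))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # Generator step: produce codes the discriminator accepts as real.
        loss_g = bce(D(fake), torch.ones(real.size(0), 1))
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()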

Related Repos
