voice2voice

Parallel data voice conversion based on the pix2pix architecture.

License: MIT

Summary

A non-conditional GAN system (neither the generator nor the discriminator is conditioned) based on the pix2pix architecture. The aim is to reconstruct the speech of a source speaker with the voice of a target speaker. The models are not conditioned because the source and target audio are non-linearly misaligned (for example, the two speakers speak at different speeds), so a meaningful conditional mapping cannot be learned from the paired inputs.

Data

We trained and tested the system on the Voice Conversion Challenge 2018 data. For each (source, target) pair of audio samples (two different speakers uttering the same sentence) we compute the Mel spectrograms, so that each sample becomes a single-channel 256x256 image. These are the inputs to both the generator and the discriminator.
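The exact preprocessing parameters are not stated in this README, so the following is only a sketch of how a 256x256 log-Mel spectrogram could be computed with plain numpy; the FFT size, hop length, sample rate, and log compression are all assumptions, and a real pipeline would more likely use a library such as librosa.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with centers spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(y, sr=16000, n_fft=1024, hop=256, n_mels=256, n_frames=256):
    # Short-time power spectrum via a sliding Hann window.
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, max(len(y) - n_fft, 1), hop):
        seg = y[start:start + n_fft]
        if len(seg) < n_fft:
            seg = np.pad(seg, (0, n_fft - len(seg)))
        frames.append(np.abs(np.fft.rfft(seg * window)) ** 2)
    spec = np.array(frames).T                       # (n_fft // 2 + 1, time)
    mel = mel_filterbank(sr, n_fft, n_mels) @ spec  # (n_mels, time)
    logmel = np.log1p(mel)                          # log compression
    # Pad or crop the time axis so every sample becomes a fixed-size image.
    if logmel.shape[1] < n_frames:
        logmel = np.pad(logmel, ((0, 0), (0, n_frames - logmel.shape[1])))
    else:
        logmel = logmel[:, :n_frames]
    return logmel
```

With `n_mels=256` and `n_frames=256`, every utterance maps to a single-channel 256x256 array, matching the image shape the networks expect.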

[Figure: source and target Mel spectrograms, side by side]

Note how the data is misaligned: the speakers have a different cadence while speaking, and sometimes there is even a pause in one of the samples but not in the other.

Details

The architecture and training hyperparameters are the same as in the original pix2pix paper, but we replace the batch normalization layers with instance normalization layers in both the generator and the discriminator, as suggested here. We also use mean squared error as the adversarial loss (as in LSGAN), as suggested here.
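The two substitutions above can be sketched in plain numpy. This is illustrative only, not the repository's actual implementation: instance normalization normalizes each (sample, channel) feature map over its spatial dimensions instead of pooling statistics over the batch, and the least-squares adversarial loss replaces the usual cross-entropy with a mean squared error against the real/fake labels.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    # x: (batch, channels, height, width). Each (sample, channel) feature map
    # is normalized independently, unlike batch norm, which shares statistics
    # across the whole batch.
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def lsgan_d_loss(d_real, d_fake):
    # Least-squares discriminator loss: push outputs on real patches
    # toward 1 and outputs on fake patches toward 0.
    return 0.5 * (np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def lsgan_g_loss(d_fake):
    # The generator tries to make the discriminator output 1 on fakes.
    return 0.5 * np.mean((d_fake - 1.0) ** 2)
```

In a deep-learning framework these would simply be `InstanceNorm2d` layers and an `MSELoss` criterion; the snippet only makes the math behind the two swaps explicit.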

Dependencies

Examples

[Figure: source, target, and fake Mel spectrograms for three example pairs]


Languages

Jupyter Notebook 100.0%