SpeakerDiff is a versatile probabilistic model that generates high-quality speech samples from LibriSpeech-based speaker embeddings. We demonstrate the effectiveness of the denoising diffusion probabilistic method in preserving feature information in the speech while effectively anonymizing identifying information.
Denoising diffusion probabilistic models (DDPMs) are a promising class of generative models that use a Markov chain to gradually convert an isotropic Gaussian distribution into a complex data distribution. Diffusion models balance the trade-off between flexibility and tractability. We adapt the diffusion model proposed by Jonathan Ho et al. by modifying the variance schedule and applying the entire mechanism to speaker embeddings. The forward diffusion process follows a fixed noise-adding schedule with no learned parameters, which is what the denoiser learns to invert to recover salient features.
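The fixed forward process can be sketched as follows. This is a minimal illustration, assuming a linear variance schedule in the style of Ho et al.; the endpoints (`1e-4`, `0.02`) and the number of steps `T` are illustrative defaults, not necessarily the values used in this repo.

```python
# Sketch of the DDPM forward (noising) process on a speaker embedding.
# Assumed hyperparameters: T = 1000 steps, linear betas in [1e-4, 0.02].
import numpy as np

T = 1000                               # number of diffusion steps (assumption)
betas = np.linspace(1e-4, 0.02, T)     # linear variance schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)         # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form (no learned parameters)."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal(128)          # a 128-dimensional speaker embedding
xT = q_sample(x0, T - 1, rng)          # near-isotropic Gaussian at t = T-1
```

By the final step, `alpha_bar` is close to zero, so `xT` is essentially pure Gaussian noise; the learned reverse process walks this chain backwards to generate new embeddings.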
Experiments were run on a 3.3 GHz AMD EPYC 7002 series CPU. Python 3.8 is required; install the CPU dependencies from `requirements.txt`:

```
pip3 install -r requirements.txt
```
There are three types of embeddings generated from the LibriSpeech corpus:
- 64-dimensional, with 19k samples
- 128-dimensional, with 49k samples
- 704-dimensional, with 5k samples
```
python3 main.py
```
The Linear and UNet models are defined in `model.py` and can be modified as required. The UNet model architecture is shown below.
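As a rough idea of what a Linear denoiser for fixed-size speaker embeddings can look like, here is a hypothetical PyTorch sketch. The layer sizes, activation, and the scalar timestep feature are all assumptions for illustration; they are not the actual contents of `model.py`.

```python
# Hypothetical sketch of an MLP ("Linear") denoiser for speaker embeddings.
# It takes a noised embedding x_t and timestep t, and predicts the added noise.
import torch
import torch.nn as nn

class LinearDenoiser(nn.Module):
    def __init__(self, dim=128, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden),  # embedding plus a scalar timestep
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, dim),      # predicts epsilon, same size as input
        )

    def forward(self, x_t, t):
        # Normalize t to [0, 1] and append it as one extra feature (assumption;
        # a sinusoidal timestep embedding is another common choice).
        t_feat = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([x_t, t_feat], dim=-1))

model = LinearDenoiser(dim=128)
x_t = torch.randn(4, 128)                # a batch of noised embeddings
t = torch.randint(0, 1000, (4,))
eps_pred = model(x_t, t)                 # same shape as the embeddings
```

A UNet variant would replace the plain stack with down/up-sampling blocks and skip connections, which is why both are kept in the same `model.py` behind a common interface.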
These audio samples were generated by passing the generated embeddings to a TTS engine.
Female Voice:
Male Voice:
t-SNE plot of generated and original data points: red points represent the original data points in the distribution, and blue points are generated data points.
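A plot like this can be reproduced with scikit-learn's `TSNE`; this is an assumed tooling choice for illustration, and the random embeddings below are stand-ins for the real original and generated data.

```python
# Sketch of the t-SNE comparison between original and generated embeddings.
# The data here is synthetic; swap in the real arrays to reproduce the plot.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
original = rng.standard_normal((50, 128))   # stand-in for original embeddings
generated = rng.standard_normal((50, 128))  # stand-in for generated embeddings

points = np.vstack([original, generated])
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(points)

# The first 50 rows map to the original (red) points, the rest to the
# generated (blue) points when scattered in 2D.
orig_2d, gen_2d = coords[:50], coords[50:]
```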
[1] Denoising Diffusion Probabilistic Models
[2] Denoising Diffusion Implicit Models
[3] Diffusion Models Beat GANs on Image Synthesis