Akshat4112 / SpeakerDiff

SpeakerDiff: Denoising Diffusion Probabilistic Models on Speaker Embeddings

SpeakerDiff: Speaker Embedding Generation using Denoising Diffusion Probabilistic Models

SpeakerDiff is a versatile probabilistic model that generates speaker embeddings, trained on embeddings derived from the LibriSpeech corpus; passing the generated embeddings to a TTS engine yields high-quality speech samples. We demonstrate that the denoising diffusion probabilistic method preserves the feature information in the speech while anonymizing identifying information.

Denoising Diffusion Probabilistic Models

Denoising diffusion probabilistic models (DDPMs) are a promising class of generative models that use a Markov chain to gradually convert an isotropic Gaussian distribution into a complex data distribution. Diffusion models balance the trade-off between flexibility and tractability. We adapt the diffusion model proposed by Jonathan Ho et al. [1] by modifying the variance scheduler and applying the entire mechanism to speaker embeddings. The forward process follows a fixed noise-adding schedule with no learned parameters; the model learns only to reverse it.

Forward Process

Forward Noise
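
As a rough illustration of the forward process, the sketch below applies the closed-form noising step from Ho et al., x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, to a random vector standing in for a 64-dimensional speaker embedding. The linear schedule, its endpoints (1e-4 to 0.02), the 1000 steps, and all function names are assumptions borrowed from the original DDPM paper for illustration only; this project modifies the variance scheduler, so the schedule in main.py may differ.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (assumed values, not the repo's)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + sigma * n for x, n in zip(x0, noise)]
    return xt, noise


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

# Random stand-in for a 64-dimensional speaker embedding (hypothetical data).
x0 = [rng.gauss(0.0, 1.0) for _ in range(64)]

x_early, _ = q_sample(x0, 10, alpha_bars, rng)    # still close to x0
x_late, _ = q_sample(x0, 999, alpha_bars, rng)    # nearly pure Gaussian noise
```

Because alpha_bar_t shrinks toward zero as t grows, x_late retains almost none of the original embedding, which is what lets sampling start from an isotropic Gaussian.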

Backward Process

Backward Noise
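
The backward process can be sketched in the same spirit. The code below is a minimal version of the ancestral sampling step from Ho et al., with the variance fixed to sigma_t^2 = beta_t. It assumes a linear schedule and uses a hypothetical `zero_noise_predictor` in place of the trained Linear or UNet model from model.py; every name here is illustrative and is not taken from the repository.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return (betas, alpha_bars) for a linear schedule (an assumed choice)."""
    step = (beta_end - beta_start) / max(num_steps - 1, 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def p_sample_step(xt, t, betas, alpha_bars, predict_noise, rng):
    """One reverse step x_t -> x_{t-1}, using sigma_t^2 = beta_t."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    eps_hat = predict_noise(xt, t)
    coef = beta_t / math.sqrt(1.0 - alpha_bars[t])
    mean = [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(xt, eps_hat)]
    if t == 0:
        return mean  # no noise is added on the final step
    sigma_t = math.sqrt(beta_t)
    return [m + sigma_t * rng.gauss(0.0, 1.0) for m in mean]


def sample(dim, betas, alpha_bars, predict_noise, rng):
    """Start from Gaussian noise and apply the reverse step for t = T-1 .. 0."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(len(betas))):
        x = p_sample_step(x, t, betas, alpha_bars, predict_noise, rng)
    return x


def zero_noise_predictor(xt, t):
    """Hypothetical placeholder for a trained network; always predicts zero noise."""
    return [0.0] * len(xt)


rng = random.Random(0)
betas, alpha_bars = make_schedule(50)
generated = sample(64, betas, alpha_bars, zero_noise_predictor, rng)
```

With a placeholder predictor the output is not a meaningful embedding; the sketch only shows the shape of the loop. In practice `predict_noise` would wrap the trained model's forward pass.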

Experiments were run on a 3.3 GHz AMD EPYC 7002 series CPU. The code requires Python 3.8. For CPU instances, install the dependencies listed in requirements.txt:

pip3 install -r requirements.txt

Dataset

Three types of embeddings were generated from the LibriSpeech corpus:

  1. 64-dimensional, with 19k samples
  2. 128-dimensional, with 49k samples
  3. 704-dimensional, with 5k samples

Training

python3 main.py

Model

The Linear and UNet models are defined in model.py, which can be modified as required. The UNet model architecture is shown below.

UNet Architecture

Output Audio Samples

These audio samples were produced by passing the generated embeddings to a TTS engine.

Female Voice:

Male Voice:

Output

The t-SNE plot below compares generated and original data points: red points are original data points from the distribution, and blue points are generated ones.

Plot

References

[1] Denoising Diffusion Probabilistic Models
[2] Denoising Diffusion Implicit Models
[3] Diffusion Models Beat GANs on Image Synthesis


