bshall / UniversalVocoding

A PyTorch implementation of "Robust Universal Neural Vocoding"

Home Page: https://bshall.github.io/UniversalVocoding/

About Speaker Voice

shoegazerstella opened this issue · comments

I was playing with the preprocessing parameters and was able to change the sound of the synthesized voice a bit.
I was wondering if there is a clever way to do this in terms of pitch, energy, style, timbre, etc.
Thanks!

Hi @shoegazerstella,

It's fun to mess with the inputs, but I think changing the speech characteristics in any systematic way is pretty difficult. I remember the issue in #3 was that changing num_fft resulted in a pitch shift. I think a more principled method would be vocal tract length perturbation (see "Vocal tract length perturbation (VTLP) improves speech recognition" for details). It's relatively easy to mess with the mel filters in librosa, so that'd be a simple place to start.
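For reference, here's a minimal sketch of what a VTLP-style warp could look like on top of librosa. This isn't part of this repo's preprocessing; the file name, sample rate, FFT size, hop length, number of mel bands, and boundary frequency below are all assumptions, and the piecewise-linear warp follows the formulation in the VTLP paper.

```python
import numpy as np
import librosa

# Minimal VTLP-style sketch (not from this repo). The audio file, sample
# rate, FFT size, hop length, and boundary frequency are assumptions.
sr = 16000
n_fft = 2048
hop_length = 200
n_mels = 80
alpha = 1.1          # warp factor; values around 0.9-1.1 are typical
f_hi = 0.8 * sr / 2  # boundary frequency below which warping is linear

y, _ = librosa.load("example.wav", sr=sr)
spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))

# Piecewise-linear frequency warp: scale by alpha up to the boundary,
# then map the remainder linearly so the Nyquist frequency stays fixed.
freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
nyquist = sr / 2
boundary = f_hi * min(alpha, 1) / alpha
warped = np.where(
    freqs <= boundary,
    freqs * alpha,
    nyquist - (nyquist - f_hi * min(alpha, 1))
    / (nyquist - boundary) * (nyquist - freqs),
)

# Resample each frame of the magnitude spectrogram onto the warped grid.
warped_spec = np.stack(
    [np.interp(freqs, warped, spec[:, t]) for t in range(spec.shape[1])],
    axis=1,
)

# Apply the usual mel filterbank to the warped spectrogram.
mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
log_mel = np.log(np.maximum(mel_basis @ warped_spec, 1e-5))
```

You'd then feed the warped log-mel spectrogram to the vocoder in place of the original features and listen for the change; whether the result sounds natural depends on how far you push alpha.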

Otherwise, if you're interested in changing the speaker entirely I've done some work on voice conversion here. There are also a bunch of papers/repos that convert the spectrogram directly and then synthesize with a vocoder (happy to suggest some if you're interested).

> if you're interested in changing the speaker entirely I've done some work on voice conversion here. There are also a bunch of papers/repos that convert the spectrogram directly and then synthesize with a vocoder (happy to suggest some if you're interested).

Exactly, my aim is to change the speaker entirely.

I was reading more about voice cloning and found these two works:

But if I understand correctly, your approach to voice conversion is a bit different. I'll look more into it!
It would be awesome if you could suggest other approaches too!
Thanks a lot!

No problem!

Well, there are two options:

  1. Voice cloning (as you mentioned) - where you synthesize speech in a specific voice from text.
  2. Voice conversion - where you take audio from one speaker and directly convert it to a target speaker.

I think Real-Time-Voice-Cloning is the best available open-source project for voice cloning. For voice conversion, there are https://github.com/liusongxiang/StarGAN-Voice-Conversion and https://github.com/auspicious3000/autovc, for example.

Hope that helps!

So yes, there are indeed two approaches.
For the TTS part I was using an implementation of FastSpeech2 and, to be honest, I didn't want to change that because it's super fast on CPU.
So I might try both approaches and decide based on both output quality and speed.
Again thanks a lot! :)