r9y9 / ttslearn

ttslearn: Library for the book "Pythonで学ぶ音声合成" (Text-to-speech with Python)

Home Page: https://r9y9.github.io/ttslearn/

Question about the book.

mercuito opened this issue · comments

Hey @r9y9

I found your repository https://github.com/r9y9/wavenet_vocoder after doing some searching online for a starting point for building something like Respeecher offers (https://www.respeecher.com/). Specifically, not using text as the main input for speech synthesis, but the voice itself, since a lot of extra information gets lost when speech is simply reduced to text.

I've also found your talk on breaking down the WaveGAN approach (https://www.youtube.com/watch?v=BZxqf-Wkhig&t=330s). I found that really helpful and insightful, so I wanted to thank you for that.

So, the question I wanted to ask is, would this book be a good starting point for trying to get a recipe/workflow for training a network on a speaker's voice, then using the user's input voice as a guide for the synthesized speech? Could you recommend a good starting point?

The ParallelWaveGAN project (https://kan-bayashi.github.io/ParallelWaveGAN/) seems like the closest thing I could find to what I want, but it seems to be more oriented around TTS, and I couldn't get the voice conversion to actually work with anything other than the ground-truth samples from the training data.

Anyways, feel free to delete this, I just didn't know of a good way to contact you with the questions I have.

Thanks!

Hi, thank you for your questions.

So, the question I wanted to ask is, would this book be a good starting point for trying to get a recipe/workflow for training a network on a speaker's voice, then using the user's input voice as a guide for the synthesized speech?

Our book focuses on speech synthesis. So it might not be the best for you if you want to learn voice conversion. However, I think you might find our book useful because there are technical overlaps between voice conversion and speech synthesis. For example, the WaveNet vocoder is used as a neural vocoder in speech synthesis, while it is also used for voice conversion: https://arxiv.org/abs/2003.11750
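
To make that overlap concrete, here is a minimal sketch of the two-stage structure the two tasks share: an acoustic model produces a mel spectrogram, and a neural vocoder (WaveNet, Parallel WaveGAN, etc.) turns it into a waveform. The classes below are hypothetical placeholders of my own, not ttslearn's API or the paper's method.

```python
# Minimal sketch of the shared TTS / voice-conversion structure.
# The classes here are hypothetical placeholders: a real first stage would be
# Tacotron 2 (TTS) or a conversion model (VC), and a real second stage would
# be a WaveNet or Parallel WaveGAN vocoder.
import torch


class AcousticModel(torch.nn.Module):
    """Maps text (TTS) or source-speech features (VC) to a mel spectrogram."""

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # Placeholder output with shape (batch, n_mels, n_frames).
        return torch.randn(inputs.shape[0], 80, 100)


class NeuralVocoder(torch.nn.Module):
    """Maps a mel spectrogram to a waveform; this stage is common to TTS and VC."""

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        hop_size = 256  # assumed frame shift in samples
        return torch.randn(mel.shape[0], mel.shape[-1] * hop_size)


def synthesize(inputs: torch.Tensor) -> torch.Tensor:
    # TTS: text/linguistic features -> mel spectrogram -> waveform
    # VC:  source speech features   -> converted mel   -> waveform
    # Only the first stage differs; the vocoder is shared.
    mel = AcousticModel()(inputs)
    return NeuralVocoder()(mel)


if __name__ == "__main__":
    wav = synthesize(torch.zeros(1, 10))
    print(wav.shape)  # e.g. torch.Size([1, 25600])
```

Because only the first stage changes between the two tasks, vocoder repositories and techniques show up in both the speech synthesis and voice conversion literature.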

Could you recommend a good starting point?

It depends on your background and skills in machine learning and speech processing. There's no single answer, but here are my general suggestions:

  • If you are a beginner: Choose some books and read them. Our book can be a good starting point if you are familiar with machine learning but not with speech processing (a minimal usage sketch follows this list).
  • If you are a practitioner: https://github.com/kan-bayashi/ParallelWaveGAN and https://github.com/espnet/espnet will be good starting points.
  • If you have a background in both machine learning and speech processing: You can search for papers on Google Scholar and read the ones that interest you.
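
If the beginner route appeals to you, ttslearn ships pretrained models, so you can run the full pipeline before reading any code. A minimal sketch, assuming the DNNTTS quick-demo interface from the repository README (the exact class and method names are my assumption; check the docs for the current API):

```python
# Hedged sketch based on ttslearn's README quick demo; the class and method
# names (DNNTTS, tts) are assumptions -- see https://r9y9.github.io/ttslearn/
# for the current interface.
from ttslearn.dnntts import DNNTTS

engine = DNNTTS()  # loads a pretrained Japanese DNN-based TTS model
wav, sr = engine.tts("これはテスト音声です。")  # "This is a test utterance."
print(sr, len(wav) / sr)  # sample rate and duration in seconds
```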

That paper is excellent, very relevant, thank you for sharing it!
I could have been more precise about what I was looking for, but you seem to have gotten the right idea.
I'm looking for research/discussion about voice conversion systems that use non-parallel data, e.g. just the target speaker's voice samples, along with the various techniques used to make up for not having parallel data. This paper is a great start; I just wish I had the knowledge to find these kinds of papers myself. Thank you for your time and for providing the link.