fatchord / WaveRNN

WaveRNN Vocoder + TTS

Home Page: https://fatchord.github.io/model_outputs/

Speed up quick_start.py by running it with GPU

gerbill opened this issue · comments

commented

Hello!
I've been able to successfully generate audio files with quick_start.py, but using the CPU is pretty slow. If I use CUDA it's still not much faster, and only about 5% of my GPU is utilized. I assume audio generation could be sped up by at least 10x.
Is there an easy solution for this?
My experience with RNNs and ML is pretty limited :(
My GPU is GeForce RTX 2060 if that helps.
Thank you!

Facing the same problem. RNNs can't be parallelized across time steps because of their sequential architecture, so using a GPU won't speed up inference.
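To see why, here is a toy autoregressive loop in plain NumPy (an illustrative stand-in, not WaveRNN's actual cell): sample t can't be computed until sample t-1 exists, so the time loop runs one step at a time no matter how many GPU cores sit idle.

```python
import numpy as np

# Toy autoregressive generator: each sample feeds the next step, so the
# 24,000 iterations below must run strictly in order. A GPU only speeds up
# the math *inside* one step, not the steps themselves.
# (Stand-in only; WaveRNN's real cell is a conditioned GRU.)
hidden = np.zeros(16)
sample = 0.0
audio = []
for t in range(24_000):                               # ~1 second at 24 kHz
    hidden = np.tanh(0.9 * hidden + 0.1 * sample)     # stand-in for the RNN cell
    sample = float(hidden.mean())                     # stand-in for sampling the output
    audio.append(sample)
```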

commented

@zirlman , I see that when I'm using the CPU, it's 100% loaded on all 6 cores, so there is some degree of parallelization going on, I suppose.

Btw, one more question (sorry for going off topic): I've noticed that when I generate audio files, the speech quality is much lower than in the published samples, even when I use the very same text. Could you please advise which settings I might change to improve voice quality?
Thank you!

@gerbill , @zirlman

Did you set voc_gen_batched=True in your hparams.py?

If so, the inference speed of WaveRNN should be fast.
With voc_gen_batched=False I only got about 1,700 samples/sec.

voc_gen_batched=True splits a single utterance into multiple segments and stacks those segments like a batch, so they can be synthesized in parallel.

But this feature comes with a trade-off.

There are two options that control how parallel the synthesis is (see the sketch below):
voc_target decides how many samples go into each segment,
and voc_overlap decides how many samples overlap between adjacent segments.
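A rough NumPy sketch of the idea (my own simplification, not the repo's exact fold/unfold code): the utterance is cut into overlapping segments, each segment is decoded as one batch row, and the overlap regions are linearly cross-faded back together.

```python
import numpy as np

def fold_with_overlap(x, target, overlap):
    """Cut a 1-D sequence into segments of target + 2*overlap samples,
    hopped by target + overlap, padding the tail to fill the last fold."""
    hop = target + overlap
    num_folds = int(np.ceil((len(x) - overlap) / hop))
    padded = np.pad(x, (0, num_folds * hop + overlap - len(x)))
    return np.stack([padded[i * hop : i * hop + target + 2 * overlap]
                     for i in range(num_folds)])

def xfade_and_unfold(folded, target, overlap):
    """Overlap-add the folds, cross-fading the shared regions.
    (Edge fades at the very start and end are ignored for brevity.)"""
    num_folds, seg_len = folded.shape
    fade_in = np.linspace(0.0, 1.0, overlap)
    out = np.zeros(num_folds * (target + overlap) + overlap)
    for i in range(num_folds):
        seg = folded[i].astype(float).copy()
        seg[:overlap] *= fade_in          # fade in where the previous fold fades out
        seg[-overlap:] *= fade_in[::-1]   # fade out where the next fold fades in
        start = i * (target + overlap)
        out[start : start + seg_len] += seg
    return out

x = np.arange(1000.0)
folds = fold_with_overlap(x, target=400, overlap=100)   # shape (2, 600)
y = xfade_and_unfold(folds, target=400, overlap=100)    # interior matches x
```

Since the fade-out of one segment and the fade-in of the next sum to 1 over the overlap, the seam reconstructs the signal exactly when adjacent segments agree; in practice it just hides small discontinuities between independently decoded segments.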

In my experience, lowering voc_target makes inference much faster, but the synthesized audio quality gets worse.
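For reference, all three knobs live in hparams.py; the values below are roughly the repository defaults as I remember them, so double-check your own copy:

```python
# hparams.py -- generation settings (values shown are approximate defaults;
# verify against your own hparams.py before relying on them)
voc_gen_batched = True   # fold the utterance and decode segments in parallel
voc_target = 11_000      # samples per segment: lower = faster, lower quality
voc_overlap = 550        # samples shared between segments for the cross-fade
```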

@gerbill
There are many variables that can make your synthesized audio quality worse:
an under-trained TTS model, an under-trained WaveRNN model, etc.
Could you upload some more information?
(training steps of each model, your hparams.py, etc.)