bshall / UniversalVocoding

A PyTorch implementation of "Robust Universal Neural Vocoding"

Home Page: https://bshall.github.io/UniversalVocoding/

num_steps of training for those demo samples?

bayesrule opened this issue

Hi,

This repo is really great. May I ask the number of training steps (with batch_size 32) required for your demo samples? Given the amount of training data used here (around 26 hours of recordings), I guess the 100k num_steps provided in config.json is not enough, right?
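In case it helps, this is the kind of change I'm considering (the key names below are just my guess at the config.json layout, so adjust to the actual file):

```python
import json

# Read the current training schedule from config.json.
# NOTE: "training"/"num_steps" are guesses at the key names;
# check the repo's actual config.json for the real layout.
with open("config.json") as f:
    config = json.load(f)

print(config["training"]["num_steps"])  # 100000 in the shipped config?

# Try a longer schedule and write the config back out.
config["training"]["num_steps"] = 200_000
with open("config.json", "w") as f:
    json.dump(config, f, indent=4)
```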

Many thanks!

Hi @bayesrule,

Thanks! The audio on the demo page is generated with the pretrained model I uploaded, which was only trained for 100k steps. I was also surprised by how quickly it trains. You get intelligible samples by 20k steps and decent results by 60k-80k steps.

I've noticed that the generated audio for out-of-domain speakers is a bit noisy. I'm not sure if longer training would help with that or if it is a limitation of the ZeroSpeech dataset (which is pretty noisy).

Hi @bshall,

I was also surprised by how quickly it trains.
Could you share some data points w.r.t. absolute training time vs. corpus size and hardware used?
I'm building a TTS prototype based on Tacotron and am looking for a vocoder with better quality than Griffin-Lim (GL) but less training effort than, e.g., WaveNet requires.
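For concreteness, by GL I mean plain Griffin-Lim phase reconstruction from a magnitude spectrogram; a minimal torchaudio sketch (file names and STFT settings are placeholders):

```python
import torchaudio

# Load a waveform (placeholder path).
waveform, sr = torchaudio.load("sample.wav")

n_fft, hop_length = 1024, 256

# Linear magnitude spectrogram (power=1.0).
spec = torchaudio.transforms.Spectrogram(
    n_fft=n_fft, hop_length=hop_length, power=1.0
)(waveform)

# Griffin-Lim iteratively estimates the missing phase;
# `power` must match the spectrogram above.
griffin_lim = torchaudio.transforms.GriffinLim(
    n_fft=n_fft, hop_length=hop_length, power=1.0, n_iter=60
)
reconstructed = griffin_lim(spec)

torchaudio.save("reconstructed.wav", reconstructed, sr)
```

In a Tacotron pipeline you would first map the predicted mel spectrogram back to a linear one (e.g. with torchaudio.transforms.InverseMelScale) before this step.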
Thanks!

Hi @te0006,
Let me share my results; I'd be glad if they're useful to you.

https://tarepan.github.io/UniversalVocoding/

Dataset: about 10 hours of utterances in total
Machine: Google Colab (T4 GPU)
Other details: on the GitHub Pages site above

My impression is that RNN_MS is surprisingly fast and robust.

Hello, thanks for replying so quickly.

For such a short training run (5 hours / 60k steps), your results certainly sound impressive.
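A quick back-of-envelope check on those figures:

```python
# 60k steps in 5 hours, per the GitHub Pages write-up above.
steps, hours = 60_000, 5
print(steps / (hours * 3600))  # ~3.3 training steps per second on a T4
```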

I think training time is often neglected in publications, even though it is critically important for anyone looking to integrate or adapt a method, where you want to be able to experiment with parameters without prohibitive computational cost.

BTW, your last (English) sound example seems to exhibit considerably more noise and distortion than the Japanese ones (though perhaps, not speaking the language and thus not being used to hearing it, I simply cannot hear the artifacts in the Japanese examples).

Do you already have experience w.r.t. how far (and how fast) the speech quality improves with more training time?

Many reproducible experiments (including this repository) kindly report their training time. I agree with you and hope that papers themselves would report it too.

Your hearing is correct: the out-of-domain English utterance is noisier.
In my opinion, this is due to the language difference; English contains phonemes that do not exist in Japanese.

Not yet, but I will.