chrisdonahue / wavegan

WaveGAN: Learn to synthesize raw audio with generative adversarial networks


Results on 'continuous' speech (recommendations needed too)

jvel07 opened this issue · comments

Hi, I just want to share the results obtained so far when training WaveGAN on 'continuous' speech.
Description of the data: the dataset consists of 1000 WAVs, varying from 2 to 10 seconds long, from different speakers. In each WAV, one speaker says one arbitrary short phrase (in German).
I trained the model with data_slice_len=32768 and wavegan_dim=32, for 89000 iterations. The results look promising but still need improvement: there is some noise and the voice is not clear enough yet (a bit 'robotized'). @chrisdonahue, what would be your suggestion in this case?
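Just to put those settings in perspective, here is a quick back-of-the-envelope check (plain Python, not WaveGAN code) of how much audio each training slice actually covers at this sample rate:

```python
# Rough check of slice duration, assuming the settings mentioned above.
SAMPLE_RATE = 16000  # Hz, as used in this thread
SLICE_LEN = 32768    # samples per training slice (data_slice_len)

slice_seconds = SLICE_LEN / SAMPLE_RATE
print(f"{slice_seconds:.2f} s per slice")  # prints "2.05 s per slice"
```

So the generator only ever sees ~2 s windows, regardless of how long the original recordings are.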

Here is where you can listen to the generated audio:
https://soundcloud.com/jvel07/sets/wave-gans-generated-speech

I'm near 200k iterations and I still couldn't get rid of the 'robotized' voice. Does anybody have any suggestions, please? :)
@chrisdonahue @andimarafioti

What sampling rate are you using? Are you training and generating on different lengths (2-10 secs), or is the length fixed? To what? Your dataset sounds more complex than the one used in the WaveGAN experiments, so your results are not completely unexpected.

Thanks, @andimarafioti, for your answer. Indeed, I am aware that my dataset differs from the ones used in your experiments in that I fed in variable-length recordings (I read that fixed lengths are preferable, but I wanted to try it out anyway). The sample rate is 16k.

Which options are you using for the data? E.g. --data_first_slice, --data_pad_end, and so on.

Hi, @spagliarini . Those are as follows:
data_fast_wav,True
data_first_slice,False
data_normalize,False
data_num_channels,1
data_overlap_ratio,0.0
data_pad_end,False
data_prefetch_gpu_num,0
data_sample_rate,16000
data_slice_len,32768
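One thing worth double-checking with these settings: with overlap 0 and no end padding, the loader keeps only whole 32768-sample slices per clip. The following is my own re-derivation of that behavior (a hypothetical helper, not the actual loader code), which suggests the shortest clips may contribute nothing at all:

```python
# How many training slices a clip yields under the settings above
# (data_overlap_ratio=0, data_pad_end=False): hypothetical re-derivation,
# not WaveGAN's actual data loader.
def num_slices(n_samples, slice_len=32768, pad_end=False):
    if pad_end:
        return -(-n_samples // slice_len)  # ceil division: pad final slice
    return n_samples // slice_len          # trailing remainder is dropped

print(num_slices(5 * 16000))  # 5 s clip at 16 kHz -> 2 slices
print(num_slices(2 * 16000))  # 2 s clip -> 0 slices (shorter than one slice!)
```

If that matches the real loader, the 2-second recordings in the dataset would be silently discarded unless data_pad_end is enabled.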

I'm sorry @jvel07, I can't really help you much further with this, since I'm not so familiar with the WaveGAN code and results. If you want to try our project, which tackles the same problem as WaveGAN but uses a different representation of sound, I could help further (https://github.com/tifgan/stftGAN). For what it's worth, I would try smaller slices, maybe half of what you have.

@andimarafioti I understand, thanks anyway for your suggestions. Let's wait for this to reach @chrisdonahue.
Regarding your repo, I haven't used TF representations before. Can they be used to extract, e.g., MFCCs?

Actually, MFCC is a TF representation. TF just stands for time-frequency: the representation has two dimensions, one for time and one for frequency.
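To make that concrete, here is a minimal sketch of a time-frequency representation: a magnitude spectrogram computed with a framed FFT in plain NumPy. The window and hop sizes are arbitrary choices for illustration, not values from stftGAN:

```python
import numpy as np

# Minimal TF-representation sketch: magnitude spectrogram via framed FFT.
# win/hop are illustrative values, not stftGAN's settings.
def spectrogram(x, win=256, hop=128):
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))  # (time, freq)

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
S = spectrogram(x)
print(S.shape)  # prints "(124, 129)": 124 frames x 129 frequency bins
```

MFCCs are then obtained from such a spectrogram by applying a mel filterbank, a log, and a DCT; libraries like librosa do all of that in one call.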

@jvel07 the samples you linked are not too far from what I've heard from training WaveGAN on more complex speech datasets (see our paper results on the TIMIT dataset).

One thing that might improve things is increasing the model dimensionality from 32 to 64 (or larger).

You could try adding a post-processing filter with --wavegan_genr_pp which might help with the noise. You might also consider training for longer (I usually trained for 200k iterations or so).
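For anyone unfamiliar with what the post-processing filter does conceptually: it is a single learned FIR filter convolved with the generator output, which can smooth out high-frequency noise. A toy NumPy sketch (the taps here are a placeholder moving average; in WaveGAN the coefficients are trained, and this is not the repo's implementation):

```python
import numpy as np

# Conceptual sketch of a post-processing filter: one FIR filter convolved
# with the generator output. Placeholder moving-average taps for
# illustration; WaveGAN learns these coefficients during training.
PP_LEN = 512  # filter length discussed in this thread

taps = np.full(PP_LEN, 1.0 / PP_LEN)    # placeholder coefficients
fake_audio = np.random.randn(32768)     # stand-in for generator output
filtered = np.convolve(fake_audio, taps, mode="same")
print(filtered.shape)  # prints "(32768,)": same length as the input
```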

The data loader settings you linked seem fine.

Thanks, @chrisdonahue. Will try that out. About the length of that filter, may I keep it at 512?