chrisdonahue / wavegan

WaveGAN: Learn to synthesize raw audio with generative adversarial networks


Results on 'continuous' speech (recommendations needed too)

jvel07 opened this issue · comments

Hi, I just want to share the results obtained so far when training WaveGAN on 'continuous' speech.
Description of the data: the dataset consists of 1000 WAVs, varying from 2 to 10 seconds long, from different speakers. In each WAV, one speaker says one arbitrary short phrase (in German).
I trained the model with data_slice_len=32768 and wavegan_dim=32, for 89000 iterations. The results look promising but still need improvement: there is some noise and the voice is not clear enough yet (a bit 'robotized'). @chrisdonahue, what would be your suggestion in this case?
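Just to put those settings in perspective, here is a quick back-of-the-envelope check (plain Python, not WaveGAN code) of how much audio each training slice actually covers at this sample rate:

```python
# Rough check of slice duration, assuming the settings mentioned above.
SAMPLE_RATE = 16000  # Hz, as used in this thread
SLICE_LEN = 32768    # samples per training slice (data_slice_len)

slice_seconds = SLICE_LEN / SAMPLE_RATE
print(f"{slice_seconds:.2f} s per slice")  # prints "2.05 s per slice"
```

So the generator only ever sees ~2 s windows, regardless of how long the original recordings are.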

Here is where you can listen to the generated audio:
https://soundcloud.com/jvel07/sets/wave-gans-generated-speech

I'm near 200k iterations and I still couldn't get rid of the 'robotized' voice. Does anybody have any suggestions, please? :)
@chrisdonahue @andimarafioti

What sampling rate are you using? Are you training and generating on different lengths (2-10 secs), or is the length fixed? To what? Your dataset sounds more complex than the one used in the WaveGAN experiments, so your results are not completely unexpected.

Thanks, @andimarafioti, for your answer. Indeed, I am aware that my dataset differs from the ones used in your experiments in that I fed in variable-length recordings (I read that fixed lengths are preferable, but I wanted to try it out anyway). The sample rate is 16k.

Which options are you using for the data? E.g. --data_first_slice, --data_pad_end, and so on.

Hi, @spagliarini . Those are as follows:
data_fast_wav,True
data_first_slice,False
data_normalize,False
data_num_channels,1
data_overlap_ratio,0.0
data_pad_end,False
data_prefetch_gpu_num,0
data_sample_rate,16000
data_slice_len,32768
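One thing worth double-checking with these settings: with overlap 0 and no end padding, the loader keeps only whole 32768-sample slices per clip. The following is my own re-derivation of that behavior (a hypothetical helper, not the actual loader code), which suggests the shortest clips may contribute nothing at all:

```python
# How many training slices a clip yields under the settings above
# (data_overlap_ratio=0, data_pad_end=False): hypothetical re-derivation,
# not WaveGAN's actual data loader.
def num_slices(n_samples, slice_len=32768, pad_end=False):
    if pad_end:
        return -(-n_samples // slice_len)  # ceil division: pad final slice
    return n_samples // slice_len          # trailing remainder is dropped

print(num_slices(5 * 16000))  # 5 s clip at 16 kHz -> 2 slices
print(num_slices(2 * 16000))  # 2 s clip -> 0 slices (shorter than one slice!)
```

If that matches the real loader, the 2-second recordings in the dataset would be silently discarded unless data_pad_end is enabled.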

I'm sorry @jvel07, I can't really help you much further with this, since I'm not so familiar with the WaveGAN code and results. If you want to try our project, which tackles the same problem as WaveGAN but uses a different representation of sound, I could help further (https://github.com/tifgan/stftGAN). For what it's worth, I would try smaller slices, maybe half of what you have.

@andimarafioti I understand, thanks anyway for your suggestions. Let's wait for this to reach @chrisdonahue.
Regarding your repo, I haven't used TF representations before. Can they be used to extract, e.g., MFCCs?

Actually, MFCC is a TF representation. TF just stands for time-frequency: the representation has two dimensions, one for time and one for frequency.
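To make that concrete, here is a minimal sketch of a time-frequency representation: a magnitude spectrogram computed with a framed FFT in plain NumPy. The window and hop sizes are arbitrary choices for illustration, not values from stftGAN:

```python
import numpy as np

# Minimal TF-representation sketch: magnitude spectrogram via framed FFT.
# win/hop are illustrative values, not stftGAN's settings.
def spectrogram(x, win=256, hop=128):
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))  # (time, freq)

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
S = spectrogram(x)
print(S.shape)  # prints "(124, 129)": 124 frames x 129 frequency bins
```

MFCCs are then obtained from such a spectrogram by applying a mel filterbank, a log, and a DCT; libraries like librosa do all of that in one call.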

@jvel07 the samples you linked are not too far from what I've heard from training WaveGAN on more complex speech datasets (see our paper results on the TIMIT dataset).

One thing that might improve things is increasing the model dimensionality from 32 to 64 (or larger).

You could try adding a post-processing filter with --wavegan_genr_pp which might help with the noise. You might also consider training for longer (I usually trained for 200k iterations or so).
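For anyone unfamiliar with what the post-processing filter does conceptually: it is a single learned FIR filter convolved with the generator output, which can smooth out high-frequency noise. A toy NumPy sketch (the taps here are a placeholder moving average; in WaveGAN the coefficients are trained, and this is not the repo's implementation):

```python
import numpy as np

# Conceptual sketch of a post-processing filter: one FIR filter convolved
# with the generator output. Placeholder moving-average taps for
# illustration; WaveGAN learns these coefficients during training.
PP_LEN = 512  # filter length discussed in this thread

taps = np.full(PP_LEN, 1.0 / PP_LEN)    # placeholder coefficients
fake_audio = np.random.randn(32768)     # stand-in for generator output
filtered = np.convolve(fake_audio, taps, mode="same")
print(filtered.shape)  # prints "(32768,)": same length as the input
```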

The data loader settings you linked seem fine.

Thanks, @chrisdonahue. Will try that out. About the length of that filter, may I keep it at 512?