RF5 / simple-speaker-embedding

A speaker embedding network in PyTorch that is very quick to set up and use for whatever purpose you need.
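
A minimal usage sketch (the torch.hub entry-point name and sample rate below are assumptions; see the repo's hubconf.py and README for the exact model names and expected input):

```python
import torch

# Load a pretrained speaker embedder via torch.hub.
# NOTE: the entry point name 'convgru_embedder' is an assumption here;
# check hubconf.py in RF5/simple-speaker-embedding for the actual names.
model = torch.hub.load('RF5/simple-speaker-embedding', 'convgru_embedder')
model.eval()

# A single utterance as a (batch, samples) float tensor (random audio here;
# 16 kHz is assumed -- check the README for the expected sample rate).
wav = torch.randn(1, 16000 * 3)

with torch.no_grad():
    emb = model(wav)   # -> (1, embedding_dim) speaker embedding
print(emb.shape)
```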


ConvGRU Design, Dataset Size

kradonneoh opened this issue

Hey!

I had a few questions regarding the choices made when designing the ConvGRU network, and wanted to get your thoughts on extensions to the dataset.

For the ConvGRU network, why did you decide to go with raw waveforms as opposed to log-scale mel spectrograms (which often seem to be the first choice for convolutional embedding networks)? Did you experiment with both and find raw waveforms to be better? Also, did you ever try a fully convolutional approach, or one with transformer / self-attention blocks?
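
(For concreteness, by log-scale mel spectrograms I mean a front end along these lines; this is just an illustrative torchaudio sketch with placeholder parameters, not settings from this repo.)

```python
import torch
import torchaudio

# Illustrative log-mel front end; n_fft / hop_length / n_mels are placeholder
# values, not anything used in this repo.
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80
)
to_db = torchaudio.transforms.AmplitudeToDB()

wav = torch.randn(1, 16000 * 3)      # a 3 s utterance at 16 kHz (random audio)
log_mel = to_db(melspec(wav))        # (1, 80, frames) log-scale mel spectrogram
print(log_mel.shape)
```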

In terms of the dataset, did you ever try multilingual data in addition to English data? I'm wondering if adding content that isn't English would help the model ignore content even more than it already does.

Hi!

For your questions:

  • Why a convolutional encoder on raw waveforms instead of mel spectrograms: I did this to follow the trend in self-supervised representation models (wav2vec, HuBERT, WavLM, ...), which seem to find better representations by learning a convolutional encoder directly from the waveform. Conceptually I also like it since it is simpler and more end-to-end, baking everything into the model itself. I have not precisely measured the difference between mel-spectrogram input and a learned convolutional encoder, but I suspect the convolutional encoder is slightly better.
  • I never tried a fully convolutional encoder + transformer main network. I suspect this might actually work better than the GRU that I currently have available, so feel free to train one with a transformer main network; I suspect the results would be better (a rough sketch of what such a network could look like follows this list).
  • I have not trained on a multilingual dataset. Ideally the model should be trained on multilingual data with as many languages and speakers as possible. This is primarily a resource limitation for me -- if you have the data and compute, I am almost certain the results will improve.
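
To make the second point concrete, here is a rough, untested sketch of a waveform conv encoder feeding a transformer main network. All layer sizes and hyperparameters are made up for illustration and do not describe this repo's models:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvTransformerEmbedder(nn.Module):
    """Hypothetical speaker embedder: a strided Conv1d encoder over the raw
    waveform (wav2vec-style) followed by a transformer encoder.
    All sizes here are illustrative, not taken from this repo."""

    def __init__(self, dim=256, n_layers=4, emb_dim=256):
        super().__init__()
        # Each Conv1d downsamples the waveform in time.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=4, stride=2), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=4, stride=2), nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=4 * dim, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(dim, emb_dim)

    def forward(self, wav):                      # wav: (batch, samples)
        x = self.encoder(wav.unsqueeze(1))       # (batch, dim, frames)
        x = self.transformer(x.transpose(1, 2))  # (batch, frames, dim)
        x = x.mean(dim=1)                        # temporal mean pooling
        return F.normalize(self.proj(x), dim=-1) # unit-norm speaker embedding

model = ConvTransformerEmbedder()
emb = model(torch.randn(2, 16000 * 2))           # two 2 s utterances (random audio)
print(emb.shape)                                 # torch.Size([2, 256])
```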

Hope that helps!

Closing for now since this seems inactive; feel free to reopen if you have any more questions.

Thanks for the response! I did have a few more questions about training and implementation:

  1. How long did you train the ConvGRU model for, and on what hardware? Do you think it could benefit from more iterations, or did the validation loss plateau?
  2. If a user provides multiple utterances at inference time, is performance (by EER) better when the predicted embeddings are averaged, or is the gain small compared to using a single utterance?

(I couldn't find a way to re-open the issue, so I'm hoping you'll still get a notification for this)

Ahh sure thing:

  1. 700k updates (as in the checkpoint filename) on 1x 2070 SUPER GPU over a couple of weeks. I found the validation loss had more or less plateaued at that stage, but it was not increasing, so it might still get a slight benefit from training longer.
  2. Typically yes: the speaker embedding is more stable if you average it over several utterances. However, I did not find this to hold for all speakers, and it depends on the nature of each utterance (e.g. if one utterance is mostly shouting while the others are normal speech, the mean speaker embedding might be wonky). I didn't make detailed measurements of one-shot vs few-shot averaging, though. A small sketch of the averaging itself follows this list.
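
To illustrate the averaging in point 2, a small sketch (assuming `embedder` is any model from this repo that maps a (1, samples) waveform to a (1, D) embedding):

```python
import torch
import torch.nn.functional as F

def mean_speaker_embedding(embedder, utterances):
    """Average per-utterance embeddings into one speaker embedding.
    `utterances` is a list of 1-D waveform tensors (lengths may differ)."""
    with torch.no_grad():
        embs = [embedder(u.unsqueeze(0)) for u in utterances]  # each (1, D)
    mean = torch.cat(embs, dim=0).mean(dim=0, keepdim=True)    # (1, D)
    return F.normalize(mean, dim=-1)  # renormalise after averaging

# Verification is then a cosine similarity between the averaged enrolment
# embedding and the embedding of a test utterance, e.g.:
#   score = F.cosine_similarity(
#       mean_speaker_embedding(embedder, enrol_utts),
#       embedder(test_utt.unsqueeze(0)))
```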

Hope that helps!

Closing again since this seems inactive; feel free to reopen if you have any more questions.