bshall / UniversalVocoding

A PyTorch implementation of "Robust Universal Neural Vocoding"

Home Page: https://bshall.github.io/UniversalVocoding/


Why the embedding layer instead of the one-hot audio vector?

ivancarapinha opened this issue · comments

Hello,

In the original implementation of this model, the authors employed a one-hot audio vector of dimension 1024. Unfortunately, the paper does not say much about this one-hot vector or explain its purpose in the model. Given that its dimension is 1024 = 2^10, and that the authors use 10-bit audio samples, I assume this vector is related to the prediction of each audio sample. But that's just a guess.

So, I have two (actually three) questions:

  1. What is the purpose of the one-hot audio vector in the original implementation?
  2. Why did you replace the one-hot vector with an embedding layer? What changed in the model behavior with this replacement?

Thank you very much

Hi @ivancarapinha,

Sorry about the delay!

Yeah, the paper is very vague about the model details. You're correct that the one-hot representation is related to the 10-bit audio. Basically, they apply mu-law companding to the original 16-bit audio and quantize it to 10 bits (1024 levels). Each sample then becomes a one-hot vector with the 1 at the index given by the quantized mu-law value. These vectors are fed into the autoregressive part of the model.
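In case it helps, here is a minimal sketch of that encoding step. The function names are hypothetical (not from the repo), and it assumes samples are already normalized to [-1, 1]:

```python
import math

def mulaw_encode(x, bits=10):
    """Map a sample in [-1, 1] to an integer index in [0, 2**bits - 1]
    via mu-law companding (illustrative helper, not the repo's code)."""
    mu = 2 ** bits - 1
    # mu-law companding: compresses the sample non-linearly into [-1, 1]
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # quantize the companded value to one of 2**bits levels
    return int((y + 1) / 2 * mu + 0.5)

def one_hot(index, size=1024):
    """One-hot vector with a 1 at the companded sample's index."""
    v = [0.0] * size
    v[index] = 1.0
    return v

# e.g. a positive sample maps to an index in the upper half of [0, 1023]
vec = one_hot(mulaw_encode(0.5))
```

Each of these 1024-dimensional vectors is what the original model feeds to the autoregressive GRU at every time step.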

I used an embedding layer just to make the model a bit more efficient. The first operation in a GRU is a matrix multiplication with the input, so a one-hot input simply picks out one row of the weight matrix (which is exactly what an embedding layer does). I just separated out the embedding operation and used a smaller dimension, which hopefully sped up training a little. It should work fine if you go with the original one-hot approach, though.
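The equivalence is easy to check numerically. A quick sketch with NumPy (sizes are illustrative, not the repo's exact config):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 1024, 256                   # 10-bit levels, arbitrary embedding size
W = rng.standard_normal((vocab, dim))    # embedding table / input weight matrix

idx = 37                                 # some quantized sample value
one_hot = np.zeros(vocab)
one_hot[idx] = 1.0

# multiplying a one-hot vector by W just selects row `idx` ...
via_matmul = one_hot @ W
# ... which is what an embedding lookup does, without the 1024-wide multiply
via_lookup = W[idx]

assert np.allclose(via_matmul, via_lookup)
```

So the embedding layer computes the same thing, skipping a mostly-zero matrix multiply, and decoupling the embedding dimension from the 1024-way vocabulary lets you keep the GRU input small.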

Hope that helps.