thevasudevgupta / gsoc-wav2vec2

GSoC'2021 | TensorFlow implementation of Wav2Vec2

Home Page: https://thevasudevgupta.github.io/gsoc-wav2vec2/assets/final_report


Questions about processor

ahmedlone127 opened this issue

What does this code do?

def _normalize(self, x):
        """You must call this before padding."""
        # -> (1, seqlen)
        mean = tf.reduce_mean(x, axis=-1, keepdims=True)
        var = tf.math.reduce_variance(x, axis=-1, keepdims=True)
        return tf.squeeze((x - mean) / tf.sqrt(var + 1e-5))

my other question is on what basis are numbers assigned to the vocab list by that i mean this :
[screenshot: the vocabulary-building code]

I understand the code in the picture; it basically collects all the characters from the text. My question is: when it turns those characters into a dictionary with their indices as values, does it matter which character ends up at which index, and if yes, how does the right character get to the right index? I was trying to test my version of your tokenizer and had trouble producing the right outputs with your vocab.json, so I took the one here instead, which worked fine. Also, I was using a fine-tuned model for making predictions, which was associated with this tokenizer via Hugging Face.

Hi @ahmedlone127,

Thanks for your interest in this project!!

def _normalize(self, x):
        """You must call this before padding."""
        # -> (1, seqlen)
        mean = tf.reduce_mean(x, axis=-1, keepdims=True)
        var = tf.math.reduce_variance(x, axis=-1, keepdims=True)
        return tf.squeeze((x - mean) / tf.sqrt(var + 1e-5))

Wav2Vec2 was trained on speech normalised along the time axis, so this code provides that functionality. In my repository, Wav2Vec2Processor has 2 different functionalities: one handles preprocessing of speech (when is_tokenizer=False) and the other handles post-processing of model outputs, i.e. decoding logits into strings (when is_tokenizer=True). So the above code is relevant to an instance created with is_tokenizer=False. You can refer to this notebook for a better understanding.
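As a quick illustration, here is a minimal sketch of what that normalisation does to a raw waveform (the 16 kHz dummy input and its shape are just assumptions for the example):

import tensorflow as tf

# hypothetical 1-second clip at 16 kHz; shape (seqlen,)
speech = tf.random.normal((16000,))

# same steps as _normalize: zero-mean, unit-variance along time
mean = tf.reduce_mean(speech, axis=-1, keepdims=True)
var = tf.math.reduce_variance(speech, axis=-1, keepdims=True)
normalized = tf.squeeze((speech - mean) / tf.sqrt(var + 1e-5))

print(tf.reduce_mean(normalized).numpy())           # ~0.0
print(tf.math.reduce_variance(normalized).numpy())  # ~1.0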

My other question is: on what basis are numbers assigned to the vocab list?

This vocabulary file (https://github.com/vasudevgupta7/gsoc-wav2vec2/blob/main/data/vocab.json) is used for de-tokenizing. It was taken directly from the pre-trained Wav2Vec2 model, so the character-to-index mapping does matter: it must stay exactly as the model was trained with, otherwise the logits would decode into the wrong characters.
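As a rough sketch of how that file drives de-tokenizing (the index values in the example are made up):

import json

# data/vocab.json maps characters to the indices the model was trained with
with open("data/vocab.json") as f:
    vocab = json.load(f)

# invert it so model output indices map back to characters
id_to_char = {i: c for c, i in vocab.items()}

# hypothetical argmax indices taken from the model's logits
predicted_ids = [7, 5, 12]
print("".join(id_to_char[i] for i in predicted_ids))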

Hope this helps!!

Hey, thanks for the answer! I just ran the notebook you attached, and it looks like some of the stuff needs to be updated.

I just fixed it now. Can you try running that notebook again?

Yeah, looks good! Thanks. Also, why do you specify axis=-1 and keepdims=True?

I was trying to duplicate this in Scala, and this is what I got up till now:

  def mean(xs: Seq[Double]): Double = if (xs.isEmpty) 0.0 else xs.sum / xs.size

  // mean returns a plain Double (not an Option), so no flatMap is needed
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    mean(xs.map(x => math.pow(x - m, 2)))
  }


It's for the first two lines. Do they look good to you? I am anxious because I don't understand what keepdims=True and axis=-1 mean, and I am probably not adding their functionality inside this function.

I am using axis=-1 to make sure normalization happens along the time dimension. keepdims=True keeps the output as an nD array when the input is an nD array.
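For a quick illustration of both arguments (the toy tensor is made up):

import tensorflow as tf

x = tf.constant([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])  # shape (2, 3): (batch, time)

# axis=-1 reduces along the last (time) dimension
print(tf.reduce_mean(x, axis=-1))                 # shape (2,):   [2. 5.]

# keepdims=True keeps the reduced axis as size 1, so the result
# broadcasts cleanly against x in (x - mean) / tf.sqrt(var + 1e-5)
print(tf.reduce_mean(x, axis=-1, keepdims=True))  # shape (2, 1): [[2.] [5.]]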

I would encourage you to print out the outputs of these statements to understand them better. Since I am not familiar with Scala, I am not sure whether your code is correct.

Okay, thanks!

Okay, so I am pretty much done with verifying the outputs. Even though I couldn't implement axis=-1, the result looked identical, with a lot more precision. I want to ask: why do we call tf.transpose here, even though the output before and after calling it is pretty much the same?

[screenshot: the code calling tf.transpose]

Hey, sorry for the late reply. You can avoid tf.transpose if everything looks alright without it.
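For what it's worth, one possible reason the output looks unchanged: tf.transpose with the default perm is a no-op on a rank-1 tensor, so if your tensor is 1-D at that point (e.g. after tf.squeeze), transposing changes nothing. A tiny sketch:

import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])  # rank-1 tensor
print(tf.transpose(x).numpy())    # [1. 2. 3.] -- identical to x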

Closing this issue as everything is resolved. Please create a new issue in case you want to discuss something.