thevasudevgupta / gsoc-wav2vec2

GSoC'2021 | TensorFlow implementation of Wav2Vec2

Home Page: https://thevasudevgupta.github.io/gsoc-wav2vec2/assets/final_report

Training Wav2Vec2 model on 100h & experiment-2

thevasudevgupta opened this issue · comments

@sayakpaul, sorry for the delay again. I have started serious experimentation now and will keep you posted with the results. I am starting with experiment-2 for now, as mentioned in vasudevgupta7/compressed-wav2vec2#1. I will post all the results in this issue by tomorrow (TPUs are running now!!)

| Experiment description | WER | Wandb |
| --- | --- | --- |
| wav2vec2-960h (Facebook version) | 3% | - |
| wav2vec2-960h (trained during GSoC) | 5.6% | - |
| wav2vec2-100h | 7.4% | https://wandb.ai/7vasudevgupta/gsoc-wav2vec2/runs/lwiepmm0 |
| wav2vec2-100h (skipped stage-1) | 8.2% | https://wandb.ai/7vasudevgupta/gsoc-wav2vec2/runs/h0bug1zp |
| wav2vec2-100h (train conv also) | 9.1% | https://wandb.ai/7vasudevgupta/gsoc-wav2vec2/runs/2iro0pl0, https://wandb.ai/7vasudevgupta/gsoc-wav2vec2/runs/284a713r |
| distilled wav2vec2-100h |  | https://wandb.ai/7vasudevgupta/wav2vec2-distillation/runs/2h82mhgc |

Evaluation script: https://colab.research.google.com/drive/1aNgochNmchx1R5TcoVH7nM0uPkmxNqE1?usp=sharing

Just wanted to ask one thing: is it fine if I code in my GSoC repository, or should I code in this private repo?

I think to keep it tidy we could use this repo, and once we have settled on something we could incorporate that into the GSoC repo. WDYT?

I will check out the results tomorrow and share my comments.

> I think to keep it tidy we could use this repo, and once we have settled on something we could incorporate that into the GSoC repo. WDYT?

Yeah! That would be good.

@vasudevgupta7 seems like the training is now done? The training progress (loss-wise) looks good to me.

Also just for my own reference, this is in regards to distilling the wav2vec2 model fine-tuned on speech recognition, correct?

Wanted to know a bit more about the student architectures. Could you provide brief overviews?

@sayakpaul,

The above experiments are just normal fine-tuning of wav2vec2 on 100h of LibriSpeech data. Since training on 960h takes a lot of time, I want to establish a baseline on a small amount of data so that further experiments can start from there. (We will definitely train on 960h data in the end; this is just to cut experimentation time for now, since the 100h model also gives a reasonable WER.)
Further, since the experiments involve two-stage training, I wanted to check whether we can follow only stage-1 for further experimentation.

I will post brief overviews for every training experiment (in the table) by tonight!

I am going to do distillation training today.

Got it. But didn't we have models fine-tuned on the LibriSpeech dataset (100h) already?

> Further, since the experiments involve two-stage training, I wanted to check whether we can follow only stage-1 for further experimentation.

By two-stage, do you mean training of both student and teacher models? In any case, I think when it's applicable we should be able to use the pre-trained (fine-tuned) models as teachers.

> I want to establish a baseline on a small amount of data so that further experiments can start from there.

Perfectly fine.

> Got it. But didn't we have models fine-tuned on the LibriSpeech dataset (100h) already?

No, I directly trained on 960h earlier.

> By two-stage, do you mean training of both student and teacher models? In any case, I think when it's applicable we should be able to use the pre-trained (fine-tuned) models as teachers.

By two stages, I mean this: #17 (comment)

Gotcha. Thank you.

Hello @sayakpaul, I trained the first distillation model yesterday. Unfortunately, it didn't perform well, though it is trying to learn (not all predicted tokens are random). I am trying to change the initialisation strategy and some hyperparameters to get it working.

- teacher: https://tfhub.dev/vasudevgupta7/wav2vec2-960h/1
- student: a smaller version of the same architecture
- loss: `alpha * KL-divergence loss + (1 - alpha) * CTC loss` (see the sketch below)
- script: https://github.com/vasudevgupta7/compressed-wav2vec2/blob/part_2/src/train_distilled.py
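
For reference, here is a minimal TensorFlow sketch of that weighted objective. This is not the code from the linked script; the tensor shapes and the `blank_index=0` choice are my assumptions.

```python
import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, labels,
                      logit_lengths, label_lengths, alpha=0.5):
    """alpha * KL(teacher || student) + (1 - alpha) * CTC loss.

    teacher_logits, student_logits: (batch, time, vocab)
    labels: (batch, max_label_len) dense int32 token ids
    logit_lengths, label_lengths: (batch,) valid lengths per example
    """
    # KL divergence between teacher and student output distributions,
    # averaged over time steps and the batch.
    teacher_probs = tf.nn.softmax(teacher_logits, axis=-1)
    student_log_probs = tf.nn.log_softmax(student_logits, axis=-1)
    kl = tf.reduce_sum(
        teacher_probs * (tf.math.log(teacher_probs + 1e-8) - student_log_probs),
        axis=-1,
    )
    kl_loss = tf.reduce_mean(kl)

    # Standard CTC loss against the ground-truth transcripts.
    ctc = tf.nn.ctc_loss(
        labels=labels,
        logits=student_logits,
        label_length=label_lengths,
        logit_length=logit_lengths,
        logits_time_major=False,
        blank_index=0,  # assumption: the pad/blank token id is 0
    )
    ctc_loss = tf.reduce_mean(ctc)

    return alpha * kl_loss + (1.0 - alpha) * ctc_loss
```

Setting `alpha=1.0` here would correspond to dropping the labeled (CTC) signal entirely.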

Are you training the student for longer? How's the training progress?

What happens if we only use KL-divergence and completely get rid of the labeled signal?

Currently only for 10 epochs (logs: https://wandb.ai/7vasudevgupta/wav2vec2-distillation/runs/2h82mhgc?workspace=user-7vasudevgupta). I need to play around with alpha. Will do these experiments today.

@vasudevgupta7 I get a 404 after clicking on the above-mentioned link.

I think we need an augmentation pipeline to regularize the student training so that it is able to match the underlying teacher. The FunMatch paper addresses this with an aggressive form of MixUp and a much longer training schedule to compensate for it (a rough waveform MixUp sketch is below).
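
Just to make the MixUp idea concrete for audio, a rough sketch of mixing raw waveforms could look like this. The function name and shapes are illustrative, and the returned coefficients/indices would also be needed to mix the teacher's soft targets in the same proportions.

```python
import tensorflow as tf

def mixup_waveforms(speech_batch, mixup_alpha=0.4):
    """Mix each raw waveform with a randomly chosen partner from the batch.

    speech_batch: float32 tensor of shape (batch, num_samples).
    Returns the mixed waveforms, the mixing coefficients, and the
    partner indices.
    """
    batch_shape = tf.shape(speech_batch)[:1]

    # lam ~ Beta(alpha, alpha), sampled via two Gamma draws.
    gamma1 = tf.random.gamma(batch_shape, mixup_alpha)
    gamma2 = tf.random.gamma(batch_shape, mixup_alpha)
    lam = gamma1 / (gamma1 + gamma2)   # shape (batch,)
    lam = tf.reshape(lam, (-1, 1))     # broadcast over audio samples

    # Pair every example with a shuffled partner from the same batch.
    indices = tf.random.shuffle(tf.range(tf.shape(speech_batch)[0]))
    partners = tf.gather(speech_batch, indices)

    mixed = lam * speech_batch + (1.0 - lam) * partners
    return mixed, lam, indices
```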

Translating that to speech is difficult, I agree, and this is where I believe we have opportunities. It might be worth taking a look at AugLy, an open-source framework providing augmentation transformations for different data modalities, including audio. This might help us curate an augmentation pipeline for our purpose (one way to wire it into the input pipeline is sketched below).
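
If we go the AugLy route, one way to hook a numpy-based augmentation into the existing tf.data input pipeline is via `tf.numpy_function`, so it runs on the host CPU before batches reach the TPU. The specific AugLy calls and their keyword arguments below are from memory and should be treated as assumptions to verify against the AugLy docs.

```python
import numpy as np
import tensorflow as tf
import augly.audio as audaugs  # pip install augly (plus its audio extras)

SAMPLE_RATE = 16000  # LibriSpeech audio is 16 kHz

def _augment_np(speech):
    """Randomly perturb one waveform (np.float32) on the host CPU.

    NOTE: the AugLy function/keyword names here are assumptions and
    should be checked against AugLy's docs before use.
    """
    if np.random.rand() < 0.5:
        speech, _ = audaugs.pitch_shift(speech, sample_rate=SAMPLE_RATE, n_steps=2.0)
    if np.random.rand() < 0.5:
        speech, _ = audaugs.add_background_noise(speech, sample_rate=SAMPLE_RATE, snr_level_db=15.0)
    return speech.astype(np.float32)

def augment(speech, label):
    """tf.data map fn wrapping the numpy augmentation."""
    speech = tf.numpy_function(_augment_np, [speech], tf.float32)
    speech.set_shape([None])  # numpy_function drops the static shape
    return speech, label

# dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
```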

On the other hand, your last thought on this comment also seems like a pretty good direction. If we do try to figure out that mapping (two conv blocks from the teacher = one conv block in the student, for example), I think we could introduce another bottleneck layer to help make that transfer learnable. A rough sketch of what that could look like is below.
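
A minimal sketch of that bottleneck idea; the layer name, dimensions, and the MSE matching term are all hypothetical, not something already in the codebase.

```python
import tensorflow as tf

class BottleneckProjection(tf.keras.layers.Layer):
    """Projects student hidden states into the teacher's feature space.

    Hypothetical helper: maps the output of one student conv block to the
    output of the corresponding pair of teacher conv blocks so a
    feature-matching term can be added to the distillation loss.
    """

    def __init__(self, teacher_dim, bottleneck_dim=128, **kwargs):
        super().__init__(**kwargs)
        self.down = tf.keras.layers.Dense(bottleneck_dim, activation="gelu")
        self.up = tf.keras.layers.Dense(teacher_dim)

    def call(self, student_hidden):
        return self.up(self.down(student_hidden))


def feature_matching_loss(teacher_hidden, student_hidden, projection):
    """MSE between teacher features and projected student features."""
    projected = projection(student_hidden)
    return tf.reduce_mean(tf.square(teacher_hidden - projected))
```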