thevasudevgupta / gsoc-wav2vec2

GSoC'2021 | TensorFlow implementation of Wav2Vec2

Home Page: https://thevasudevgupta.github.io/gsoc-wav2vec2/assets/final_report

Training Wav2Vec2 model on 100h & experiment-2

thevasudevgupta opened this issue · comments

@sayakpaul, sorry for the delay again. I have started serious experimentation now and will keep you posted with the results. I am starting with experiment-2 for now, as mentioned in vasudevgupta7/compressed-wav2vec2#1. I will post all the results in this issue by tomorrow (TPUs are running now!!)

| Experiment description | WER | Wandb |
| --- | --- | --- |
| wav2vec2-960h (Facebook version) | 3% | - |
| wav2vec2-960h (trained during GSoC) | 5.6% | - |
| wav2vec2-100h | 7.4% | https://wandb.ai/7vasudevgupta/gsoc-wav2vec2/runs/lwiepmm0 |
| wav2vec2-100h (skipped stage-1) | 8.2% | https://wandb.ai/7vasudevgupta/gsoc-wav2vec2/runs/h0bug1zp |
| wav2vec2-100h (train conv also) | 9.1% | https://wandb.ai/7vasudevgupta/gsoc-wav2vec2/runs/2iro0pl0, https://wandb.ai/7vasudevgupta/gsoc-wav2vec2/runs/284a713r |
| distilled wav2vec2-100h |  | https://wandb.ai/7vasudevgupta/wav2vec2-distillation/runs/2h82mhgc |

Evaluation script: https://colab.research.google.com/drive/1aNgochNmchx1R5TcoVH7nM0uPkmxNqE1?usp=sharing

Just wanted to ask one thing: is it fine if I code in my GSoC repository, or should I code in this private repo?

I think to keep it tidy we could use this repo, and once we have settled on something we could incorporate that into the GSoC repo. WDYT?

I will check out the results tomorrow and share my comments.

> I think to keep it tidy we could use this repo, and once we have settled on something we could incorporate that into the GSoC repo. WDYT?

Yeah! That would be good.

@vasudevgupta7 seems like the training is now done? The training progress (loss-wise) looks good to me.

Also just for my own reference, this is in regards to distilling the wav2vec2 model fine-tuned on speech recognition, correct?

Wanted to know a bit more about the student architectures. Could you provide brief overviews?

@sayakpaul,

The above experiments are just normal fine-tuning of wav2vec2 on 100h of LibriSpeech data. Since training on 960h takes a lot of time, I want to establish a baseline on a small amount of data so that further experiments can start from there. (We will definitely train on 960h data in the end; this is just to cut experimentation time for now, since the 100h model also gives a reasonable WER.)
Further, since the experiments involve two-stage training, I wanted to check whether we can follow only stage-1 for further experimentation.

I will post brief overviews for every training experiment (in the table) by tonight!

I am going to do distillation training today.

Got it. But didn't we have models fine-tuned on the LibriSpeech dataset (100h) already?

> Further, since the experiments involve two-stage training, I wanted to check whether we can follow only stage-1 for further experimentation.

By two-stage, do you mean training of both student and teacher models? In any case, I think when it's applicable we should be able to use the pre-trained (fine-tuned) models as teachers.

> I want to establish a baseline on a small amount of data so that further experiments can start from there.

Perfectly fine.

> Got it. But didn't we have models fine-tuned on the LibriSpeech dataset (100h) already?

No, I directly trained on 960h earlier.

> By two-stage, do you mean training of both student and teacher models? In any case, I think when it's applicable we should be able to use the pre-trained (fine-tuned) models as teachers.

By two stages, I mean this: #17 (comment)

Gotcha. Thank you.

Hello @sayakpaul, I trained the first distillation model yesterday. Unfortunately, it didn't perform well, though it is trying to learn (not all predicted tokens are random). I am trying to change the initialisation strategy and some hyperparameters to get it working.

- teacher: https://tfhub.dev/vasudevgupta7/wav2vec2-960h/1
- student: a smaller version of the same architecture
- loss: `alpha * KL-divergence loss + (1 - alpha) * CTC loss` (see the sketch below)
- script: https://github.com/vasudevgupta7/compressed-wav2vec2/blob/part_2/src/train_distilled.py
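
For reference, here is a minimal TensorFlow sketch of that weighted objective. This is not the code from the linked script; the tensor shapes and the `blank_index=0` choice are my assumptions.

```python
import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, labels,
                      logit_lengths, label_lengths, alpha=0.5):
    """alpha * KL(teacher || student) + (1 - alpha) * CTC loss.

    teacher_logits, student_logits: (batch, time, vocab)
    labels: (batch, max_label_len) dense int32 token ids
    logit_lengths, label_lengths: (batch,) valid lengths per example
    """
    # KL divergence between teacher and student output distributions,
    # averaged over time steps and the batch.
    teacher_probs = tf.nn.softmax(teacher_logits, axis=-1)
    student_log_probs = tf.nn.log_softmax(student_logits, axis=-1)
    kl = tf.reduce_sum(
        teacher_probs * (tf.math.log(teacher_probs + 1e-8) - student_log_probs),
        axis=-1,
    )
    kl_loss = tf.reduce_mean(kl)

    # Standard CTC loss against the ground-truth transcripts.
    ctc = tf.nn.ctc_loss(
        labels=labels,
        logits=student_logits,
        label_length=label_lengths,
        logit_length=logit_lengths,
        logits_time_major=False,
        blank_index=0,  # assumption: the pad/blank token id is 0
    )
    ctc_loss = tf.reduce_mean(ctc)

    return alpha * kl_loss + (1.0 - alpha) * ctc_loss
```

Setting `alpha=1.0` here would correspond to dropping the labeled (CTC) signal entirely.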

Are you training the student for longer? How's the training progress?

What happens if we only use KL-divergence and completely get rid of the labeled signal?

Currently only for 10 epochs (logs: https://wandb.ai/7vasudevgupta/wav2vec2-distillation/runs/2h82mhgc?workspace=user-7vasudevgupta). I need to play around with alpha. Will do these experiments today.

@vasudevgupta7 I get a 404 after clicking on the above-mentioned link.

I think we need an augmentation pipeline to regularize the student training so that it is able to match the underlying teacher. The FunMatch paper addresses this with an aggressive form of MixUp and a much longer training schedule to compensate for it (a rough waveform MixUp sketch is below).
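
Just to make the MixUp idea concrete for audio, a rough sketch of mixing raw waveforms could look like this. The function name and shapes are illustrative, and the returned coefficients/indices would also be needed to mix the teacher's soft targets in the same proportions.

```python
import tensorflow as tf

def mixup_waveforms(speech_batch, mixup_alpha=0.4):
    """Mix each raw waveform with a randomly chosen partner from the batch.

    speech_batch: float32 tensor of shape (batch, num_samples).
    Returns the mixed waveforms, the mixing coefficients, and the
    partner indices.
    """
    batch_shape = tf.shape(speech_batch)[:1]

    # lam ~ Beta(alpha, alpha), sampled via two Gamma draws.
    gamma1 = tf.random.gamma(batch_shape, mixup_alpha)
    gamma2 = tf.random.gamma(batch_shape, mixup_alpha)
    lam = gamma1 / (gamma1 + gamma2)   # shape (batch,)
    lam = tf.reshape(lam, (-1, 1))     # broadcast over audio samples

    # Pair every example with a shuffled partner from the same batch.
    indices = tf.random.shuffle(tf.range(tf.shape(speech_batch)[0]))
    partners = tf.gather(speech_batch, indices)

    mixed = lam * speech_batch + (1.0 - lam) * partners
    return mixed, lam, indices
```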

Translating that to speech is difficult, I agree, and this is where I believe we have opportunities. It might be worth taking a look at AugLy, an open-source framework providing augmentation transformations for different data modalities, including audio. This might help us curate an augmentation pipeline for our purpose (one way to wire it into the input pipeline is sketched below).
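
If we go the AugLy route, one way to hook a numpy-based augmentation into the existing tf.data input pipeline is via `tf.numpy_function`, so it runs on the host CPU before batches reach the TPU. The specific AugLy calls and their keyword arguments below are from memory and should be treated as assumptions to verify against the AugLy docs.

```python
import numpy as np
import tensorflow as tf
import augly.audio as audaugs  # pip install augly (plus its audio extras)

SAMPLE_RATE = 16000  # LibriSpeech audio is 16 kHz

def _augment_np(speech):
    """Randomly perturb one waveform (np.float32) on the host CPU.

    NOTE: the AugLy function/keyword names here are assumptions and
    should be checked against AugLy's docs before use.
    """
    if np.random.rand() < 0.5:
        speech, _ = audaugs.pitch_shift(speech, sample_rate=SAMPLE_RATE, n_steps=2.0)
    if np.random.rand() < 0.5:
        speech, _ = audaugs.add_background_noise(speech, sample_rate=SAMPLE_RATE, snr_level_db=15.0)
    return speech.astype(np.float32)

def augment(speech, label):
    """tf.data map fn wrapping the numpy augmentation."""
    speech = tf.numpy_function(_augment_np, [speech], tf.float32)
    speech.set_shape([None])  # numpy_function drops the static shape
    return speech, label

# dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
```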

On the other hand, your last thought on this comment also seems like a pretty good direction. If we do try to figure out that mapping (two conv blocks from the teacher = one conv block in the student, for example), I think we could introduce another bottleneck layer to help make that transfer learnable. A rough sketch of what that could look like is below.
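
A minimal sketch of that bottleneck idea; the layer name, dimensions, and the MSE matching term are all hypothetical, not something already in the codebase.

```python
import tensorflow as tf

class BottleneckProjection(tf.keras.layers.Layer):
    """Projects student hidden states into the teacher's feature space.

    Hypothetical helper: maps the output of one student conv block to the
    output of the corresponding pair of teacher conv blocks so a
    feature-matching term can be added to the distillation loss.
    """

    def __init__(self, teacher_dim, bottleneck_dim=128, **kwargs):
        super().__init__(**kwargs)
        self.down = tf.keras.layers.Dense(bottleneck_dim, activation="gelu")
        self.up = tf.keras.layers.Dense(teacher_dim)

    def call(self, student_hidden):
        return self.up(self.down(student_hidden))


def feature_matching_loss(teacher_hidden, student_hidden, projection):
    """MSE between teacher features and projected student features."""
    projected = projection(student_hidden)
    return tf.reduce_mean(tf.square(teacher_hidden - projected))
```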