thevasudevgupta / gsoc-wav2vec2

GSoC'2021 | TensorFlow implementation of Wav2Vec2

Home Page: https://thevasudevgupta.github.io/gsoc-wav2vec2/assets/final_report


Training results

sayakpaul opened this issue · comments

Thread for discussion on Training results ...

Hello @sayakpaul, @MorganR

I was able to fine-tune the model properly and we are getting around 5.6% WER (the official model gave around 3%). We are not very far off, and I will try some hparams tuning to get closer to 3% WER.

Predictions look really good 🤩🤩. Check out some:

{'prediction': 'IN DETERMINING WHETHER TWO OR MORE ALLIED FORMS OUGHT TO BE RANKED A SPECIES OR VARIETIES NATURALISTS ARE PRACTICALLY GUIDED BY THE FOLLOWING CONSIDERATIONS NAMELY THE AMOUNT OF DIFFERENCE BETWEEN THEM AND WHETHER SUCH DIFFERENCES RELATE TO FEW OR MANY POINTS OF STRUCTURE AND WHETHER THEY ARE PHYSIOLOGICAL IMPORTANCE BUT MORE ESPECIALLY WHETHER TH', 'label': 'IN DETERMINING WHETHER TWO OR MORE ALLIED FORMS OUGHT TO BE RANKED AS SPECIES OR VARIETIES NATURALISTS ARE PRACTICALLY GUIDED BY THE FOLLOWING CONSIDERATIONS NAMELY THE AMOUNT OF DIFFERENCE BETWEEN THEM AND WHETHER SUCH DIFFERENCES RELATE TO FEW OR MANY POINTS OF STRUCTURE AND WHETHER THEY ARE OF PHYSIOLOGICAL IMPORTANCE BUT MORE ESPECIALLY WHETHER THEY ARE CONSTANT'}

{'prediction': 'SHE CLOSED HER EYES AND TOOK A DEEP BREATH AS IF TO DRAW IN AGAIN THE FRAGRANCE OF THOSE DAYS', 'label': 'SHE CLOSED HER EYES AND TOOK A DEEP BREATH AS IF TO DRAW IN AGAIN THE FRAGRANCE OF THOSE DAYS'}

{'prediction': 'WESTMAR AND I WERE BACK AFTER THE FIRST ACT AND WE THOUGHT SHE SEEMED QUITE UNCERTAIN OF HER', 'label': 'WESTMERE AND I WERE BACK AFTER THE FIRST ACT AND WE THOUGHT SHE SEEMED QUITE UNCERTAIN OF HERSELF'}

{'prediction': "I REALLY DON'T THINK HE KNEW MUCH ABOUT IT MISTER HOLMES", 'label': "I REALLY DON'T THINK HE KNEW MUCH ABOUT IT MISTER HOLMES"}

{'prediction': 'FOUR OR FIVE OF THE LATTER ONLY LINGERED ABOUT THE DOOR OF THE PRISON OF UNCAS WARRY BUT CLOSE OBSERVERS OF THE MANNER OF THEIR CAPTIVE', 'label': 'FOUR OR FIVE OF THE LATTER ONLY LINGERED ABOUT THE DOOR OF THE PRISON OF UNCAS WARY BUT CLOSE OBSERVERS OF THE MANNER OF THEIR CAPTIVE'}

{'prediction': 'THEY SAY ILLUMINATION BY CANDALITE IS THE PRETTIEST IN THE WORLD', 'label': 'THEY SAY ILLUMINATION BY CANDLE LIGHT IS THE PRETTIEST IN THE WORLD'}

{'prediction': 'JUST SMELL THE WILD ROSES THEY ARE ALWAYS SO SPICY AFTER A RAIN A', 'label': 'JUST SMELL THE WILD ROSES THEY ARE ALWAYS SO SPICY AFTER A RAIN'}

{'prediction': 'HE WAS SUCH A BIG BOY THAT HE WORE HIGH BOOTS AND CARRIED A JACK KNIFE', 'label': 'HE WAS SUCH A BIG BOY THAT HE WORE HIGH BOOTS AND CARRIED A JACK KNIFE'}

{'prediction': "THROUGHOUT THE ENTIRE EVOLUTION OF CONSPICUOUS EXPENDITURE WHETHER OF GOODS OR OF SERVICES OR HUMAN LIFE RUNS THE OBVIOUS IMPLICATION THAT IN ORDER TO EFFECTUALLY MEND THE CONSUMER'S GOOD FAME IT MUST BE AN EXPENDITURE OF SUPERF", 'label': "THROUGHOUT THE ENTIRE EVOLUTION OF CONSPICUOUS EXPENDITURE WHETHER OF GOODS OR OF SERVICES OR HUMAN LIFE RUNS THE OBVIOUS IMPLICATION THAT IN ORDER TO EFFECTUALLY MEND THE CONSUMER'S GOOD FAME IT MUST BE AN EXPENDITURE OF SUPERFLUITIES"}

{'prediction': 'SOME IMAGES LIKE SOME SENSATIONS FEEL VERY FAMILIAR WHILE OTHERS FEEL STRANGE', 'label': 'SOME IMAGES LIKE SOME SENSATIONS FEEL VERY FAMILIAR WHILE OTHERS FEEL STRANGE'}

{'prediction': 'AND WHAT WAS THE SUBJECT OF THE POEM SAID THE PERSON WHO MADE THE REMARK', 'label': 'AND WHAT WAS THE SUBJECT OF THE POEM SAID THE PERSON WHO MADE THE REMARK'}

{'prediction': 'THEN THEY SPEAD IN GREAT HASTE FOR THE DOOR AND THE GOAT GAVE A FINAL BUTT THAT SENT THE ROW OF ROYAL LADIES ALL DIVING INTO THE CORRIDOR IN ANOTHER TANGLE WHEREUPON THEY SHRIEKED IN A MANNER THAT TERRIFIED EVERYONE WITHIN', 'label': 'THEN THEY SPED IN GREAT HASTE FOR THE DOOR AND THE GOAT GAVE A FINAL BUTT THAT SENT THE ROW OF ROYAL LADIES ALL DIVING INTO THE CORRIDOR IN ANOTHER TANGLE WHEREUPON THEY SHRIEKED IN A MANNER THAT TERRIFIED EVERYONE WITHIN SOUND OF THEIR VOICES'}
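For reference, WER here is the word-level edit distance between prediction and label, divided by the number of label words. A minimal, dependency-free sketch (the function name `wer` and the plain Levenshtein DP are illustrative, not the project's actual metric code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling 1-D dynamic-programming table for edit distance.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # deletion
                       d[j - 1] + 1,      # insertion
                       prev + (r != h))   # substitution (free if words match)
            prev = cur
    return d[-1] / len(ref)

# e.g. the "WILD ROSES" pair above: one inserted word out of 13 -> ~7.7% WER
print(wer("JUST SMELL THE WILD ROSES THEY ARE ALWAYS SO SPICY AFTER A RAIN",
          "JUST SMELL THE WILD ROSES THEY ARE ALWAYS SO SPICY AFTER A RAIN A"))
```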

This is great work @vasudevgupta7!

I will try some hparams tuning to get closer to 3% WER.

There might be a number of general reasons for this:

  • Layer defaults. PyTorch and TensorFlow/Keras layers have different default values.
  • Are we using the exact same hyperparameter values that were used in the original wav2vec2 fine-tuning for LibriSpeech?

However, I'd consider the difference to be a minor one at this stage. Also, this is the 960h variant of the dataset and model right?

  • Layer defaults. PyTorch and TensorFlow/Keras layers have different default values.

Since the model is initialized from pre-trained weights, I don't think this will matter much. Only the topmost dense layer is randomly initialized, and it has very few parameters compared to the rest of the model.

  • Are we using the exact same hyperparameter values that were used in the original wav2vec2 fine-tuning for LibriSpeech?

Most of the hparams are similar, but I didn't follow the exact recipe, since a few things differ in our model compared to the original (we don't have LayerDrop, and the batch size is most likely different, as they presumably trained on GPUs with more memory).

But I think the main reason for not matching their numbers could be the following:

The TF model was trained with static padding (since the TF graph works only with a constant seqlen), while the PyTorch model was possibly trained with dynamic padding. Dynamic padding is crucial for this model because it doesn't accept an attention mask and therefore attends to pad tokens as well. This is likely causing the slightly worse performance. It is very similar to what we discussed here: #14 (comment).

However, I'd consider the difference to be a minor one at this stage. Also, this is the 960h variant of the dataset and model right?

Yes, this was obtained by training on the 960h variant of the dataset. @sayakpaul, so I don't need to perform other training experiments, right (considering the above discussion as well)?

Dynamic padding is very crucial for this model as this model doesn't accept any attention mask and hence attends pad tokens as well.

This is indeed strange. An attention mask should not just bring timing improvements but should also help the model converge faster. This is something we can investigate more deeply after GSoC, and we can always publish a v2 of the models since TF Hub allows model versioning.

Also, since this model (the one you have fine-tuned) is better than the one you created a PR for, I would suggest keeping this model in the PR.

I think there is a little confusion between the checkpoints. Below is a summary table:

| checkpoint | WER (with no padding) | WER (with constant padding) |
|---|---|---|
| converted checkpoint (obtained by running `convert_torch_to_tensorflow.py`) | 3% | 6% |
| fine-tuned checkpoint (obtained by running `main.py`) | 5.6% | 6.7% |

So the 1st checkpoint is clearly better than the checkpoint I fine-tuned. I should export the 1st one, right?

I see. I think this is still confusing given your statement:

TF model was trained with static padding (as TF graph will work only with constant seqlen) while PyTorch model was possibly trained with dynamic padding.

I am under the impression that both models (going by the description "finetuned checkpoint (obtained by running main.py)") use static padding. In other words, I am not sure what their differences are.

In any case, could we do batched padding, i.e. instead of a fixed sequence length, could we pad each batch to the longest sequence present in that batch? Will this help?
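For concreteness, a minimal numpy sketch of this batched padding (the helper name `pad_batch` is hypothetical; in `tf.data` this idea corresponds to `Dataset.padded_batch`):

```python
import numpy as np

def pad_batch(batch):
    """Zero-pad every sequence to the longest length in this batch,
    rather than to one dataset-wide constant length."""
    max_len = max(len(seq) for seq in batch)
    return np.stack([np.pad(seq, (0, max_len - len(seq))) for seq in batch])

# Two utterances of 3 and 5 samples -> batch of shape (2, 5)
speech = [np.ones(3, dtype=np.float32), np.ones(5, dtype=np.float32)]
print(pad_batch(speech).shape)  # (2, 5)
```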

@sayakpaul, sorry for all the confusion again. Here is a detailed explanation of what I wanted to convey:

Discussion-1

First of all, we have 2 fine-tuned checkpoints:

  • One is just the TensorFlow equivalent of the PyTorch checkpoint, converted using my conversion script. It was originally trained by Facebook in PyTorch (possibly with dynamic padding, since PyTorch supports dynamic padding on GPUs), so the model was trained with variable sequence lengths. (calling this ckpt-1)
  • The other was fine-tuned by me using main.py. Since the TF model doesn't allow variable lengths on TPUs, I fixed the sequence length to 246000 and padded/truncated all sequences to this length. (calling this ckpt-2)
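That pad/truncate step can be sketched as follows (a minimal numpy illustration; the function name `pad_or_trim` is hypothetical, not the actual code in main.py):

```python
import numpy as np

AUDIO_MAX_LEN = 246000  # constant seqlen the TF graph is compiled with

def pad_or_trim(audio: np.ndarray, max_len: int = AUDIO_MAX_LEN) -> np.ndarray:
    """Force every raw-audio example to a constant length:
    truncate if too long, zero-pad at the end if too short."""
    if len(audio) >= max_len:
        return audio[:max_len]
    return np.pad(audio, (0, max_len - len(audio)))
```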

In the table in the comment above, the 1st row represents ckpt-1 and the 2nd row represents ckpt-2.

Discussion-2

Now that the model is trained as per discussion-1, we need to use it for evaluation. We again have 2 choices:

  • either fix the sequence length to some constant (say, 246000) and evaluate. (case-1)
  • or evaluate the model with variable sequence lengths (without touching the sequence lengths). (case-2)

We are evaluating the TF models (both ckpt-1 & ckpt-2) in the table in the comment above.

So when obtaining the SavedModel, we will have to fix sequences to some constant value (say, 246000) to be able to feed them to the SavedModel, so we essentially have to follow case-1. In the table in the comment above, the 3rd column represents case-1.

But case-1 shouldn't be used to report metrics, because with the same model, inferencing at batch size 1 with fully variable lengths (basically in TF eager mode), we can get a relatively lower WER (as reported in column-2).

Now ckpt-1 (i.e. the 1st row of the table) gives us lower WER in both case-1 and case-2, so I was planning to export it. Do you think that's fine?

Hoping everything is clear now.

In any case, could we do batched padding i.e. instead of a prefixed sequence length can we pad a batch with respect to the highest sequence length present in that batch? Will this help?

Unfortunately, we will have to specify a constant length when exporting the SavedModel (i.e. 246000) and evaluate at a 246000 seqlen only, since some operations in wav2vec2 don't allow the graph to be dynamic over the sequence-length dimension. So this approach won't work for the SavedModel.
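A sketch of what such a fixed-length export looks like, assuming any callable model (the wrapper class, the stand-in model, and the export path are illustrative, not the project's actual export code):

```python
import tensorflow as tf

AUDIO_MAX_LEN = 246000  # the constant seqlen baked into the exported graph

class ExportWrapper(tf.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    # The input signature fixes the sequence-length dimension; only the
    # batch dimension is left dynamic.
    @tf.function(input_signature=[tf.TensorSpec([None, AUDIO_MAX_LEN], tf.float32)])
    def serving(self, speech):
        return self.model(speech)

# Stand-in for the fine-tuned wav2vec2 model.
wrapper = ExportWrapper(lambda speech: tf.reduce_mean(speech, axis=-1))
tf.saved_model.save(wrapper, "exported_model", signatures=wrapper.serving)

# Callers must now pad every input to exactly AUDIO_MAX_LEN samples.
restored = tf.saved_model.load("exported_model")
out = restored.serving(tf.zeros([1, AUDIO_MAX_LEN]))
print(out.shape)  # (1,)
```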

Everything is clear now. We need to ensure that we have made extensive notes about these intricacies both inside the repository and the Hub model page so that the back and forth is as minimal as possible. WDYT?

Sure, will do that.

Closing this issue then.