SeanNaren / deepspeech.torch

Speech Recognition using DeepSpeech2 network and the CTC activation function.

Performance drop after loading a pretrained model.

maciejkorzepa opened this issue

I am training the model on the 1k dataset and decided to split the training into several runs of 10 epochs each (I need to share the cluster). By doing so I experienced a drop in WER between the last epoch of one run (WER < 15%) and the first epoch of the next run (WER > 17%), where I load the model saved at the end of the previous run. During the second run the WER got down to 12.8%, but when the third run was started, the WER after its first epoch was 14.4%. I tried disabling batch sorting in the first epoch when a pretrained model is loaded, but it didn't help. What could be the cause of this problem?
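For reference, a common way to carry a model across separate Torch runs is to serialize the network (and, if desired, the optimizer state) at the end of one run and load it at the start of the next. This is only a minimal sketch with hypothetical file and variable names, not the repo's actual checkpointing code:

```lua
require 'nn'

-- stand-in for the real DeepSpeech network
local model = nn.Sequential():add(nn.Linear(10, 5))
local sgdState = { learningRate = 1e-3, momentum = 0.9 }

-- end of run 1: save both the weights and the optimizer state
torch.save('checkpoint_epoch10.t7', { model = model:clearState(), sgdState = sgdState })

-- start of run 2: restore them so training continues where it left off
local checkpoint = torch.load('checkpoint_epoch10.t7')
local restoredModel = checkpoint.model
local restoredState = checkpoint.sgdState
print(restoredModel, restoredState.learningRate)
```

Whether learning-rate annealing and momentum buffers survive a restart depends on what is actually saved, which is worth checking when the WER jumps between runs.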

@maciejkorzepa Have you been able to get the WER down to 15%? If so, do you mind sharing your weights? My current desktop has weaker GPUs, so I have only been able to get as low as 38% after a long time training.

@maciejkorzepa thank you! Sorry, but do you also mind sharing the network architecture that you used? I assume you used 1280 hidden nodes, 7 layers, and a traditional bidirectional RNN?

I didn't make a single change in the architecture, so I am using the default sizes.

Been on a bit of a hiatus but am back now! @maciejkorzepa thanks so much for this, I'm going to download the model and check it. Did you just do the LibriSpeech setup and then run th Train.lua without modifying parameters? Did you run into any inf costs? What hardware did you run this on?

Actually, I changed some parameters and code. I used everything apart from test-clean and test-other for training. The 12% error rate was achieved on test-clean. I got -inf costs in almost every epoch, but as the WER kept going down, I didn't think it was a big problem. As for the parameters, maxNorm was set to 100. I ran on 2xK80 with batch size 64. One epoch took almost 8 hours. The change I made in the code was decreasing the batch size for the last few hundred utterances, as some of them are considerably longer than the rest. So instead of setting a batch size of ~30-40 for the whole training, which would slow it down, I just switched to a smaller batch size (24) at the end of an epoch (this is only needed for the first epoch, when the data is sorted by length). In the DS2 paper, Baidu mention how they handle long utterances that lead to out-of-memory errors:

and sometimes very deep networks can exceed the GPU memory capacity when processing long utterances. This can happen unpredictably, especially when the distribution of utterance lengths includes outliers, and it is desirable to avoid a catastrophic failure when this occurs. When a requested memory allocation exceeds available GPU memory, we allocate page-locked GPU-memory-mapped CPU memory using cudaMallocHost instead. This memory can be accessed directly by the GPU by forwarding individual memory transactions over PCIe at reduced bandwidth, and it allows a model to continue to make progress even after encountering an outlier.

This solution allows using a much bigger batch size and thus speeds up training. I am wondering whether implementing this in your model would be feasible.
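Exposing Baidu's cudaMallocHost fallback would likely require changes inside cutorch's allocator, but a cruder workaround can be sketched at the Lua level: catch the out-of-memory error raised during forward/backward and let the caller retry that batch in smaller pieces. Purely illustrative, with hypothetical names:

```lua
-- Sketch: returns true on success, false if the batch should be split and
-- retried because the GPU ran out of memory.
local function tryBatch(model, criterion, inputs, targets)
    local ok, err = pcall(function()
        local output = model:forward(inputs)
        criterion:forward(output, targets)
        model:backward(inputs, criterion:backward(output, targets))
    end)
    if ok then return true end

    if tostring(err):find('out of memory') then
        collectgarbage()   -- free tensors held by the failed attempt
        return false       -- caller halves the batch and tries again
    end
    error(err)             -- re-raise anything that is not an OOM
end
```

This is not equivalent to the paper's pinned-memory fallback (which keeps the full batch), but it avoids losing an 8-hour epoch to a single outlier utterance.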

On the other hand, I found that increasing the batch size is not desirable when it comes to WER. The smaller the batch size, the noisier the gradients, which helps with getting out of local minima. I did some tests on LibriSpeech 100h and, as far as I remember, for batch size 75 the WER stabilized at 58%, for batch size 40 at 52%, and for batch size 12 at 42%. I didn't try decreasing the batch size with 1000h, as 8h per epoch was already long enough for me :)

@maciejkorzepa you're awesome thanks so much for this!

Honestly, the issue here is that Torch isn't as memory-efficient (especially the RNNs) as Baidu's internal code. But as you said, it's still trainable; it just takes forever.

I'll download the model, do some checks, and update the documentation. Are you fine with me using this as the pre-trained network for LibriSpeech? I do have an LSTM-based network training which is much smaller, but it's hovering around ~20 WER; a WER of 12 is awesome!

@SeanNaren Sure, go ahead! I think I might be able to use 4xK80 for the training soon; I might then try reducing the batch size and see if the WER can get any lower...

@maciejkorzepa Sorry, I have a question. I was trying to load your weights, but they scored a 99% WER on test-clean. Are there specific parameters that you used with Test.lua?

@nn-learner Maybe your input spectrograms were processed with different parameters? I used:
-windowSize 0.02 -stride 0.01 -sampleRate 16000 -processes 8 -audioExtension flac
I actually didn't manage to run Test.lua due to some error (I don't remember what it was exactly), but my project group tried to run Predict.lua with some samples from test-clean and most of the transcriptions were perfect; only a few had some very minor errors (e.g. 'I have' instead of 'I've'), so I assumed that the ~12% WER calculated during validation in Train.lua was realistic.
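For reference, those flags map onto the spectrogram computation roughly like this (a sketch assuming the lua---audio package the repo builds on; the file name is just a placeholder):

```lua
require 'audio'

local sampleRate = 16000
local windowSize = 0.02 * sampleRate   -- 0.02 s at 16 kHz = 320 samples
local stride     = 0.01 * sampleRate   -- 0.01 s at 16 kHz = 160 samples

local signal = audio.load('sample.flac')
local spect = audio.spectrogram(signal, windowSize, 'hamming', stride)
print(spect:size())
```

If the test-time spectrograms are computed with a different window, stride, or sample rate, the inputs no longer match what the network was trained on, which can easily produce a 99% WER.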

@maciejkorzepa On what basis did you choose 100 for maxNorm? I went through the paper behind maxNorm (http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf). They mention taking the average norm across many updates and choosing half to ten times that value. So I just wanted to know: did you take the average norm over one complete epoch or over multiple epochs and then settle on 100 as maxNorm (though multiple epochs does not make much sense, because the weights would already be tuned to the data and the norm would be smaller), or did you try different values and 100 worked out?

@suhaspillai To be honest, I set it to 100 after reading posts from issue #51 and I haven't tried other values since then.
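For context, max-norm gradient clipping in a Torch training loop usually amounts to rescaling the flattened gradient vector whenever its L2 norm exceeds the threshold (100 here). A minimal self-contained sketch, not the repo's exact code:

```lua
require 'nn'

local model = nn.Sequential():add(nn.Linear(10, 5))   -- stand-in network
local params, gradParams = model:getParameters()
local maxNorm = 100

gradParams:normal(0, 50)                -- fake gradients just for the example
local norm = gradParams:norm()
if norm > maxNorm then
    gradParams:mul(maxNorm / norm)      -- scale gradients back down to maxNorm
end
print(gradParams:norm())                -- now <= maxNorm
```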

Okay, thanks.

If I am eventually trying to get this working with live audio recordings, do you think it's a good idea to add noise to the data and train it again?

@shantanudev It is a good idea to insert noise into your training data and train on that; however, the repo doesn't currently support this.
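In case it helps, simple additive-noise augmentation on the raw waveform (before the spectrogram step) could look something like the sketch below. This is a hypothetical helper, not something the repo ships:

```lua
require 'torch'

-- add Gaussian noise scaled relative to the clip's standard deviation
local function addNoise(signal, noiseLevel)
    local noise = signal:clone():normal(0, noiseLevel * signal:std())
    return signal + noise
end

-- usage on a fake 1-second, 16 kHz clip
local clip = torch.randn(16000)
local noisy = addNoise(clip, 0.1)
print(noisy:std())
```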