SeanNaren / deepspeech.torch

Speech Recognition using the DeepSpeech2 network and the CTC loss function.

I get the same training error every epoch

byuns9334 opened this issue

I'm running "th Train.lua -epochSave -learningRateAnnealing 1.1 -trainingSetLMDBPath prepare_datasets/libri_lmdb/train/ -validationSetLMDBPath prepare_datasets/libri_lmdb/test/ -LSTM -hiddenSize 500 -permuteBatch" on the LibriSpeech dataset, but I get the same training error every epoch, while the loss keeps decreasing.

Here's what I get:

Number of parameters: 31576697
[==================== 136/136 ================>] Tot: 1m13s | Step: 646ms
Training Epoch: 1 Average Loss: nan Average Validation WER: 100.09 Average Validation CER: 62.14
Saving model..
[==================== 136/136 ================>] Tot: 1m18s | Step: 566ms
Training Epoch: 2 Average Loss: 7047724391721807312664730917666816.000000 Average Validation WER: 100.05 Average Validation CER: 61.98
Saving model..
[==================== 136/136 ================>] Tot: 1m18s | Step: 588ms
Training Epoch: 3 Average Loss: 3568794773768703940829837988462592.000000 Average Validation WER: 100.05 Average Validation CER: 62.00
Saving model..
[==================== 136/136 ================>] Tot: 1m19s | Step: 555ms
Training Epoch: 4 Average Loss: nan Average Validation WER: 100.05 Average Validation CER: 62.03
Saving model..

How should I resolve this?

Something definitely looks wrong with the loss... could you run the tests in the warp-ctc repo for Torch and make sure the values are not zeros or infs?
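
A quick way to check those values directly is a minimal sketch based on the cpu_ctc example in warp-ctc's Torch binding documentation: all-zero activations over five classes for one time step should give a cost of -log(1/5) ≈ 1.6094.

require 'warp_ctc'

-- One time step, five classes, all-zero (uniform) activations, and a
-- single target label. The expected cost is -log(1/5) ~= 1.6094.
local acts = torch.Tensor({{0, 0, 0, 0, 0}}):float()
local grads = torch.zeros(acts:size()):float()
local labels = {{1}}
local sizes = {1}

local costs = cpu_ctc(acts, grads, labels, sizes)
print(costs[1])  -- nan, inf, or 0 here points at a broken warp-ctc build

If this prints nan, inf, or 0 instead of roughly 1.6094, the warp-ctc installation itself is broken and should be rebuilt before retraining.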

@SeanNaren Which command should I run to execute the tests in the warp-ctc repository?
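
For reference, assuming a standard baidu-research/warp-ctc checkout built with CMake, the tests are typically invoked along these lines (the paths below are assumptions based on the repo's usual layout, not taken from this thread):

cd warp-ctc/build
./test_cpu
./test_gpu                        # only present in a CUDA-enabled build
cd ..
th torch_binding/tests/test.lua   # Torch binding tests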