baidu-research / ba-dls-deepspeech

Loss value becomes NaN after 7 epochs during training

faruk-ahmad opened this issue · comments

We are using this implementation to train our own model. We preprocessed the dataset with the provided scripts, but after 7 epochs of training the loss becomes 'nan'. What could be the possible cause?
Here is the last part of the training log file:
2017-08-14 17:43:18,105 INFO (main) Epoch: 6, Iteration: 50, Loss: 79.07887268066406
2017-08-14 17:44:07,976 INFO (data_generator) Iters: 6
2017-08-14 17:44:25,415 INFO (utils) Checkpointing model to: ./model/
2017-08-14 17:44:25,805 INFO (data_generator) Iters: 54
2017-08-14 17:44:37,734 INFO (main) Epoch: 7, Iteration: 0, Loss: 86.4606704711914
2017-08-14 17:47:57,033 INFO (main) Epoch: 7, Iteration: 10, Loss: 79.75791931152344
2017-08-14 17:51:46,978 INFO (main) Epoch: 7, Iteration: 20, Loss: 81.86383819580078
2017-08-14 17:56:00,494 INFO (main) Epoch: 7, Iteration: 30, Loss: 83.92363739013672
2017-08-14 18:00:55,395 INFO (main) Epoch: 7, Iteration: 40, Loss: 71.31178283691406
2017-08-14 18:06:30,210 INFO (main) Epoch: 7, Iteration: 50, Loss: 85.3790054321289
2017-08-14 18:08:03,423 INFO (data_generator) Iters: 6
2017-08-14 18:08:27,113 INFO (utils) Checkpointing model to: ./model/
2017-08-14 18:08:27,578 INFO (data_generator) Iters: 54
2017-08-14 18:08:57,878 INFO (main) Epoch: 8, Iteration: 0, Loss: 61.189476013183594
2017-08-14 18:14:00,523 INFO (main) Epoch: 8, Iteration: 10, Loss: 98.21914672851562
2017-08-14 18:18:31,384 INFO (main) Epoch: 8, Iteration: 20, Loss: 84.95768819580078
2017-08-14 18:23:51,395 INFO (main) Epoch: 8, Iteration: 30, Loss: nan

N.B. We are training on a CPU machine (Core i5, 32 GB RAM).

Any help would be appreciated.
Thanks in advance.
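
For context on debugging this: in CTC-based speech training, a loss that suddenly turns NaN after several stable epochs is commonly caused by an exploding gradient on a single bad batch (or by utterances whose output time steps are shorter than their transcripts). The sketch below is a hypothetical illustration, not this repository's actual training loop: it assumes a Keras-style optimizer and shows gradient-norm clipping plus a simple guard for non-finite losses. The learning rate and clip values are illustrative assumptions, not tuned for this repo.

```python
# Hypothetical sketch, not the training code from this repository.
import numpy as np
from keras.optimizers import SGD

# Assumption: a Keras-style optimizer is used. clipnorm caps the L2 norm of
# each gradient, which keeps one bad batch from pushing the weights to NaN.
# The numeric values here are illustrative only.
optimizer = SGD(lr=1e-4, momentum=0.9, nesterov=True, clipnorm=100.0)

def loss_is_finite(loss_value):
    """Return False for NaN/Inf losses so the caller can skip that update."""
    return bool(np.isfinite(loss_value))
```

Lowering the learning rate once the loss plateaus, or skipping updates whose loss is non-finite, are other common mitigations worth trying.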

commented

I have the same problem. Do you know how to solve it?
Thanks!