SeanNaren / deepspeech.torch

Speech Recognition using DeepSpeech2 network and the CTC loss function.

Out of memory issue when running Train.lua

byuns9334 opened this issue

(I have already read issue #60.)
I am also trying to train on the LibriSpeech dataset, but when I run the command shown at the top of the log below, I get this error:

th Train.lua -batchSize 7 -epochSave -learningRateAnnealing 1.1 -trainingSetLMDBPath prepare_datasets/libri_lmdb/train/ -validationSetLMDBPath prepare_datasets/libri_lmdb/test/
Number of parameters: 108028317
[======================================== 387/387 ====================================>] Tot: 3m7s | Step: 847ms
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-2489/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/byuns9334/torch/install/bin/luajit: ...e/byuns9334/torch/install/share/lua/5.1/nn/Container.lua:67:
In 2 module of nn.Sequential:
In 1 module of nn.Sequential:
In 5 module of cudnn.BatchBRNNReLU:
/home/byuns9334/torch/install/share/lua/5.1/cudnn/init.lua:265: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-2489/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
[C]: in function 'resize'
/home/byuns9334/torch/install/share/lua/5.1/cudnn/init.lua:265: in function 'allocateStorage'
/home/byuns9334/torch/install/share/lua/5.1/cudnn/init.lua:324: in function 'setSharedWorkspaceSize'
/home/byuns9334/torch/install/share/lua/5.1/cudnn/RNN.lua:537: in function </home/byuns9334/torch/install/share/lua/5.1/cudnn/RNN.lua:404>
[C]: in function 'xpcall'
...e/byuns9334/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
.../byuns9334/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function <.../byuns9334/torch/install/share/lua/5.1/nn/Sequential.lua:41>
[C]: in function 'xpcall'
...e/byuns9334/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
.../byuns9334/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function <.../byuns9334/torch/install/share/lua/5.1/nn/Sequential.lua:41>
[C]: in function 'xpcall'
...e/byuns9334/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
.../byuns9334/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./ModelEvaluator.lua:70: in function 'runEvaluation'
./Network.lua:75: in function 'testNetwork'
./Network.lua:168: in function 'trainNetwork'
Train.lua:43: in main chunk
[C]: in function 'dofile'
...9334/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
...e/byuns9334/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
.../byuns9334/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./ModelEvaluator.lua:70: in function 'runEvaluation'
./Network.lua:75: in function 'testNetwork'
./Network.lua:168: in function 'trainNetwork'
Train.lua:43: in main chunk
[C]: in function 'dofile'
...9334/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

I think a mini-batch size of 7 is already small enough, so why do I still get this OOM error? Any ideas on how to fix it?

I still get the error with a mini-batch size of 1.
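
For context when reading the log: the progress bar reaches 387/387, and the failing call chain is trainNetwork -> testNetwork -> runEvaluation -> forward, so the allocation appears to fail during the evaluation pass rather than during a training step. The trace also shows the failing allocation happening in cudnn's shared RNN workspace, not in the weights. As a rough sanity check, here is a back-of-the-envelope estimate of how much memory the parameters themselves need. This is only a sketch: it assumes 32-bit floats (4 bytes per value) and one gradient buffer the same size as the weights, and it ignores activations, cudnn workspace, and optimizer state, all of which grow with utterance length.

```python
# Rough estimate of GPU memory taken by the model parameters alone.
# Assumption: parameters are stored as 32-bit floats (4 bytes each).
NUM_PARAMS = 108_028_317  # "Number of parameters" line from the log
BYTES_PER_FLOAT32 = 4


def tensor_mib(num_elements, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the size in MiB of a dense tensor with num_elements values."""
    return num_elements * bytes_per_element / 2**20


weights_mib = tensor_mib(NUM_PARAMS)
# Assumption: one gradient buffer with the same shape as the weights.
weights_and_grads_mib = 2 * weights_mib

print(f"weights only:        {weights_mib:.1f} MiB")
print(f"weights + gradients: {weights_and_grads_mib:.1f} MiB")
```

Under these assumptions the parameters and gradients together stay below 1 GiB, so the remaining memory pressure would have to come from activations and the RNN workspace. That may be why shrinking the training batch size alone does not help, and it may be worth checking how long the longest utterances in the validation set are.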