Error in CuDNN: CUDNN_STATUS_BAD_PARAM (cudnnGetConvolutionNdForwardOutputDim)

Question

dahlem opened this issue 7 years ago

I'd like to run deepspeech.torch on a training dataset of 1M wav files using an AWS p2.8xlarge instance, and I'm running into the stack trace below. I installed torch and deepspeech.torch according to the installation instructions.

I run the training as follows:
th Train.lua -epochSave \
-learningRateAnnealing 1.1 \
-trainingSetLMDBPath data_lmdb/train/ \
-validationSetLMDBPath data_lmdb/test/ \
-nGPU 8 \
-logsTrainPath logs/deepspeech-big/TrainingLoss/ \
-logsValidationPath logs/deepspeech-big/ValidationScores/ \
-modelTrainingPath models/deepspeech-big/ \
-epochs 500 \
-learningRate 0.01 \
-maxNorm 20 \
-momentum 0.9 \
-batchSize 32 \
-validationBatchSize 32 \
-permuteBatch

I have no problem with the 1000 hours of LibriSpeech data.

Any help is greatly appreciated.
Dominik

luajit: ...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
In 4 module of nn.Sequential:
/home/ubuntu/torch/install/share/lua/5.1/cudnn/init.lua:162: Error in CuDNN: CUDNN_STATUS_BAD_PARAM (cudnnGetConvolutionNdForwardOutputDim)
stack traceback:
[C]: in function 'error'
/home/ubuntu/torch/install/share/lua/5.1/cudnn/init.lua:162: in function 'errcheck'
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:140: in function 'createIODescriptors'
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:188: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:186>
[C]: in function 'xpcall'
/home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:41>
[C]: in function 'xpcall'
/home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:41>
[C]: in function 'xpcall'
...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
/home/ubuntu/torch/install/share/lua/5.1/threads/queue.lua:65: in function </home/ubuntu/torch/install/share/lua/5.1/threads/queue.lua:41>
[C]: in function 'pcall'
/home/ubuntu/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
[string " local Queue = require 'threads.queue'..."]:13: in main chunk

Sean Naren · Answer 1 · Thu Apr 20 2017 01:50:03 GMT+0800 (China Standard Time)

How long are the wav files (in seconds)? I wonder if it's running out of memory. Could you monitor the CUDA memory usage whilst training?

Suhas · Answer 2 · Thu Apr 20 2017 05:32:12 GMT+0800 (China Standard Time)

I think it is because some of your wav files are too short (in seconds). Check the number of time steps after the 1st convolution operation: if it is less than the filter width of the 2nd convolution, it will throw an error. At least, this was the case for me.
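
A rough sketch of that check, assuming the standard output-length formula for a convolution without dilation, out = floor((in + 2*pad - kernel) / stride) + 1. The function names and the (kernel, stride, padding) values are placeholders for illustration, not taken from the deepspeech.torch model definition, so substitute the time-axis values from your own network.

```python
def conv_output_length(input_length, kernel, stride=1, padding=0):
    """Output length along one axis for a convolution without dilation."""
    return (input_length + 2 * padding - kernel) // stride + 1


def survives_two_convs(time_steps, conv1, conv2):
    """Return True if the input is long enough for both convolutions.

    conv1 and conv2 are (kernel, stride, padding) tuples along the time axis.
    """
    k1, s1, p1 = conv1
    k2, s2, p2 = conv2
    # The first filter must fit inside the (padded) input.
    if time_steps + 2 * p1 < k1:
        return False
    after_first = conv_output_length(time_steps, k1, s1, p1)
    # The second filter must fit inside the (padded) output of the first.
    return after_first + 2 * p2 >= k2


# Hypothetical time-axis parameters (kernel, stride, padding); replace them
# with the values from your own model definition.
CONV1 = (11, 2, 0)
CONV2 = (11, 2, 0)

for frames in (12, 30, 31, 100):
    print(frames, survives_two_convs(frames, CONV1, CONV2))
```

With these placeholder values, the shortest input that passes both layers is 31 time steps; anything shorter leaves fewer than 11 steps for the second filter.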

Dominik Dahlem · Answer 3 · Fri Apr 21 2017 04:06:26 GMT+0800 (China Standard Time)

@SeanNaren, the wav files are between 0.12 and 40 seconds long. Memory did not seem to be the issue. I looked into @suhaspillai's suggestion and cut out the wav files that were too short, and training is working now.

Thank you.

Sean Naren · Answer 4 · Fri Apr 21 2017 04:38:53 GMT+0800 (China Standard Time)

Just to add to this: because of the convolutions, I think the minimum length a clip can be is 0.5 seconds. I'd strongly suggest cutting out anything shorter than 1 second, though.
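
One way to apply that cutoff before building the dataset is sketched below, using only Python's standard `wave` module. It assumes uncompressed PCM wav files; the function names and the 1.0 second default are illustrative, not part of deepspeech.torch.

```python
import wave


def wav_duration_seconds(path):
    """Duration of an uncompressed PCM wav file, computed from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def split_by_min_duration(paths, min_seconds=1.0):
    """Partition paths into (kept, dropped) lists by clip length."""
    kept, dropped = [], []
    for path in paths:
        if wav_duration_seconds(path) >= min_seconds:
            kept.append(path)
        else:
            dropped.append(path)
    return kept, dropped
```

Passing the full list of wav paths returns the clips to keep and the ones to drop, so the dropped list can be reviewed before anything is deleted.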