baidu-research / ba-dls-deepspeech

MemoryError: Error allocating bytes of device memory (CNMEM_STATUS_OUT_OF_MEMORY).

Feynman27 opened this issue

Hi,

I'm running this model from a Docker image on an Amazon EC2 instance with a GRID K520 GPU. The data is the "clean" set from http://www.openslr.org/12/. Everything seems to work fine until iteration 60, after which I hit a MemoryError that I don't quite understand. The trace is provided below:

root:~/ba-dls-deepspeech# python train.py train_corpus.json validation_corpus.json ./models
Using Theano backend.
Using gpu device 0: GRID K520 (CNMeM is enabled with initial size: 10.0% of memory, cuDNN 4008)
/root/.local/lib/python2.7/site-packages/Theano-0.8.2-py2.7.egg/theano/tensor/signal/downsample.py:6: UserWarning: downsample module has been moved to the theano.tensor.signal.pool module.
  "downsample module has been moved to the theano.tensor.signal.pool module.")
2016-10-17 19:37:21,963 INFO    (data_generator) Reading description file: train_corpus.json for partition: train
2016-10-17 19:37:22,218 INFO    (data_generator) Reading description file: validation_corpus.json for partition: validation
2016-10-17 19:37:23,500 INFO    (model) Building gru model
2016-10-17 19:37:29,431 INFO    (model) Building train_fn
2016-10-17 19:38:10,702 INFO    (model) Building val_fn
2016-10-17 19:38:16,784 INFO    (data_generator) Iters: 326
2016-10-17 19:38:17,977 INFO    (__main__) Epoch: 0, Iteration: 0, Loss: 276.16986084
2016-10-17 19:38:30,168 INFO    (__main__) Epoch: 0, Iteration: 10, Loss: 164.771255493
2016-10-17 19:38:43,822 INFO    (__main__) Epoch: 0, Iteration: 20, Loss: 157.695297241
2016-10-17 19:38:58,698 INFO    (__main__) Epoch: 0, Iteration: 30, Loss: 155.174377441
2016-10-17 19:39:14,786 INFO    (__main__) Epoch: 0, Iteration: 40, Loss: 156.074935913
2016-10-17 19:39:32,075 INFO    (__main__) Epoch: 0, Iteration: 50, Loss: 164.412872314
2016-10-17 19:39:50,669 INFO    (__main__) Epoch: 0, Iteration: 60, Loss: 168.667358398
Traceback (most recent call last):
  File "train.py", line 155, in <module>
    args.sortagrad)
  File "train.py", line 133, in main
    do_sortagrad=sortagrad)
  File "train.py", line 89, in train
    label_lengths, True])
  File "/root/.local/lib/python2.7/site-packages/Keras-1.1.0-py2.7.egg/keras/backend/theano_backend.py", line 717, in __call__
    return self.function(*inputs)
  File "/root/.local/lib/python2.7/site-packages/Theano-0.8.2-py2.7.egg/theano/compile/function_module.py", line 871, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/root/.local/lib/python2.7/site-packages/Theano-0.8.2-py2.7.egg/theano/gof/link.py", line 314, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/root/.local/lib/python2.7/site-packages/Theano-0.8.2-py2.7.egg/theano/compile/function_module.py", line 859, in __call__
    outputs = self.fn()
MemoryError: Error allocating 37632000 bytes of device memory (CNMEM_STATUS_OUT_OF_MEMORY).
Apply node that caused the error: GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[[ 0.]]]}, Elemwise{switch,no_inplace}.0, Elemwise{Composite{(i0 // (i1 * i2))}}.0, TensorConstant{3000})
Toposort index: 325
Inputs types: [CudaNdarrayType(float32, (True, True, True)), TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar)]
Inputs shapes: [(1, 1, 1), (), (), ()]
Inputs strides: [(0, 0, 0), (), (), ()]
Inputs values: [CudaNdarray([[[ 0.]]]), array(196), array(16), array(3000)]
Outputs clients: [[GpuIncSubtensor{Inc;:int64:}(GpuAlloc{memset_0=True}.0, GpuSubtensor{::int64}.0, ScalarFromTensor.0), GpuIncSubtensor{InplaceInc;int64::}(GpuAlloc{memset_0=True}.0, GpuIncSubtensor{Inc;:int64:}.0, Constant{0})]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
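
(For what it's worth, the flags those hints mention can be set through the THEANO_FLAGS environment variable; a minimal sketch of that pattern, assuming stock Theano 0.8, is below. Theano only reads the variable the first time it is imported, so it has to be set before the import, or exported in the shell before launching train.py.)

# Minimal sketch of applying the hints above: set THEANO_FLAGS before
# Theano is first imported so the flags take effect for this process.
import os
os.environ['THEANO_FLAGS'] = 'optimizer=fast_compile,exception_verbosity=high'

import theano  # flags are read here
print(theano.config.optimizer)            # -> fast_compile
print(theano.config.exception_verbosity)  # -> high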

This usually means that the GPU wasn't able to allocate enough memory to fit the mini-batch.
It was fine for the first few iterations because we sort the utterances by duration in the first epoch, so the early mini-batches contain shorter utterances and are smaller overall. You can either decrease the mini-batch size or use a different GPU with more memory (I'm not sure how the K520 manages its memory; it has two 4 GB memory areas).
If that doesn't help, also try reducing the number of layers or the number of neurons in each layer. Most of the memory requirement comes from storing the activations through time.
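
For example, if you read the three scalar inputs of the failing GpuAlloc node in your trace (196, 16 and 3000) as time steps, batch size and layer width, a float32 buffer of that shape works out to exactly the 37632000 bytes in the error message; that reading is a guess, but the arithmetic lines up:

# Back-of-the-envelope check of the failing allocation. Interpreting the
# GpuAlloc inputs as (time_steps, batch_size, width) is an assumption;
# the numbers themselves come from the traceback.
time_steps, batch_size, width = 196, 16, 3000
bytes_per_float32 = 4

alloc_bytes = time_steps * batch_size * width * bytes_per_float32
print(alloc_bytes)  # 37632000, the figure in the MemoryError

# Halving the mini-batch roughly halves this per-node buffer, which is why
# a smaller batch (or a narrower layer) relieves the pressure.
print(time_steps * (batch_size // 2) * width * bytes_per_float32)  # 18816000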

Thanks, this makes sense. Reducing the batch size seems to be working.

commented

This happened to me on 4 GB and 8 GB GPUs. Disabling sorting by duration solved the issue; otherwise memory is exhausted during the first epoch. I'm curious to know the reason.
On the sorted dataset, when I disabled mini-batch shuffling, it saturated memory again.
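
A toy illustration of why batch composition matters so much (the durations, frame rate and feature count below are made up, not read from this repo): with zero-padding, a mini-batch's tensor is sized by its longest utterance, so a duration-sorted batch of similar-length clips is far cheaper than a shuffled batch that mixes one long clip with short ones.

# Toy illustration with made-up numbers: the padded batch is sized by the
# longest utterance it contains, so batch composition, not just batch size,
# drives peak memory.
def padded_elements(durations_sec, frames_per_sec=100, features=161):
    # frames_per_sec and features are illustrative defaults, not the
    # repo's actual front-end settings.
    longest = max(durations_sec)
    return len(durations_sec) * int(round(longest * frames_per_sec)) * features

similar = [2.0, 2.1, 2.1, 2.2]    # duration-sorted batch: little padding
mixed   = [2.0, 2.1, 2.1, 14.8]   # one long clip pads the whole batch
print(padded_elements(similar))   # 141680 elements
print(padded_elements(mixed))     # 953120 elements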