udibr / headlines

Automatically generate headlines to short articles

Running OOM in cell 30 when using GPU

xtr33me opened this issue · comments

First off, thanks for sharing this. NLP is really cool but can be overwhelming, and this project has helped me immensely in understanding it better. Unfortunately, troubleshooting these models and implementations seems to be almost an art in itself.

I am currently using TensorFlow, and I wrote a scraper to pull data from BuzzFeed to act as my training set, since the Reuters data didn't seem to be enough given the 40k-word GloVe vocabulary. When I was running against the CPU everything worked fine, but in about 7 hours I only made it through about 4 iterations out of 500, so I attempted to switch over to the GPU. I am running on a Mac with an NVIDIA GT 650M and 1 GB of VRAM. I'm aware this isn't the best hardware, but I believe it should still be doable, right?

When I run the train.py file (converted from the ipynb), I get the OOM error below. I know you have been using Theano, so if you aren't sure then just disregard. However, if you know how I might be able to overcome this issue I'd love to hear it. It seems to error out around cell 30. What baffles me is that it reports total memory as 1023.69 MiB but free memory as only 49.59 MiB, and each time I run it the free amount drops further. I saw on the TensorFlow forums that TensorFlow allocates all GPU memory up front and that you can't really tell how much is free because it is managed internally, so maybe this is nothing. I was trying to figure out how to flush the GPU memory, but I haven't had any luck with that either; even after restarting the Mac, I still see about the same numbers.
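In case it's relevant, this is the kind of session setup I've seen suggested for stopping TensorFlow from grabbing all of the GPU memory up front (just a sketch, assuming the Keras TensorFlow backend exposes `set_session`; none of these names come from the notebook itself):

```python
# Sketch: limit TensorFlow's upfront GPU allocation before building the Keras model.
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory as needed instead of all at once
# config.gpu_options.per_process_gpu_memory_fraction = 0.8  # or cap at a fraction of total
set_session(tf.Session(config=config))
```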

I have tried adjusting the training sample size and some of the other variables to see if I could get it to run through even once, but it still crashes with an OOM error. The next thing I am going to try is getting the TF summarizer working, so perhaps I can get a bit more insight via TensorBoard. I do have a much better GPU on my PC, but I would need to set this all up in a Docker VM or something if I take that approach. If it works, though, I'll be happy.
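For the TensorBoard attempt, my rough plan is something like the following (a sketch using `keras.callbacks.TensorBoard`; `model`, `X_train`, and `Y_train` are placeholders for whatever the notebook actually names them):

```python
# Sketch: log training metrics so they can be inspected with TensorBoard.
from keras.callbacks import TensorBoard

tb = TensorBoard(log_dir='./logs', histogram_freq=1)
model.fit(X_train, Y_train, batch_size=64, nb_epoch=1, callbacks=[tb])
# then run: tensorboard --logdir=./logs
```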

If you have any insight, please send it my way. Thanks again!!

```
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.dylib locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.dylib locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.dylib locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.dylib locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.dylib locally
1.0.7
number of examples 49372 49372
dimension of embedding space for words 100
vocabulary size 40000 the last 10 words can be used as place holders for unknown/oov words
total number of different words 74477 74477
number of words outside vocabulary which we can substitue using glove similarity 12523
number of words that will be regarded as unknonw(unk)/out-of-vocabulary(oov) 21954
46372 46372 3000 3000
H: oops building
D: there’s something different about this building can you guess what
H: mathematical formula for beer goggles
D: british scientists discover the exact equation so-called beer goggles
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:883] OS X does not support NUMA - returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GT 650M
major: 3 minor: 0 memoryClockRate (GHz) 0.9
pciBusID 0000:01:00.0
Total memory: 1023.69MiB
Free memory: 49.59MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 650M, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.

.....

I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x703a04a00 of size 1048576
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x703b04a00 of size 1048576
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x700b8dc00 of size 149504
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x700de4400 of size 204800
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x703c04a00 of size 1123840
I tensorflow/core/common_runtime/bfc_allocator.cc:689] Summary of in-use Chunks by size:
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 28 Chunks of size 256 totalling 7.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 24 Chunks of size 2048 totalling 48.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 4 Chunks of size 204800 totalling 800.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 31 Chunks of size 1048576 totalling 31.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 1139200 totalling 1.09MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 16000000 totalling 15.26MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] Sum Total of in-use chunks: 48.18MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:698] Stats:
Limit: 51998720
InUse: 50520576
MaxInUse: 50520576
NumAllocs: 129
MaxAllocSize: 16000000

W tensorflow/core/common_runtime/bfc_allocator.cc:270] **************************************************************************************************__
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 144.04MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:907] Resource exhausted: OOM when allocating tensor with shape[944,40000]
Traceback (most recent call last):
File "train.py", line 275, in
name = 'timedistributed_1')))
File "/usr/local/lib/python2.7/site-packages/keras/models.py", line 307, in add
output_tensor = layer(self.outputs[0])
File "/usr/local/lib/python2.7/site-packages/keras/engine/topology.py", line 484, in call
self.build(input_shapes[0])
File "/usr/local/lib/python2.7/site-packages/keras/layers/wrappers.py", line 102, in build
self.layer.build(child_input_shape)
File "/usr/local/lib/python2.7/site-packages/keras/layers/core.py", line 604, in build
name='{}_W'.format(self.name))
File "/usr/local/lib/python2.7/site-packages/keras/initializations.py", line 59, in glorot_uniform
return uniform(shape, s, name=name)
File "/usr/local/lib/python2.7/site-packages/keras/initializations.py", line 32, in uniform
return K.random_uniform_variable(shape, -scale, scale, name=name)
File "/usr/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 248, in random_uniform_variable
return variable(value, dtype=dtype, name=name)
File "/usr/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 132, in variable
get_session().run(v.initializer)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 343, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 567, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 640, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 662, in _do_call
e.code)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[944,40000]
[[Node: random_uniform_13/RandomUniform = RandomUniform[T=DT_INT32, dtype=DT_FLOAT, seed=0, seed2=0, _device="/job:localhost/replica:0/task:0/gpu:0"]]
Caused by op u'random_uniform_13/RandomUniform', defined at:
File "train.py", line 275, in
name = 'timedistributed_1')))
File "/usr/local/lib/python2.7/site-packages/keras/models.py", line 307, in add
output_tensor = layer(self.outputs[0])
File "/usr/local/lib/python2.7/site-packages/keras/engine/topology.py", line 484, in call
self.build(input_shapes[0])
File "/usr/local/lib/python2.7/site-packages/keras/layers/wrappers.py", line 102, in build
self.layer.build(child_input_shape)
File "/usr/local/lib/python2.7/site-packages/keras/layers/core.py", line 604, in build
name='{}_W'.format(self.name))
File "/usr/local/lib/python2.7/site-packages/keras/initializations.py", line 59, in glorot_uniform
return uniform(shape, s, name=name)
File "/usr/local/lib/python2.7/site-packages/keras/initializations.py", line 32, in uniform
return K.random_uniform_variable(shape, -scale, scale, name=name)
File "/usr/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 247, in random_uniform_variable
value = tf.random_uniform_initializer(low, high, dtype=tf_dtype)(shape)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/init_ops.py", line 98, in _initializer
return random_ops.random_uniform(shape, minval, maxval, dtype, seed=seed)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/random_ops.py", line 182, in random_uniform
seed2=seed2)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_random_ops.py", line 96, in _random_uniform
seed=seed, seed2=seed2, name=name)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/op_def_library.py", line 694, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2154, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1154, in init
self._traceback = _extract_stack()
```
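Doing the math on the failed allocation, the `[944, 40000]` tensor from the traceback lines up exactly with the 144.04 MiB figure in the warning. My reading (an assumption, not stated in the log) is that this is the TimeDistributed(Dense) weight matrix mapping the RNN output onto the 40,000-word vocabulary:

```python
# Size of a float32 tensor of shape [944, 40000] -- the allocation that failed.
# 944 appears to be the layer's input dimension here; 40000 is the vocabulary size.
print(944 * 40000 * 4 / 1024.0 / 1024.0)  # ~144.04 MiB
```

That single matrix is roughly three times the ~49.59 MiB the log reports as free, so it can never fit on this card as configured.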

500 epochs is just an arbitrary number; I kill the process much earlier than that.

Running into GPU memory problems is very common.
The main difference between GPU cards is their memory size, which is sometimes more important than speed.
The Mac GPU is too small, and some of its memory is used by OS X to render your screen.
Try running on AWS with a g2.2xlarge machine.

You can reduce batch_size until the model fits in GPU memory.
You can also try a smaller model (fewer RNN layers and nodes) and see what happens; see the sketch below.
The model parameters I used are also somewhat arbitrary.
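For example, these are the first knobs I would shrink (a rough sketch; the names follow the notebook's conventions, so double-check them against your copy of the code):

```python
# Sketch: shrink the model and batch size until it fits in GPU memory.
rnn_size = 256    # fewer hidden units per recurrent layer
rnn_layers = 2    # fewer stacked recurrent layers
batch_size = 32   # fewer examples per gradient step
```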

Thanks for the help... yet again! Do you have a PayPal account? If so, let me know the email; I'd like to buy you a few drinks to say thanks :)

🍺🍺 thanks