Not able to run with GPU: CUDNN_STATUS_INTERNAL_ERROR

Question

1992avinash opened this issue 7 years ago · comments

1992avinash commented 7 years ago

Hi,

I am very impressed by your work and appreciate it.
Your code runs on the CPU, but I am not able to get it working on the GPU.

I am using Theano as the backend, as mentioned. CUDA is configured properly, since I am able to run my other GPU-based models.

But when I run your code on the GPU I get the following error. Can you help me out with this? I am looking forward to training your models on my own dataset.

python main.py --model=sr --mode=fast
Using Theano backend.
Using cuDNN version 5110 on context None
Mapped name None to device cuda: GeForce GT 710 (0000:01:00.0)
Old Size : (1000, 1000, 3)
New Size : (2000, 2000, 3)
Image is reshaped to : (2000, 2000, 3)
th
Model loaded.
Traceback (most recent call last):
File "/home/avinash/.virtualenvs/cv/local/lib/python3.5/site-packages/theano/compile/function_module.py", line 884, in call
self.fn() if output_subset is None else
RuntimeError: error doing operation: CUDNN_STATUS_INTERNAL_ERROR

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main.py", line 62, in
model.upscale(path, save_intermediate=save, mode=mode, patch_size=patch_size, suffix=suffix)
File "/home/avinash/ml/Image-Super-Resolution/models.py", line 190, in upscale
result = model.predict(img_conv, batch_size=128, verbose=verbose)
File "/home/avinash/.virtualenvs/cv/local/lib/python3.5/site-packages/keras/engine/training.py", line 1591, in predict
batch_size=batch_size, verbose=verbose)
File "/home/avinash/.virtualenvs/cv/local/lib/python3.5/site-packages/keras/engine/training.py", line 1218, in _predict_loop
batch_outs = f(ins_batch)
File "/home/avinash/.virtualenvs/cv/local/lib/python3.5/site-packages/keras/backend/theano_backend.py", line 1196, in call
return self.function(*inputs)
File "/home/avinash/.virtualenvs/cv/local/lib/python3.5/site-packages/theano/compile/function_module.py", line 898, in call
storage_map=getattr(self.fn, 'storage_map', None))
File "/home/avinash/.virtualenvs/cv/local/lib/python3.5/site-packages/theano/gof/link.py", line 325, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "/home/avinash/.virtualenvs/cv/local/lib/python3.5/site-packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/home/avinash/.virtualenvs/cv/local/lib/python3.5/site-packages/theano/compile/function_module.py", line 884, in call
self.fn() if output_subset is None else
RuntimeError: error doing operation: CUDNN_STATUS_INTERNAL_ERROR
Apply node that caused the error: GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty{dtype='float32', context_name=None}.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), dilation=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
Toposort index: 36
Inputs types: [GpuArrayType(float32, 4D), GpuArrayType(float32, 4D), GpuArrayType(float32, 4D), <theano.gof.type.CDataType object at 0x7f448b27ef60>, Scalar(float32), Scalar(float32)]
Inputs shapes: [(1, 3, 2000, 2000), (64, 3, 9, 9), (1, 64, 2000, 2000), 'No shapes', (), ()]
Inputs strides: [(48000000, 16000000, 8000, 4), (972, 324, 36, 4), (1024000000, 16000000, 8000, 4), 'No strides', (), ()]
Inputs values: ['not shown', 'not shown', 'not shown', <capsule object NULL at 0x7f448793c450>, 1.0, 0.0]
Outputs clients: [[GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](GpuArrayConstant{[[[[ 0.5]]]]}, GpuDnnConv{algo='small', inplace=True}.0, InplaceGpuDimShuffle{x,0,x,x}.0)]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
*** Error in `python': free(): invalid pointer: 0x00007f44b7fca060 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f44e70997e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x7fe0a)[0x7f44e70a1e0a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f44e70a598c]
/usr/local/cuda-8.0/lib64/libcudnn.so.5(+0x3bb291)[0x7f44c6a1c291]
/usr/local/cuda-8.0/lib64/libcudnn.so.5(cudnnDestroy+0x118)[0x7f44c669e4a8]
python[0x49163e]
python[0x587b71]
/home/avinash/.virtualenvs/cv/local/lib/python3.5/site-packages/pygpu-0.6.5-py3.5-linux-x86_64.egg/pygpu/gpuarray.cpython-35m-x86_64-linux-gnu.so(+0xb43e)[0x7f44cd2d243e]
python[0x515e1d]
python(_PyGC_CollectNoFail+0x27)[0x605327]
python(PyImport_Cleanup+0x354)[0x51b0a4]
python(Py_Finalize+0x5e)[0x5fea5e]
python(Py_Main+0x644)[0x63e9c4]
python(main+0xe1)[0x4cfe41]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f44e7042830]
python(_start+0x29)[0x5d5f29]
======= Memory map: ========
00400000-007a8000 r-xp 00000000 08:03 4202316 /home/avinash/.virtualenvs/cv/bin/python3
009a8000-009aa000 r--p 003a8000 08:03 4202316 /home/avinash/.virtualenvs/cv/bin/python3
009aa000-00a41000 rw-p 003aa000 08:03 4202316 /home/avinash/.virtualenvs/cv/bin/python3

Somshubra Majumdar · Answer 1 · Tue Jun 06 2017 01:26:11 GMT+0800 (China Standard Time)

Hmm I haven't seen this error before.

I see that your image is 2000x2000. Perhaps the image is simply too large to fit in GPU memory? Have you tried it without --mode=fast?
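To put a rough number on the memory question, here is a small sketch that computes the size of the three float32 tensors involved in the failing convolution. The shapes are copied from the "Inputs shapes" line of the traceback; the helper name `tensor_bytes` is made up for illustration. It is only a lower bound for this one layer, since it ignores cuDNN workspace memory and every other layer of the model.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor of the given shape (4 bytes = float32)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element

# Shapes taken from the "Inputs shapes" line of the traceback above.
input_bytes = tensor_bytes((1, 3, 2000, 2000))    # upscaled image
kernel_bytes = tensor_bytes((64, 3, 9, 9))        # first conv layer weights
output_bytes = tensor_bytes((1, 64, 2000, 2000))  # pre-allocated conv output

for name, size in [("input", input_bytes),
                   ("kernel", kernel_bytes),
                   ("output", output_bytes)]:
    print("{:>6}: {:,.1f} MB".format(name, size / 1e6))
```

The output buffer alone comes to 1,024,000,000 bytes, which matches the first stride reported for that tensor in the traceback.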

1992avinash · Answer 2 · Tue Jun 06 2017 13:25:43 GMT+0800 (China Standard Time)

I also tried a smaller 100x100 input image, but got the same error.

python main.py bmw100.jpg --mode=fast --model=sr
Using Theano backend.
Using cuDNN version 5110 on context None
Mapped name None to device cuda0: GeForce GT 710 (0000:01:00.0)
Old Size : (100, 100, 3)
New Size : (200, 200, 3)
Image is reshaped to : (200, 200, 3)
Model loaded.
Traceback (most recent call last):
File "/home/avinash/.virtualenvs/cv/local/lib/python3.5/site-packages/theano/compile/function_module.py", line 884, in call
self.fn() if output_subset is None else
RuntimeError: error doing operation: CUDNN_STATUS_INTERNAL_ERROR

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main.py", line 50, in
model.upscale(path, save_intermediate=save, mode=mode, patch_size=patch_size, suffix=suffix)
File "/home/avinash/ml/tmp/Image-Super-Resolution/models.py", line 186, in upscale
result = model.predict(img_conv, batch_size=128, verbose=verbose)
File "/home/avinash/.virtualenvs/cv/local/lib/python3.5/site-packages/keras/engine/training.py", line 1591, in predict
batch_size=batch_size, verbose=verbose)
File "/home/avinash/.virtualenvs/cv/local/lib/python3.5/site-packages/keras/engine/training.py", line 1218, in _predict_loop
batch_outs = f(ins_batch)
File "/home/avinash/.virtualenvs/cv/local/lib/python3.5/site-packages/keras/backend/theano_backend.py", line 1196, in call
return self.function(*inputs)
File "/home/avinash/.virtualenvs/cv/local/lib/python3.5/site-packages/theano/compile/function_module.py", line 898, in call
storage_map=getattr(self.fn, 'storage_map', None))
File "/home/avinash/.virtualenvs/cv/local/lib/python3.5/site-packages/theano/gof/link.py", line 325, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "/home/avinash/.virtualenvs/cv/local/lib/python3.5/site-packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/home/avinash/.virtualenvs/cv/local/lib/python3.5/site-packages/theano/compile/function_module.py", line 884, in call
self.fn() if output_subset is None else
RuntimeError: error doing operation: CUDNN_STATUS_INTERNAL_ERROR
Apply node that caused the error: GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty{dtype='float32', context_name=None}.0, GpuDnnConvDesc{border_mode='half', subsample=(1, 1), dilation=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
Toposort index: 36
Inputs types: [GpuArrayType(float32, 4D), GpuArrayType(float32, 4D), GpuArrayType(float32, 4D), <theano.gof.type.CDataType object at 0x7f20177d86d8>, Scalar(float32), Scalar(float32)]
Inputs shapes: [(1, 3, 200, 200), (64, 3, 9, 9), (1, 64, 200, 200), 'No shapes', (), ()]
Inputs strides: [(480000, 160000, 800, 4), (972, 324, 36, 4), (10240000, 160000, 800, 4), 'No strides', (), ()]
Inputs values: ['not shown', 'not shown', 'not shown', <capsule object NULL at 0x7f2014a70900>, 1.0, 0.0]
Outputs clients: [[GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](GpuArrayConstant{[[[[ 0.5]]]]}, GpuDnnConv{algo='small', inplace=True}.0, InplaceGpuDimShuffle{x,0,x,x}.0)]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
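The two HINT lines name the Theano flags `optimizer=fast_compile` and `exception_verbosity=high`. One way to apply them, sketched below, is through the `THEANO_FLAGS` environment variable, which takes comma-separated `key=value` pairs. The helper `theano_flags` is a hypothetical convenience function, not part of Theano, and the variable has to be set before Theano is imported.

```python
import os

def theano_flags(**flags):
    """Join keyword arguments into the comma-separated form THEANO_FLAGS uses."""
    return ",".join("{}={}".format(key, value) for key, value in flags.items())

# Flag names and values are the ones suggested by the HINT lines above.
debug_flags = theano_flags(optimizer="fast_compile", exception_verbosity="high")

# Must run before `import theano` (or before Keras loads the Theano backend).
os.environ["THEANO_FLAGS"] = debug_flags
print(os.environ["THEANO_FLAGS"])
```

The same string can also be passed on the command line by prefixing the `python main.py ...` call with `THEANO_FLAGS=...`.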

1992avinash · Answer 3 · Tue Jun 06 2017 13:28:57 GMT+0800 (China Standard Time)

Which GPU are you training the model on? I have an NVIDIA GT 710 with 2 GB.

Somshubra Majumdar · Answer 4 · Tue Jun 06 2017 14:09:09 GMT+0800 (China Standard Time)

I'm using a GTX 980M. Can you try without --mode=fast?

1992avinash · Answer 5 · Tue Jun 06 2017 14:36:03 GMT+0800 (China Standard Time)

Hi, I tried without --mode=fast.

I am getting a ValueError:

ValueError: images and kernel must have the same stack size

Note: I don't think the image representation in Theano is the issue either.
The contents of my "keras.json" file are:

{
"floatx": "float32",
"image_data_format": "channels_first",
"epsilon": 1e-07,
"backend": "theano"
}
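For context on the ValueError: "stack size" refers to the channel axis, and with `channels_first` data the convolution expects axis 1 of the input to equal axis 1 of the kernel. The sketch below only illustrates that check with a hypothetical helper, `stack_sizes_match`. The kernel shape is the one from the earlier traceback, and the channels-last input is an assumed example of one way such a mismatch could arise, not a diagnosis of this particular failure.

```python
def stack_sizes_match(input_shape, kernel_shape):
    """Compare the channel ("stack") axis of an NCHW input and an OIHW kernel."""
    return input_shape[1] == kernel_shape[1]

kernel = (64, 3, 9, 9)            # (out_channels, in_channels, rows, cols)
channels_first = (1, 3, 200, 200)  # (batch, channels, rows, cols)
channels_last = (1, 200, 200, 3)   # assumed example: same image, channels last

print(stack_sizes_match(channels_first, kernel))  # True
print(stack_sizes_match(channels_last, kernel))   # False
```

Printing the shape of the array passed to `model.predict` and comparing it against the first layer's weights would show whether this is what is happening here.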

Somshubra Majumdar · Answer 6 · Tue Jun 06 2017 14:51:57 GMT+0800 (China Standard Time)

Hmm I really don't know then.