Training works, but testing the model fails with UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

Question

Training works, but testing the model fails with UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

shakey-cuimiao opened this issue 5 years ago · comments

pciBusID: 0000:83:00.0
totalMemory: 10.76GiB freeMemory: 2.03GiB
2020-04-15 18:38:22.765241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-04-15 18:38:22.766743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-15 18:38:22.766773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2020-04-15 18:38:22.766789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2020-04-15 18:38:22.766950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1776 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:83:00.0, compute capability: 7.5)
Restore from ./east_icdar2015_resnet_v1_50_rbox/model.ckpt-49491
WARNING:tensorflow:From /opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Find 6 images
2020-04-15 18:38:30.125680: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-04-15 18:38:30.188806: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node resnet_v1_50/conv1/Conv2D}}]]
[[{{node feature_fusion/concat_3}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "eval.py", line 196, in
tf.app.run()
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "eval.py", line 159, in main
score, geometry = sess.run([f_score, f_geometry], feed_dict={input_images: [im_resized]})
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node resnet_v1_50/conv1/Conv2D (defined at /opt/shakey/deep-learning/EAST/nets/resnet_utils.py:122) ]]
[[node feature_fusion/concat_3 (defined at /opt/shakey/deep-learning/EAST/model.py:80) ]]

Caused by op 'resnet_v1_50/conv1/Conv2D', defined at:
File "eval.py", line 196, in
tf.app.run()
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "eval.py", line 140, in main
f_score, f_geometry = model.model(input_images, is_training=False)
File "/opt/shakey/deep-learning/EAST/model.py", line 40, in model
logits, end_points = resnet_v1.resnet_v1_50(images, is_training=is_training, scope='resnet_v1_50')
File "/opt/shakey/deep-learning/EAST/nets/resnet_v1.py", line 252, in resnet_v1_50
reuse=reuse, scope=scope)
File "/opt/shakey/deep-learning/EAST/nets/resnet_v1.py", line 193, in resnet_v1
net = resnet_utils.conv2d_same(net, 64, 7, stride=2, scope='conv1')
File "/opt/shakey/deep-learning/EAST/nets/resnet_utils.py", line 122, in conv2d_same
rate=rate, padding='VALID', scope=scope)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1155, in convolution2d
conv_dims=2)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1058, in convolution
outputs = layer.apply(inputs)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1227, in apply
return self.call(inputs, *args, **kwargs)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 530, in call
outputs = super(Layer, self).call(inputs, *args, **kwargs)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 554, in call
outputs = self.call(inputs, *args, **kwargs)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/keras/layers/convolutional.py", line 194, in call
outputs = self._convolution_op(inputs, self.kernel)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 966, in call
return self.conv_op(inp, filter)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 591, in call
return self.call(inp, filter)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 208, in call
name=self.name)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in init
self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node resnet_v1_50/conv1/Conv2D (defined at /opt/shakey/deep-learning/EAST/nets/resnet_utils.py:122) ]]
[[node feature_fusion/concat_3 (defined at /opt/shakey/deep-learning/EAST/model.py:80) ]]

pankSM · Answer 1 · Wed May 27 2020 17:50:12 GMT+0800 (China Standard Time)

Hello, how did you use the GPU during training? I followed the tutorial, but GPU memory usage was only 60 MB. Could you explain when you have time? Thanks.

unyxs281 · Answer 2 · Fri Nov 27 2020 10:33:50 GMT+0800 (China Standard Time)

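Hello, I ran into the same problem. Training works on a single GPU, but specifying multiple GPUs as described in the tutorial produces the same error. Could you help? Thanks.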

unyxs281 · Answer 3 · Mon Nov 30 2020 18:07:10 GMT+0800 (China Standard Time)

This problem is caused by insufficient GPU memory.
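
The log in the question is consistent with this: it reports totalMemory: 10.76GiB but freeMemory: 2.03GiB, and TensorFlow created the device with only 1776 MB. As a rough sketch (the helper name free_fraction is hypothetical, and the regular expression assumes the exact log format shown in the question), the free share can be computed like this:

```python
import re

# Line copied from the log in the question.
LOG_LINE = "totalMemory: 10.76GiB freeMemory: 2.03GiB"


def free_fraction(log_line):
    """Parse a TensorFlow device log line and return (total_gib, free_gib, free/total)."""
    match = re.search(
        r"totalMemory:\s*([\d.]+)GiB\s+freeMemory:\s*([\d.]+)GiB", log_line
    )
    if match is None:
        raise ValueError("no totalMemory/freeMemory pair found")
    total = float(match.group(1))
    free = float(match.group(2))
    return total, free, free / total


total, free, fraction = free_fraction(LOG_LINE)
print(f"{free:.2f} of {total:.2f} GiB free ({fraction:.0%})")
```

Under these assumptions only about 19% of the card was free when eval.py started, so another process was probably holding most of the memory. A fixed per_process_gpu_memory_fraction of 0.4 would ask for roughly 4.3 GiB on this card, which is more than was available at that moment.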

Mohammed Ayub · Answer 4 · Thu Apr 15 2021 15:10:51 GMT+0800 (China Standard Time)

@argman
I get the same error. Training started fine on the CPU, but it was very slow, so I tried a single GPU, which fails with the same stack trace. Is this really caused by GPU memory, or by something else?
I tried --num_readers=1 and --gpu_batch_size=1 on a g4dn (EC2) machine, which has 16 GB of memory.

Any help appreciated !

Mohammed Ayub · Answer 5 · Fri Apr 16 2021 00:33:11 GMT+0800 (China Standard Time)

It looks like this was a cuDNN issue. The following error was showing up in the log:

2021-04-15 08:49:03.630044: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.5.1 but source was compiled with: 7.6.0.  CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2021-04-15 08:49:03.632954: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.5.1 but source was compiled with: 7.6.0.  CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.

As the message says, I had cuDNN 7.5.1 at runtime, while the source was compiled with 7.6.0. After checking my CUDA version with nvcc --version, I ran the following conda install, which seemed to fix the issue:

conda install https://anaconda.org/anaconda/cudnn/7.6.0/download/linux-64/cudnn-7.6.0-cuda10.0_0.tar.bz2

I think the recommended way is to make the OS-level changes from NVIDIA, but I did not want to touch OS packages.
After the conda install, the cuDNN runtime library is picked up from the environment first, so it worked.
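
The compatibility rule quoted in that log message can be checked mechanically. The following is only a sketch of the rule as the message states it (same major version, and for cuDNN 7.0 or later a loaded minor version that is equal or higher); the function names are hypothetical and this is not TensorFlow's own check:

```python
import re

# Error text copied from the log above (timestamp and tail omitted).
LOG_LINE = (
    "Loaded runtime CuDNN library: 7.5.1 but source was compiled with: 7.6.0.  "
    "CuDNN library major and minor version needs to match or have higher minor "
    "version in case of CuDNN 7.0 or later version."
)


def parse_versions(log_line):
    """Return (loaded, compiled) version strings, or None if the line does not match."""
    match = re.search(
        r"Loaded runtime CuDNN library: (\d+\.\d+\.\d+) "
        r"but source was compiled with: (\d+\.\d+\.\d+)",
        log_line,
    )
    return match.groups() if match else None


def cudnn_compatible(loaded, compiled):
    """Apply the rule from the log message to two 'major.minor.patch' strings."""
    l_major, l_minor = (int(part) for part in loaded.split(".")[:2])
    c_major, c_minor = (int(part) for part in compiled.split(".")[:2])
    if l_major != c_major:
        return False
    if c_major >= 7:
        # cuDNN 7.0 or later: a higher loaded minor version is acceptable.
        return l_minor >= c_minor
    # Older releases: major and minor must match exactly.
    return l_minor == c_minor


loaded, compiled = parse_versions(LOG_LINE)
print(loaded, compiled, cudnn_compatible(loaded, compiled))
```

For the versions in this thread the check fails (7.5.1 loaded against 7.6.0 compiled), which matches the error, and it would pass once 7.6.0 is installed.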

Himanchal Chandra · Answer 6 · Tue May 11 2021 17:45:43 GMT+0800 (China Standard Time)

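You can use either of two methods to avoid this situation (TensorFlow 1.x API):

Allow growth (more flexible):
import tensorflow as tf
config = tf.ConfigProto()
# Allocate GPU memory on demand instead of reserving it all at startup.
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

Cap the fraction of GPU memory the process may use:
import tensorflow as tf
config = tf.ConfigProto()
# Let this process use at most 40% of the GPU's memory.
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config)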

I hope it helps!