Internal: CUDA runtime implicit initialization on GPU:0 failed. Status: unknown error

Question

mroelandts opened this issue 7 years ago · comments

Hello GustavZ,
I ran into some problems running your code on the Jetson TX2.
At first there were no problems at all, but after a few days I keep getting this error.
Full terminal log:

Model found. Proceed.
Loading frozen model into memory
2018-02-14 12:34:03.208044: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:881] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2018-02-14 12:34:03.208210: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.66GiB freeMemory: 4.71GiB
2018-02-14 12:34:03.208272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2018-02-14 12:34:04.632828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:859] Could not identify NUMA node of /job:localhost/replica:0/task:0/device:GPU:0, defaulting to 0.  Your kernel may not have been built with NUMA support.
Loading label map
Starting detection
2018-02-14 12:34:28.283994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2018-02-14 12:34:28.284142: E tensorflow/core/common_runtime/direct_session.cc:168] Internal: CUDA runtime implicit initialization on GPU:0 failed. Status: unknown error
Traceback (most recent call last):
  File "object_detection.py", line 249, in <module>
    main()
  File "object_detection.py", line 245, in main
    detection(graph, category, score, expand)
  File "object_detection.py", line 170, in detection
    with tf.Session(graph=detection_graph,config=config) as sess:
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1509, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 628, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

If I first run the program without splitting the model and then run it again with the split turned on, it all works fine! But after I reboot, the problem comes back...
Do you have any idea?

Alexandre Gariépy · Answer 1 · Thu Feb 22 2018 02:44:51 GMT+0800 (China Standard Time)

I had a similar issue. I was using tensorflow 1.5. I downgraded to 1.4.1 and now it works.

Gustav von Zitzewitz · Answer 2 · Sun Mar 04 2018 05:30:28 GMT+0800 (China Standard Time)

Yeah, that's tied to the TensorFlow version.
The reason is the version the model was exported in.
There are incompatibilities between 1.5 and 1.4.

Gustav von Zitzewitz · Answer 3 · Fri Mar 09 2018 17:44:54 GMT+0800 (China Standard Time)

@MatthiasRoelandts does the error still appear?

zhang · Answer 4 · Thu Mar 29 2018 10:17:32 GMT+0800 (China Standard Time)

@gustavz I have now installed TensorFlow 1.7 on the Jetson TX2 but face the same issue:

Loading label map
Building Graph
2018-03-29 01:16:27.067350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-03-29 01:16:27.067445: E tensorflow/core/common_runtime/direct_session.cc:167] Internal: CUDA runtime implicit initialization on GPU:0 failed. Status: unknown error
Traceback (most recent call last):
  File "object_detection.py", line 301, in <module>
    main()
  File "object_detection.py", line 297, in main
    detection(graph, category, score, expand)
  File "object_detection.py", line 180, in detection
    with tf.Session(graph=detection_graph,config=config) as sess:
  File "/home/nvidia/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1509, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/nvidia/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 638, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/home/nvidia/.local/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

Can you suggest what to do?

Gustav von Zitzewitz · Answer 5 · Wed Apr 25 2018 14:26:07 GMT+0800 (China Standard Time)

@imxboards I also sometimes hit this error when I switch TensorFlow versions.
It is definitely caused by TensorFlow's internal changes and incompatibilities.
Try using TF-1.4.

I myself am also unable to run it with TF-1.7.

Isaiah Becker-Mayer · Answer 6 · Sat May 19 2018 03:06:22 GMT+0800 (China Standard Time)

I'm experiencing this exact same issue. I had my code up and running on the Jetson TX2 with TensorFlow 1.7, but now, after powering down and traveling for a few days, it's giving me this error. I can run it without the GPU using export CUDA_VISIBLE_DEVICES='', but this defeats the purpose. Does anybody have a solution? I could try going to an older version of TF, but all the Jetson builds/wheel files I can find are for Python 2.7, whereas my entire project is written in Python 3.5. Any help would be greatly appreciated!

Mike Wise · Answer 7 · Thu May 24 2018 06:22:17 GMT+0800 (China Standard Time)

@ibeckermayer - it is often surprisingly easy to get a Python program running under both Python 3.5 and 2.7. Only took me 30 minutes to convert my 3.5 program - I only had issues where I was using datetime routines to generate utc time.

naisy · Answer 8 · Tue Jun 12 2018 16:07:05 GMT+0800 (China Standard Time)

Hi @gustavz,

I searched for the multiple-session problem and found this:
https://devtalk.nvidia.com/default/topic/1035884/jetson-tx2/cuda-error-creating-more-than-one-session-using-tensorflow/post/5265161/#5265161

We need to set gpu_options in the tf.Session() that is created first. That call is in:
v1.0: object_detection.py
v2.0: rod/model.py

def load_frozenmodel():
    ...
    input_graph = tf.Graph()
    config = tf.ConfigProto()
    # Apply the GPU options to this first session, not only to the later detection session.
    config.gpu_options.allow_growth = allow_memory_growth
    with tf.Session(graph=input_graph, config=config):
        ...
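
A minimal sketch of the idea above, assuming the TF 1.x API used in this thread (tf.ConfigProto, tf.Session). The helper name make_session_config is hypothetical and not part of the repository, and passing the TensorFlow module in as an argument is only a convenience so the same helper can be handed either `tensorflow` (on 1.x) or `tensorflow.compat.v1`. The point is that every session, including the first one created while loading the frozen model, receives the same gpu_options.

```python
def make_session_config(tf_module, allow_growth=True):
    """Build one ConfigProto to reuse for every tf.Session in the program.

    tf_module is assumed to expose the TF 1.x API, e.g. `tensorflow` on 1.x
    or `tensorflow.compat.v1` on newer releases.
    """
    config = tf_module.ConfigProto()
    # Allocate GPU memory on demand instead of reserving it all up front.
    config.gpu_options.allow_growth = allow_growth
    return config


# Hypothetical usage with TF 1.x (commented out so the sketch has no hard dependency):
# import tensorflow as tf
# config = make_session_config(tf)
# with tf.Session(graph=tf.Graph(), config=config):      # first session
#     pass
# with tf.Session(graph=detection_graph, config=config) as sess:  # later session
#     ...
```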

Gustav von Zitzewitz · Answer 9 · Wed Jun 13 2018 01:16:12 GMT+0800 (China Standard Time)

@naisy which problem should this solve?

naisy · Answer 10 · Wed Jun 13 2018 09:33:22 GMT+0800 (China Standard Time)

This solves the problem in environments where a second session cannot be created:
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

Sharad Jain · Answer 11 · Fri Jan 17 2020 17:01:20 GMT+0800 (China Standard Time)

I am still getting this issue on version 2.0.0

Melvin Cabatuan · Answer 12 · Wed Apr 22 2020 04:19:09 GMT+0800 (China Standard Time)

Limiting the GPU memory solves this issue for me:

import tensorflow as tf

MEMORY_LIMIT = 1024
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(gpus[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=MEMORY_LIMIT)])
    except RuntimeError as e:
        print(e)
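
As an alternative to a fixed limit, here is a sketch assuming TensorFlow 2.x: memory growth is the 2.x counterpart of the allow_growth option from Answer 8. Whether it helps with this particular error is not confirmed anywhere in this thread. The function name enable_memory_growth is hypothetical, and the import is done lazily so the snippet returns 0 instead of failing when TensorFlow is not installed.

```python
def enable_memory_growth():
    """Turn on on-demand GPU memory allocation; return the number of GPUs found."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed, so there is nothing to configure.
        return 0

    gpus = tf.config.experimental.list_physical_devices('GPU')
    for gpu in gpus:
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
        except RuntimeError as e:
            # Raised when the GPU was already initialized before this call.
            print(e)
    return len(gpus)


# Call this once at program start, before any op touches the GPU.
gpu_count = enable_memory_growth()
```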

Harendra Singh · Answer 13 · Tue Jul 07 2020 17:24:23 GMT+0800 (China Standard Time)

I am using TensorFlow 2.0.0 in C++ code and am facing the same issue. Any solutions, guys?

ctsams9 · Answer 14 · Fri Jul 31 2020 08:12:36 GMT+0800 (China Standard Time)

I'm using the latest version of TensorFlow (2.3.0) with Python 3.6.10 and CUDA 10.1 on Ubuntu 18.04 and am facing the same issue as well. Setting export CUDA_VISIBLE_DEVICES=0 or export CUDA_VISIBLE_DEVICES='' helps the code run, but not consistently. I'm new and don't actually know what these exports do exactly. This is the error I get:
RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

Pouyan · Answer 15 · Sun Aug 02 2020 21:07:34 GMT+0800 (China Standard Time)

@ctsams9 I am in the same situation as you. Have you found a solution so far?

ctsams9 · Answer 16 · Sun Aug 02 2020 22:46:24 GMT+0800 (China Standard Time)

@horcrux1 No luck so far. I'm planning to post some of the results I get after running some tests here and on stackoverflow soon.

Marcos Reinan de Assis Conceição · Answer 17 · Thu Aug 13 2020 23:26:58 GMT+0800 (China Standard Time)

To clarify @ctsams9's exports: both set the environment variable CUDA_VISIBLE_DEVICES, which is useful when you want CUDA/TensorFlow to see and use only specific GPUs (0 is the first GPU). If the variable is set to an empty string, no GPU is visible and TensorFlow runs its computations on the CPU only.
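
The same variable can also be set from inside a Python script. This is a minimal sketch, not something posted in this thread; set_visible_gpus is a hypothetical helper name. It only has an effect if it runs before TensorFlow (or any other CUDA library) is imported, because the variable is read when CUDA initializes.

```python
import os


def set_visible_gpus(indices):
    """Restrict CUDA to the given GPU indices; an empty list hides every GPU."""
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Comparable to running: export CUDA_VISIBLE_DEVICES=0
set_visible_gpus([0])

# Comparable to running: export CUDA_VISIBLE_DEVICES=''  (CPU only)
# set_visible_gpus([])

# Import TensorFlow only after the variable has been set.
```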

In my case, it runs fine when my only GPU (0) is hidden, but I need to fix whatever the "device kernel image is invalid" error means to proceed with my projects.

By the way, I am getting the same error on Arch after upgrading TensorFlow via pip.

Marcos Reinan de Assis Conceição · Answer 18 · Thu Aug 13 2020 23:29:54 GMT+0800 (China Standard Time)

I tried to limit TF GPU memory earlier but had no success, sadly :(

Marcos Reinan de Assis Conceição · Answer 19 · Fri Aug 14 2020 00:24:09 GMT+0800 (China Standard Time)

Okay, downgrading TensorFlow to version 2.2 did it.

pip install --force-reinstall tensorflow-gpu==2.2

Also, if you have ever used pip install with --ignore-installed to install tensorflow versions or dependencies, consider removing them first.
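
To see which TensorFlow distributions are actually installed before and after the downgrade, something like the following sketch may help. It assumes Python 3.8 or newer (importlib.metadata is not available in the 3.6 interpreter mentioned earlier in the thread); the function name and the list of candidate package names are illustrative assumptions. More than one entry, or an unexpected version, may point at leftovers worth uninstalling.

```python
from importlib import metadata


def installed_tensorflow_packages(
    candidates=("tensorflow", "tensorflow-gpu", "tensorflow-cpu"),
):
    """Return {distribution name: version} for each candidate that is installed."""
    found = {}
    for name in candidates:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # This distribution is not installed in the current environment.
            pass
    return found


print(installed_tensorflow_packages())
```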

What's strange to me is that it takes several minutes to initialize TensorFlow with the GPU. I don't think that's normal :S