ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error'
jldroid19 opened this issue
🐛 Bug
q.app
q.user
q.client
report_error: True
q.events
q.args
report_error: True
stacktrace
Traceback (most recent call last):
  File "/workspace/./llm_studio/app_utils/handlers.py", line 78, in handle
    await home(q)
  File "/workspace/./llm_studio/app_utils/sections/home.py", line 66, in home
    stats.append(ui.stat(label="Current GPU load", value=f"{get_gpu_usage():.1f}%"))
  File "/workspace/./llm_studio/app_utils/utils.py", line 1949, in get_gpu_usage
    all_gpus = GPUtil.getGPUs()
  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/GPUtil/GPUtil.py", line 102, in getGPUs
    deviceIds = int(vals[i])
ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error'
Error
None
Git Version
fatal: not a git repository (or any of the parent directories): .git
To Reproduce
I'm not sure why this is happening, and it is hard to reproduce.
LLM Studio version
v1.4.0-dev
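For context on where the ValueError comes from: GPUtil.getGPUs() parses the output of nvidia-smi field by field, and when NVML cannot be initialized, nvidia-smi emits the error message instead of device data, so the int() conversion on the first field fails. A minimal sketch of that failure mode (simplified for illustration, not GPUtil's actual parsing code):

```python
# Simplified illustration of the failure mode, not GPUtil's actual code:
# GPUtil splits each line of nvidia-smi's CSV output and converts the first
# field to a device id, but when NVML is down the "line" is the error message.
nvidia_smi_output = "Failed to initialize NVML: Unknown Error"

for line in nvidia_smi_output.splitlines():
    vals = line.split(", ")
    device_id = int(vals[0])  # ValueError: invalid literal for int() with base 10: ...
```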
This means you have no GPUs available. Can you run nvidia-smi to confirm everything is fine?
This seems to be an issue with your environment/system then, unfortunately.
@jldroid19 did you figure the issue out?
are you running this in docker?
Yes, I am running it using Docker. It's strange: we can run an experiment on a dataset with an expected finish of 5 days and it will finish. We then go to start another experiment, and 3 hours later the container stops, causing the experiment to fail. With a quick docker restart the app is back up and running, but the training that had been in progress is lost.
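One thing that might help narrow this down while an experiment is running: a small watchdog inside the container that logs when the GPUs stop being reachable, so the disappearance gets a timestamp instead of only surfacing as a failed experiment hours later. This is only a hypothetical helper script, not part of LLM Studio, and the one-minute poll interval is an arbitrary choice:

```python
# Hypothetical watchdog: periodically check whether nvidia-smi can still see
# the GPUs inside the container and log the first moment it fails.
import datetime
import subprocess
import time

POLL_SECONDS = 60  # assumption: polling once a minute is frequent enough

while True:
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    output = (result.stdout + result.stderr).strip()
    if result.returncode != 0 or "Failed to initialize NVML" in output:
        print(f"{stamp} GPUs not reachable: {output}")
    else:
        print(f"{stamp} OK: {output}")
    time.sleep(POLL_SECONDS)
```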
I stumbled upon this recently, might be related:
NVIDIA/nvidia-docker#1469
NVIDIA/nvidia-container-toolkit#465 (comment)
There seems to be an issue with GPUs suddenly disappearing inside Docker containers.
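Until the underlying Docker/NVML problem is resolved, one possible mitigation on the app side would be to make the home-page GPU stat tolerant of GPUtil failing instead of crashing the handler. This is a sketch assuming the same GPUtil-based helper seen in the traceback; the try/except and the 0.0 fallback are assumptions, not the actual llm_studio code:

```python
import GPUtil


def get_gpu_usage() -> float:
    """Average GPU load in percent, or 0.0 if the GPUs cannot be queried.

    Defensive sketch of the helper from the traceback above: GPUtil raises a
    ValueError when nvidia-smi reports "Failed to initialize NVML" instead of
    device data, so any query failure degrades to 0.0 rather than taking the
    home page down.
    """
    try:
        gpus = GPUtil.getGPUs()
    except Exception:
        return 0.0
    if not gpus:
        return 0.0
    # GPUtil reports load as a fraction between 0 and 1.
    return sum(gpu.load for gpu in gpus) / len(gpus) * 100.0
```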