ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error'
jldroid19 opened this issue
🐛 Bug
q.app
q.user
q.client
report_error: True
q.events
q.args
report_error: True
stacktrace
Traceback (most recent call last):
  File "/workspace/./llm_studio/app_utils/handlers.py", line 78, in handle
    await home(q)
  File "/workspace/./llm_studio/app_utils/sections/home.py", line 66, in home
    stats.append(ui.stat(label="Current GPU load", value=f"{get_gpu_usage():.1f}%"))
  File "/workspace/./llm_studio/app_utils/utils.py", line 1949, in get_gpu_usage
    all_gpus = GPUtil.getGPUs()
  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/GPUtil/GPUtil.py", line 102, in getGPUs
    deviceIds = int(vals[i])
ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error'
Error
None
Git Version
fatal: not a git repository (or any of the parent directories): .git
To Reproduce
I'm not sure why this is happening, and it is hard to reproduce.
LLM Studio version
v1.4.0-dev
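For context on where the ValueError comes from: GPUtil.getGPUs() parses the output of nvidia-smi field by field, and when NVML cannot be initialized, nvidia-smi emits the error message instead of device data, so the int() conversion on the first field fails. A minimal sketch of that failure mode (simplified for illustration, not GPUtil's actual parsing code):

```python
# Simplified illustration of the failure mode, not GPUtil's actual code:
# GPUtil splits each line of nvidia-smi's CSV output and converts the first
# field to a device id, but when NVML is down the "line" is the error message.
nvidia_smi_output = "Failed to initialize NVML: Unknown Error"

for line in nvidia_smi_output.splitlines():
    vals = line.split(", ")
    device_id = int(vals[0])  # ValueError: invalid literal for int() with base 10: ...
```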
This means you have no GPUs available. Can you run nvidia-smi to confirm everything is fine?
This seems to be an issue with your environment/system then, unfortunately.
@jldroid19 did you figure the issue out?
are you running this in docker?
Yes, I am running it using Docker. It's strange: we can run an experiment on a dataset with an expected finish of 5 days and it will finish. We then go to start another experiment, and 3 hours later the container stops, causing the experiment to fail. With a quick docker restart the app is back up and running, but the training that had been in progress is lost.
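One thing that might help narrow this down while an experiment is running: a small watchdog inside the container that logs when the GPUs stop being reachable, so the disappearance gets a timestamp instead of only surfacing as a failed experiment hours later. This is only a hypothetical helper script, not part of LLM Studio, and the one-minute poll interval is an arbitrary choice:

```python
# Hypothetical watchdog: periodically check whether nvidia-smi can still see
# the GPUs inside the container and log the first moment it fails.
import datetime
import subprocess
import time

POLL_SECONDS = 60  # assumption: polling once a minute is frequent enough

while True:
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    output = (result.stdout + result.stderr).strip()
    if result.returncode != 0 or "Failed to initialize NVML" in output:
        print(f"{stamp} GPUs not reachable: {output}")
    else:
        print(f"{stamp} OK: {output}")
    time.sleep(POLL_SECONDS)
```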
I stumbled upon this recently, might be related:
NVIDIA/nvidia-docker#1469
NVIDIA/nvidia-container-toolkit#465 (comment)
There seems to be an issue with GPUs suddenly disappearing inside Docker containers.
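Until the underlying Docker/NVML problem is resolved, one possible mitigation on the app side would be to make the home-page GPU stat tolerant of GPUtil failing instead of crashing the handler. This is a sketch assuming the same GPUtil-based helper seen in the traceback; the try/except and the 0.0 fallback are assumptions, not the actual llm_studio code:

```python
import GPUtil


def get_gpu_usage() -> float:
    """Average GPU load in percent, or 0.0 if the GPUs cannot be queried.

    Defensive sketch of the helper from the traceback above: GPUtil raises a
    ValueError when nvidia-smi reports "Failed to initialize NVML" instead of
    device data, so any query failure degrades to 0.0 rather than taking the
    home page down.
    """
    try:
        gpus = GPUtil.getGPUs()
    except Exception:
        return 0.0
    if not gpus:
        return 0.0
    # GPUtil reports load as a fraction between 0 and 1.
    return sum(gpu.load for gpu in gpus) / len(gpus) * 100.0
```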