h2oai / h2o-llmstudio

H2O LLM Studio - a framework and no-code GUI for fine-tuning LLMs. Documentation: https://h2oai.github.io/h2o-llmstudio/

Home Page: https://gpt-gm.h2o.ai


ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error'

jldroid19 opened this issue · comments

🐛 Bug

q.app
q.user
q.client
report_error: True
q.events
q.args
report_error: True
stacktrace
```
Traceback (most recent call last):
  File "/workspace/./llm_studio/app_utils/handlers.py", line 78, in handle
    await home(q)
  File "/workspace/./llm_studio/app_utils/sections/home.py", line 66, in home
    stats.append(ui.stat(label="Current GPU load", value=f"{get_gpu_usage():.1f}%"))
  File "/workspace/./llm_studio/app_utils/utils.py", line 1949, in get_gpu_usage
    all_gpus = GPUtil.getGPUs()
  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/GPUtil/GPUtil.py", line 102, in getGPUs
    deviceIds = int(vals[i])
ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error'
```

Error
None

Git Version
fatal: not a git repository (or any of the parent directories): .git

To Reproduce

I'm not sure why this is happening; it's hard to reproduce.

LLM Studio version

v1.4.0-dev
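
For context on the stacktrace: GPUtil.getGPUs() shells out to nvidia-smi and parses its CSV output, converting each device id with int(). When the driver's NVML library cannot be initialized, nvidia-smi prints "Failed to initialize NVML: Unknown Error" instead of CSV data, so that int() call raises the ValueError above. A minimal defensive sketch (hypothetical, not LLM Studio's actual get_gpu_usage) that degrades to 0.0 when NVML is unavailable:

```python
import GPUtil


def get_gpu_usage_safe() -> float:
    """Average GPU load in percent, or 0.0 if NVML/nvidia-smi is unavailable."""
    try:
        gpus = GPUtil.getGPUs()
    except Exception:
        # GPUtil parses nvidia-smi's CSV output; when NVML is broken the output
        # is an error string, and int() raises the ValueError shown above.
        return 0.0
    if not gpus:
        return 0.0
    return sum(gpu.load for gpu in gpus) / len(gpus) * 100
```

This only hides the symptom on the dashboard; the underlying NVML failure still has to be resolved.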

This means you have no GPUs available. Can you run nvidia-smi to confirm everything is fine?
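
Besides running nvidia-smi in the container, NVML can be probed directly from Python; a quick sketch (pynvml is an assumption here, not part of the reported setup, and may require `pip install nvidia-ml-py`):

```python
import pynvml

try:
    pynvml.nvmlInit()
    print(f"NVML OK, {pynvml.nvmlDeviceGetCount()} GPU(s) visible")
    pynvml.nvmlShutdown()
except pynvml.NVMLError as err:
    # Mirrors the failure in the stacktrace: NVML cannot be initialized.
    print(f"NVML failed: {err}")
```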

[screenshots attached]

What's interesting is that the environment just suddenly drops. It's like the GPUs just disappear after a few hours of training.

This seems to be an issue with your environment/system then, unfortunately.

@jldroid19 did you figure the issue out?

@psinger I have not.

are you running this in docker?

Yes, I am running it using Docker. It's strange: we can run a dataset with an expected finish time of 5 days and it will complete. Then we start another experiment, and 3 hours later the container stops, causing the experiment to fail. With a quick docker restart the app is back up and running, but the training that had been in progress is lost.

I stumbled upon this recently, might be related:
NVIDIA/nvidia-docker#1469

NVIDIA/nvidia-container-toolkit#465 (comment)

There seems to be an issue with GPUs suddenly disappearing inside Docker containers.
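
Until the root cause in the container runtime is pinned down, one way to notice the failure early is to poll NVML from inside the container; a hypothetical watchdog sketch (not a fix for the runtime/cgroup problem discussed in the linked threads):

```python
import subprocess
import time


def nvml_alive() -> bool:
    """Return True if nvidia-smi inside the container can still reach the driver."""
    proc = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return proc.returncode == 0 and "Failed to initialize NVML" not in proc.stdout + proc.stderr


# Poll every 5 minutes; log as soon as the GPUs vanish so the run can be
# checkpointed or the container restarted promptly.
while True:
    if not nvml_alive():
        print("NVML lost inside the container; GPUs are no longer visible")
        break
    time.sleep(300)
```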