NVIDIA / go-nvml

Go Bindings for the NVIDIA Management Library (NVML)

nvmlErrorString error

BrushXiaoMinGuo opened this issue · comments

I import nvml in my code, but when I run it I get this error:
image

@klueska

Which version of go-nvml are you building against and how do you build your application?

go-nvml is v0.12.0-1
image

The build command is: go build main.go

I also have this problem when I use nvidia-mig-parted:

image

@elezar @klueska

Note that when building applications that use the bindings, one has to add the following go build flags:

go build -ldflags="-extldflags=-Wl,-z,lazy" <files>.go

as per #36 (comment) for example.

Since you mentioned mig-parted, I will confirm that we're applying the correct flags there too.

I tested this command, but I get the same error:
image
image

@elezar

My Go version is:
image

@BrushXiaoMinGuo would you be able to test whether the behaviour persists with a newer golang version? We typically use at least 1.18 internally, and it may be related to the cgo implementation in older golang versions.

@elezar I updated Go to 1.20.8 and tried again, but I get the same error:
image

Just as a sanity check -- are you able to run nvidia-smi from whatever environment you are in here? Is libnvidia-ml.so.1 in your library path?

Ignore my compiled program for now. I tested nvidia-mig-manager first, and I get the same error:
image

Here are some details about my environment:

  1. This is a k8s cluster; one node is a GPU node.
  2. nvidia-smi runs on this GPU node.
image
  3. libnvidia-ml.so.1 is in my library path.
image
  4. The mig-manager pod is running.
image
  5. I use this YAML and image: https://github.com/NVIDIA/mig-parted/blob/main/deployments/gpu-operator/nvidia-mig-manager-example.yaml
  6. But when I exec into the mig-manager pod, I cannot run nvidia-smi.
image
  7. I think the volume mount may have caused this, but I don't know why.
image

What else do I need to check? Thank you. @klueska @elezar

We currently don't support running the mig-manager outside of the GPU Operator. This means these examples are likely out of date and probably need some tweaking to get them to work (though it's not recommended).

That said, I'm guessing the reason you are having issues is that the mig-manager you are starting doesn't have GPU support injected into it.

This is normally done either through a runtime class called nvidia or through making the nvidia runtime the default runtime in containerd. Which method are you using?

I think I use the second one.

  1. I installed nvidia-container-runtime on my GPU node.
image
  2. I modified /etc/docker/daemon.json and set nvidia-container-runtime as the default runtime.
image

That looks like a Docker daemon.json, not a containerd config.

Yes, it's the Docker daemon.json. Docker will call containerd.
How can mig-manager have GPU support injected into it? Could you give me some advice?

Unless you are using a very old version of Kubernetes or have explicitly selected Docker to be your shim layer in Kubernetes, I would imagine that Docker is not the container runtime you are using to launch containers in k8s (containerd is the default).

My question before still stands:

This is normally done either through a runtime class called nvidia or through making the nvidia runtime the default runtime in containerd. Which method are you using?

Only once I know the answer to this can I help you further.

I faced the same problem: ./app: symbol lookup error: ./app: undefined symbol: nvmlErrorString
The error occurs in the code example:

ret := nvml.Init()
if ret != nvml.SUCCESS {
    log.Fatalf("Unable to initialize NVML: %v", nvml.ErrorString(ret))
}

If nvml.Init() does not succeed, just do not use the nvml function nvml.ErrorString(ret).

@jgivc how are you building the application?

With regards to:

If nvml.Init() does not succeed, just do not use the nvml function nvml.ErrorString(ret).

This is not entirely true. At present, nvml.Init() also loads the dynamic library, and if this fails the symbol is invalid. If the library loads but the underlying call to nvmlInit returns an error, calling nvml.ErrorString is valid.

We have a workaround for this in another set of libraries we use: https://github.com/NVIDIA/go-nvlib/blob/486ed3f0c8139174a97565985fd48664b3048ad6/pkg/nvml/nvml.go#L38-L69 and we may consider doing something similar here.
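To illustrate the general idea (this is not the go-nvlib code linked above, only a sketch of the pattern), the error-string lookup can be guarded so that it never calls into the shared library unless the library was loaded. Everything here is a local stand-in: the Return type, the ErrLibraryNotFound value, and the guard struct are hypothetical and do not come from go-nvml.

```go
package main

import "fmt"

// Return is a local stand-in for an NVML-style return code.
// It is NOT the type exported by go-nvml.
type Return int

const (
	Success Return = 0
	// ErrLibraryNotFound is a hypothetical code meaning the shared
	// library could not be loaded at all.
	ErrLibraryNotFound Return = -1
)

// guard wraps a library-backed error-string lookup.
type guard struct {
	loaded bool                // set to true only after the library was loaded
	lookup func(Return) string // stands in for the call into the shared library
}

// ErrorString only delegates to the library-backed lookup when the library
// was loaded; otherwise it returns a static message so that no unresolved
// symbol is ever called.
func (g *guard) ErrorString(r Return) string {
	if !g.loaded || g.lookup == nil {
		return fmt.Sprintf("NVML library not loaded (return code %d)", int(r))
	}
	return g.lookup(r)
}

func main() {
	// Simulate a machine where the library is missing.
	g := &guard{loaded: false}
	fmt.Println(g.ErrorString(ErrLibraryNotFound))

	// Simulate a machine where the library loaded and provides messages.
	g = &guard{loaded: true, lookup: func(r Return) string {
		return fmt.Sprintf("library message for code %d", int(r))
	}}
	fmt.Println(g.ErrorString(Success))
}
```

The point of the design is that the decision is made in Go, before any cgo call, so a missing library leads to a readable message instead of a symbol lookup error.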

how are you building the application?

Just go build. It works fine on my server, but on my computer that library does not exist and the example code gave me an error.

Not quite on topic, but also a problem with 'undefined symbol'. My application was working fine for several days and suddenly crashed with the error "undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3", although before that there were several successful calls to nvml library methods. Is it possible to somehow intercept such errors so that the application does not crash?