NVIDIA / go-nvml

Go Bindings for the NVIDIA Management Library (NVML)

CreateComputeInstance() shows “Not Supported”

ytaoeer opened this issue · comments

commented

Go: 1.18
go-nvml: v0.12.0-1
The code is below:
image
image
The GPU instance can be created, and from the command line I can create both the GI and the CI.

commented

NVIDIA-SMI 535.54.03
Driver Version: 535.54.03
CUDA Version: 12.2
NVIDIA A100-PCIE-40GB

The GPU instance can be created, and from the command line I can create both the GI and the CI.

Which commands (I assume nvidia-smi) do you use?

commented

Yes. With "sudo nvidia-smi mig -i 0 -cgi 19 -C" I can create both the GI and the CI.
But with the NVML Go library I can only create the GI. When I call CreateComputeInstance() to create the CI, it fails.

Just as a sanity check: do you have multiple A100 devices available? In the example you show above you access the device with index 1, whereas the nvidia-smi command accesses device 0.

For what it's worth, we use the following flow to create a compute instance:

gi, ret = device.CreateGpuInstance(&giProfileInfo)
if ret != nvml.SUCCESS {
	return fmt.Errorf("error creating GPU instance: %v", ret)
}

ciProfileInfo, ret := gi.GetComputeInstanceProfileInfo(0, 0)
if ret != nvml.SUCCESS {
	return fmt.Errorf("error getting compute instance profile info: %v", ret)
}

_, ret = gi.CreateComputeInstance(&ciProfileInfo)
if ret != nvml.SUCCESS {
	return fmt.Errorf("error creating Compute instance: %v", ret)
}

in some of our tooling. Note the call to GetComputeInstanceProfileInfo. I believe you can use 0 for both arguments.

commented

Yes, this node has 4 A100s. I will try using GetComputeInstanceProfileInfo to create the CI profile info. Thanks.

commented

It works. Thank you very much. So I guess the problem was that I constructed the ComputeInstanceProfileInfo myself. But according to the source code, the Go library only passes ComputeInstanceProfileInfo.Id to NVML.
image

As far as I am aware, there are different versions of the ComputeInstanceProfileInfo struct, or at least a versioned struct was introduced recently. It may be that constructing it manually created the wrong version, and that this is what caused the error.

Actually, it may also be that the profile ID and the info ID are not the same. Could you output the returned ciProfileInfo struct and confirm its .Id value?
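One way to do that comparison is to print both structs with the `%+v` verb, which includes the field names. The following is only a minimal sketch: `profileInfo` is a hypothetical stand-in that mirrors a few fields of the library struct so that the snippet runs without a GPU, and all values are made up for illustration. With the real library you would pass the struct returned by GetComputeInstanceProfileInfo to the same format verb.

```go
package main

import "fmt"

// profileInfo is a hypothetical stand-in for the library's
// ComputeInstanceProfileInfo struct. Only a few fields are mirrored here.
type profileInfo struct {
	Id                  uint32
	SliceCount          uint32
	InstanceCount       uint32
	MultiprocessorCount uint32
}

// describe formats every field together with its name, so that a hand-built
// struct and a queried one can be compared side by side.
func describe(label string, info profileInfo) string {
	return fmt.Sprintf("%s: %+v", label, info)
}

func main() {
	// Hand-built: only Id is set, every other field keeps its zero value.
	manual := profileInfo{Id: 0}

	// Placeholder values standing in for what the query call might return;
	// they are NOT taken from a real device.
	queried := profileInfo{Id: 0, SliceCount: 1, InstanceCount: 7, MultiprocessorCount: 14}

	fmt.Println(describe("manual", manual))
	fmt.Println(describe("queried", queried))
	fmt.Println("same Id:", manual.Id == queried.Id)
}
```

If the two Id values match but creation still fails with the hand-built struct, the difference is more likely to be in the other fields or in the struct version than in the ID itself.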

commented

I am using the release version go-nvml v0.12.0-1. The output is below:
image
image