CreateComputeInstance() shows “Not Supported”

Question

ytaoeer opened this issue a year ago · comments

Go: 1.18
go-nvml: v0.12.0-1
The code is below.

The GPU instance can be created, and from the command line I can create both the GI and the CI.

ytaoer · Answer 1 · Sun Jul 09 2023 09:38:20 GMT+0800 (China Standard Time)

NVIDIA-SMI 535.54.03
Driver Version: 535.54.03
CUDA Version: 12.2
NVIDIA A100-PCIE-40GB

Evan Lezar · Answer 2 · Mon Jul 10 2023 17:33:37 GMT+0800 (China Standard Time)

gpu instance can be created and using command line, i can create gi and ci.

Which commands (I assume nvidia-smi) do you use?

ytaoer · Answer 3 · Mon Jul 10 2023 18:43:32 GMT+0800 (China Standard Time)

Yes. With "sudo nvidia-smi mig -i 0 -cgi 19 -C" I can create both the GI and the CI.
But with the NVML Go library I can only create the GI; when I call CreateComputeInstance() to create the CI, it fails.

Evan Lezar · Answer 4 · Mon Jul 10 2023 19:57:48 GMT+0800 (China Standard Time)

Just as a sanity check: do you have multiple A100 devices available? In the example you show above you access the device with index 1, whereas the nvidia-smi command accesses device 0.

For what it's worth, we use the following flow to create a compute instance:

gi, ret = device.CreateGpuInstance(&giProfileInfo)
if ret != nvml.SUCCESS {
	return fmt.Errorf("error creating GPU instance: %v", ret)
}

ciProfileInfo, ret := gi.GetComputeInstanceProfileInfo(0, 0)
if ret != nvml.SUCCESS {
	return fmt.Errorf("error getting Compute instance profile info for: %v", ret)
}

_, ret = gi.CreateComputeInstance(&ciProfileInfo)
if ret != nvml.SUCCESS {
	return fmt.Errorf("error creating Compute instance: %v", ret)
}

in some of our tooling. Note the call to GetComputeInstanceProfileInfo. I believe you can use 0 for both arguments.

ytaoer · Answer 5 · Mon Jul 10 2023 20:27:19 GMT+0800 (China Standard Time)

Yes, this node has 4 A100s. I will try using GetComputeInstanceProfileInfo to obtain the CI profile info. Thanks.

ytaoer · Answer 6 · Mon Jul 10 2023 21:46:58 GMT+0800 (China Standard Time)

It works, thank you very much. So I guess the problem was that I constructed the ComputeInstanceProfileInfo myself. But according to the source code, the Go library only passes ComputeInstanceProfileInfo.Id to NVML.

Evan Lezar · Answer 7 · Mon Jul 10 2023 22:48:56 GMT+0800 (China Standard Time)

As far as I am aware, there are different versions of the ComputeInstanceProfileInfo struct, or at least a versioned struct was introduced recently. It may be that constructing it manually created the wrong version, and that this is what caused the error.

Actually, it may also be that the profile ID and the info ID are not the same. Could you output the returned ciProfileInfo struct and confirm its .Id value?

ytaoer · Answer 8 · Tue Jul 11 2023 20:23:11 GMT+0800 (China Standard Time)

I am using the release version go-nvml v0.12.0-1. The output is below.