CreateComputeInstance() shows “Not Supported”
ytaoeer opened this issue · comments
NVIDIA-SMI 535.54.03
Driver Version: 535.54.03
CUDA Version: 12.2
NVIDIA A100-PCIE-40GB
A GPU instance can be created, and from the command line I can create both the GI and the CI.
Which commands (I assume nvidia-smi) do you use?
Yes. With "sudo nvidia-smi mig -i 0 -cgi 19 -C" I can create both the GI and the CI.
With the NVML Go library, however, I can only create the GI; creating the CI with CreateComputeInstance() fails.
Just as a sanity check: do you have multiple A100 devices available? In the example you show above you access a device with index 1, whereas the nvidia-smi command accesses device 0.
For what it's worth, we use the following flow to create a compute instance:
	gi, ret = device.CreateGpuInstance(&giProfileInfo)
	if ret != nvml.SUCCESS {
		return fmt.Errorf("error creating GPU instance: %v", ret)
	}
	ciProfileInfo, ret := gi.GetComputeInstanceProfileInfo(0, 0)
	if ret != nvml.SUCCESS {
		return fmt.Errorf("error getting Compute instance profile info for: %v", ret)
	}
	_, ret = gi.CreateComputeInstance(&ciProfileInfo)
	if ret != nvml.SUCCESS {
		return fmt.Errorf("error creating Compute instance: %v", ret)
	}
in some of our tooling. Note the call to GetComputeInstanceProfileInfo. I believe you can use 0 for both arguments.
Yes, this node has 4 A100s. I will try using GetComputeInstanceProfileInfo to create the CI profile info. Thanks.
It works, thank you very much. So I guess the problem was that I constructed the ComputeInstanceProfileInfo myself. But according to the source code, the Go library only passes ComputeInstanceProfileInfo.Id to NVML.
As far as I am aware, there are different versions of the ComputeInstanceProfileInfo struct, or at least a versioned struct was introduced recently. It may be that constructing it manually created the wrong version, and that this is what caused the error.
Actually, it may also be that the ProfileID and the info ID are not the same. Could you output the returned ciProfileInfo struct and confirm its .Id value?