Unable to get various stats in rocm 2.9
misos1 opened this issue · comments
================================================================================
ERROR: GPU[0] : Unable to get maximum Graphics Package Power
ERROR: GPU[1] : Unable to get maximum Graphics Package Power
================================================================================
================================================================================
ERROR: GPU[0] : Unable to get Power Profile
ERROR: GPU[1] : Unable to get Power Profile
================================================================================
================================================================================
ERROR: GPU[0] : Unable to get Average Graphics Package Power Consumption
ERROR: GPU[1] : Unable to get Average Graphics Package Power Consumption
================================================================================
================================================================================
ERROR: GPU[0] : Unable to get GPU use.
ERROR: GPU[1] : Unable to get GPU use.
================================================================================
================================================================================
ERROR: GPU[0] : Unable to get GPU memory use.
ERROR: GPU[1] : Unable to get GPU memory use.
================================================================================
================================================================================
ERROR: GPU[0] : Unable to get PCIe replay count
ERROR: GPU[1] : Unable to get PCIe replay count
================================================================================
================================================================================
GPU[0] : Unique ID: N/A
GPU[1] : Unique ID: N/A
================================================================================
================================================================================
GPU[0] : Serial Number: N/A
GPU[1] : Serial Number: N/A
================================================================================
================================================================================
ERROR: GPU[0] : Unable to display PowerPlay table
ERROR: GPU[1] : Unable to display PowerPlay table
================================================================================
================================================================================
ERROR: GPU[0] : Unable to display voltage
ERROR: GPU[1] : Unable to display voltage
================================================================================
================================================================================
================================================================================
GPU[0] : Unable to get voltage curve
GPU[1] : Unable to get voltage curve
==============================End of ROCm SMI Log ==============================
Also, there is now again only one temperature reading instead of three as before (junction, ...):
GPU[0] : Temperature (Sensor #1) (C): 34.0
GPU[1] : Temperature (Sensor #1) (C): 27.0
Looks like your GPU doesn't support that functionality. What GPU do you have?
With previous versions of ROCm, such as 2.8, almost all of these entries were available:
================================================================================
GPU[0] : Max Graphics Package Power (W): 264.0
GPU[1] : Max Graphics Package Power (W): 220.0
================================================================================
================================================================================
GPU[0] :
GPU[0] : NUM MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[0] : 0 BOOTUP_DEFAULT*: 70 60 0 0
GPU[0] : 1 3D_FULL_SCREEN : 70 60 1 3
GPU[0] : 2 POWER_SAVING : 90 60 0 0
GPU[0] : 3 VIDEO : 70 60 0 0
GPU[0] : 4 VR : 70 90 0 0
GPU[0] : 5 COMPUTE : 30 60 0 6
GPU[0] : 6 CUSTOM : 0 0 0 0
GPU[1] :
GPU[1] : NUM MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[1] : 0 BOOTUP_DEFAULT*: 70 60 0 0
GPU[1] : 1 3D_FULL_SCREEN : 70 60 1 3
GPU[1] : 2 POWER_SAVING : 90 60 0 0
GPU[1] : 3 VIDEO : 70 60 0 0
GPU[1] : 4 VR : 70 90 0 0
GPU[1] : 5 COMPUTE : 30 60 0 6
GPU[1] : 6 CUSTOM : 0 0 0 0
================================================================================
================================================================================
GPU[0] : Average Graphics Package Power (W): 3.0
GPU[1] : Average Graphics Package Power (W): 3.0
================================================================================
================================================================================
GPU[0] : GPU use (%): 0
GPU[1] : GPU use (%): 0
================================================================================
================================================================================
ERROR: GPU[0] : Unable to get GPU memory use.
ERROR: GPU[1] : Unable to get GPU memory use.
================================================================================
================================================================================
GPU[0] : PCIe Replay Count: 0
GPU[1] : PCIe Replay Count: 0
================================================================================
================================================================================
GPU[0] : Unique ID: 0215054ab5c808c4
GPU[1] : Unique ID: 0213fbda0ae038a4
================================================================================
================================================================================
GPU[0] : Serial Number: N/A
GPU[1] : Serial Number: N/A
================================================================================
PIDs for KFD processes:
================================================================================
ERROR: GPU[0] : Unable to display PowerPlay table
ERROR: GPU[1] : Unable to display PowerPlay table
================================================================================
================================================================================
GPU[0] : Voltage (mV): 750
GPU[1] : Voltage (mV): 750
================================================================================
================================================================================
================================================================================
==============================End of ROCm SMI Log ==============================
================================================================================
================================================================================
GPU[0] : Temperature (Sensor edge) (C): 28.0
GPU[0] : Temperature (Sensor junction) (C): 28.0
GPU[0] : Temperature (Sensor mem) (C): 27.0
GPU[1] : Temperature (Sensor edge) (C): 26.0
GPU[1] : Temperature (Sensor junction) (C): 27.0
GPU[1] : Temperature (Sensor mem) (C): 25.0
================================================================================
================================================================================
GPU[0] : dcefclk clock level: 0 (600Mhz)
GPU[0] : mclk clock level: 0 (167Mhz)
GPU[0] : pcie clock level: 0 (8.0GT/s, x16)
GPU[0] : sclk clock level: 0 (852Mhz)
GPU[0] : socclk clock level: 0 (600Mhz)
================================================================================
GPU[1] : dcefclk clock level: 0 (600Mhz)
GPU[1] : mclk clock level: 0 (167Mhz)
GPU[1] : pcie clock level: 0 (8.0GT/s, x16)
GPU[1] : sclk clock level: 0 (852Mhz)
GPU[1] : socclk clock level: 0 (600Mhz)
================================================================================
================================================================================
That's definitely concerning, then. What GPU have you got? Maybe we hit a regression in the firmware or in the kernel code; nothing runtime-related could have caused this, since it's all sysfs and kernel. The output of rocm-smi -i should be enough to get me looking into which firmware it could be.
Oh, sorry: I either had a somehow corrupted installation or it needed a reboot. I have now reinstalled ROCm, and before rebooting the output looked like what I posted at the beginning. After the reboot the problem disappeared. Probably the firmware or kernel module needed to be loaded. The GPUs are Vega 10 XT and Vega 10 XTX.
GPU[0] : GPU ID: 0x687f
GPU[1] : GPU ID: 0x6863
The only things rocm-smi still does not show are these:
================================================================================
ERROR: GPU[0] : Unable to get GPU memory use.
ERROR: GPU[1] : Unable to get GPU memory use.
================================================================================
================================================================================
ERROR: GPU[0] : Unable to display PowerPlay table
ERROR: GPU[1] : Unable to display PowerPlay table
================================================================================
================================================================================
GPU[0] : Unable to get voltage curve
GPU[1] : Unable to get voltage curve
==============================End of ROCm SMI Log ==============================
But it was like this before as well, so my GPUs probably do not support them.
Glad to see that things are working properly. Voltage Curve/PP Table is Vega20 non-server only. GPU Memory Use is, I think, VG20-and-later only as well, since it's not in Vega10's SMU firmware. So that seems to be "functioning as expected".
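Since these stats come from sysfs and the kernel, one rough way to see which ones a given card exposes is to check which attribute files can be read. The sketch below assumes the usual amdgpu layout under /sys/class/drm/cardN/device. The attribute names are ones the amdgpu driver documents, but which file rocm-smi reads for each stat is an assumption here, and probe_attrs is a hypothetical helper, not part of rocm-smi.

```python
from pathlib import Path

# Attribute names documented for the amdgpu kernel driver. Which of them
# rocm-smi reads for each stat is an assumption made for illustration.
CANDIDATE_ATTRS = [
    "gpu_busy_percent",
    "mem_busy_percent",
    "mem_info_vram_used",
    "unique_id",
    "pcie_replay_count",
    "pp_power_profile_mode",
    "pp_od_clk_voltage",
]


def probe_attrs(device_dir, attrs=CANDIDATE_ATTRS):
    """Return {attr: stripped file contents, or None if it cannot be read}."""
    results = {}
    for name in attrs:
        path = Path(device_dir) / name
        try:
            results[name] = path.read_text().strip()
        except OSError:
            # A missing file and a failed read both count as "unavailable".
            results[name] = None
    return results


if __name__ == "__main__":
    # Keep only card0, card1, ... and skip connector entries such as card0-DP-1.
    cards = [p for p in Path("/sys/class/drm").glob("card*") if p.name[4:].isdigit()]
    for card in sorted(cards):
        print(card.name)
        for name, value in probe_attrs(card / "device").items():
            print(f"  {name}: {'unavailable' if value is None else 'ok'}")
```

An attribute that reports as unavailable here would be expected to line up with an "Unable to get ..." line in the rocm-smi output, though that mapping is not confirmed in this thread.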
Is GPU memory use not based on the values shown with --showmeminfo? Because this is a little strange:
$ rocm-smi --showmeminfo all
GPU[0] : vram Total Memory (B): 8573157376
GPU[0] : vram Total Used Memory (B): 140845056
GPU[0] : vis_vram Total Memory (B): 268435456
GPU[0] : vis_vram Total Used Memory (B): 15654912
GPU[0] : gtt Total Memory (B): 67363909632
GPU[0] : gtt Total Used Memory (B): 147021824
GPU[1] : vram Total Memory (B): 17163091968
GPU[1] : vram Total Used Memory (B): 199143424
GPU[1] : vis_vram Total Memory (B): 268435456
GPU[1] : vis_vram Total Used Memory (B): 22470656
GPU[1] : gtt Total Memory (B): 67363909632
GPU[1] : gtt Total Used Memory (B): 26001408
But
$ rocm-smi --showmemuse
ERROR: GPU[0] : Unable to get GPU memory use.
ERROR: GPU[1] : Unable to get GPU memory use.
And the concise output somehow knows VRAM%:
$ rocm-smi
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 71.0c 96.0W 1302Mhz 945Mhz 23.92% auto 264.0W 2% 91%
1 64.0c 10.0W 1269Mhz 945Mhz 16.86% auto 220.0W 0% 0%
GPU memory use is more accurately described as "GPU memory busy rate". Basically, it polls the GPU some number of times to see whether the memory block is in use. If so, that poll counts as a 1; if not, as a 0. Say you had 10 polls with five 1s and five 0s: that would be a busy rate (memory utilization rate) of 50%.
If you had a single memory allocation covering all of VRAM, then your utilization would be 1%, but the VRAM used would be 100%. I know the wording is confusing, but both metrics have useful applications; it's just not explained very clearly outside of the kernel documentation.
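To make the difference between the two metrics concrete, here is a minimal sketch of both calculations. The function names are hypothetical, the polling is simulated with a plain list of 0/1 samples, and it is an assumption that the VRAM% column is derived from used and total bytes in this way. The byte counts are the GPU[0] vram values from the --showmeminfo output earlier in the thread.

```python
def busy_rate_percent(samples):
    """Percentage of polls in which the memory block was seen as busy (1)."""
    if not samples:
        raise ValueError("need at least one sample")
    busy = sum(1 for s in samples if s)
    return 100.0 * busy / len(samples)


def used_percent(used_bytes, total_bytes):
    """Percentage of memory that is allocated, regardless of activity."""
    return 100.0 * used_bytes / total_bytes


# Ten simulated polls: five busy, five idle, as in the example above.
polls = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(busy_rate_percent(polls))  # 50.0

# GPU[0] vram figures from the --showmeminfo output: about 1.64% allocated.
print(round(used_percent(140845056, 8573157376), 2))
```

The first number says how often the memory was active while being polled; the second says how much of it is allocated. A large idle allocation gives a high used percentage with a low busy rate.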