TMA: Pause-loop is not classified at all levels
aayasin opened this issue
Ahmad Yasin commented
A rather simple pause-loop kernel is classified properly as Core Bound at levels 1 and 2, and again at levels 5 and 6, but not at the mid-levels 3 and 4 (Ports_Utilization and Ports_Utilized_0 in the output below).
I am documenting this here and will address it in the TMA 4.2 release.
Here is a reproducer with perf-tools.
P.S. @andikleen: This is a CFL machine (8th gen Core). Can that be reflected instead of [skl] in the first line of the toplev output?
$ ./kernels/gen-kernel.py -i pause > ./kernels/pause3x.c
$ gcc -g -O2 -o ./kernels/pause3x ./kernels/pause3x.c
$ ./pmu-tools/toplev.py --no-desc --no-perf --nodes '+CoreIPC,+UPI,+Time,+MUX' -vl6 -- ./kernels/pause3x 10000000 2>&1 | egrep -v ' [10]\.. '
# 4.11-full-perf on Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz [skl]
BE Backend_Bound % Slots 98.9
BE/Core Backend_Bound.Core_Bound % Slots 98.8 <==
FE Frontend_Bound.Fetch_Latency.MS_Switches % Clocks 2.6 <
BE/Core Backend_Bound.Core_Bound.Ports_Utilization % Clocks 9.9 <
BE/Core Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0 % Clocks 8.5 <
BE/Core Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0.Serializing_Operation % Clocks 98.9 <
BE/Core Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0.Serializing_Operation.Slow_Pause % Clocks 91.3 <
Info.Thread UPI Metric 2.9
RET Retiring.Light_Operations.Other % Uops 100.0 <
MUX % 2.2
Using level 6.
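The inconsistency in the output above can also be spotted mechanically. The following is only a rough sketch, not part of toplev or perf-tools: the helper names `parse_nodes` and `underreported_levels` and the 50% threshold are assumptions made for illustration. It takes the Backend_Bound chain from the output, derives each node's level from the depth of its dotted path, and flags levels that report a low value even though a deeper descendant reports a high one. Levels mix `% Slots` and `% Clocks`, so this is a crude heuristic rather than an exact comparison.

```python
import re

# Backend_Bound chain copied from the toplev output above.
TOPLEV_LINES = """\
BE Backend_Bound % Slots 98.9
BE/Core Backend_Bound.Core_Bound % Slots 98.8 <==
BE/Core Backend_Bound.Core_Bound.Ports_Utilization % Clocks 9.9 <
BE/Core Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0 % Clocks 8.5 <
BE/Core Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0.Serializing_Operation % Clocks 98.9 <
BE/Core Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0.Serializing_Operation.Slow_Pause % Clocks 91.3 <
""".splitlines()

# <area> <dotted node path> % <unit> <value>
NODE_RE = re.compile(r"^\S+\s+(?P<node>[\w.]+)\s+%\s+\w+\s+(?P<value>[\d.]+)")


def parse_nodes(lines):
    """Return (level, node, value) tuples; level is the depth of the dotted path."""
    nodes = []
    for line in lines:
        m = NODE_RE.match(line)
        if m:
            node = m.group("node")
            nodes.append((node.count(".") + 1, node, float(m.group("value"))))
    return nodes


def underreported_levels(nodes, threshold=50.0):
    """Levels below `threshold` that have a descendant at or above it.

    The threshold is an arbitrary value chosen for this sketch.
    """
    flagged = []
    for level, node, value in nodes:
        has_hot_descendant = any(
            other.startswith(node + ".") and other_value >= threshold
            for _, other, other_value in nodes
        )
        if value < threshold and has_hot_descendant:
            flagged.append(level)
    return flagged


print(underreported_levels(parse_nodes(TOPLEV_LINES)))  # prints [3, 4]
```

On this data the flagged levels are Ports_Utilization (9.9) and Ports_Utilized_0 (8.5), which sit between Core_Bound (98.8) and Serializing_Operation (98.9).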
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-5
Off-line CPU(s) list: 6-11
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
Stepping: 10
CPU MHz: 3701.500
CPU max MHz: 3700.0000
CPU min MHz: 800.0000
BogoMIPS: 7399.70
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
Andi Kleen commented
I fixed the reporting for CFL/KBL (but nothing else atm)
Ahmad Yasin commented
With the TMA 4.2 release, I have verified that this issue is indeed resolved on a CLX system.
This entry can be closed.
[labuser@ssp-wpclx-cdi276 perf-tools]$ python2.7 ./do.py build profile -g "-i PAUSE -n 3" -a pause3x -ki 1e8 --profile-mask 0x40 -v1 --pmu-tools '/bin/python2.7 ./pmu-tools/'
building kernel: pause3x ..
./kernels/gen-kernel.py -i PAUSE -n 3 > ./kernels/pause3x.c 2>&1
gcc -g -O2 -o ./kernels/pause3x ./kernels/pause3x.c 2>&1
topdown auto-drilldown ..
/bin/python2.7 ./pmu-tools//toplev.py --no-desc --drilldown --nodes '+CoreIPC,+Instructions,+CORE_CLKS,+CPU_Utilization,+Time,+MUX,+IpTB,+L2MPKI' -V pause3x-1e8.toplev--drilldown-perf.csv --metric-group +Summary,+HPC -- taskset 0x4 ./kernels/pause3x 100000000 2>&1 | tee pause3x-1e8.toplev--drilldown.log | egrep -v "^(Run toplev|Adding|Using)"
2 events not supported
# 4.19-full-perf on Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz [clx/skylake]
BE Backend_Bound % Slots 95.9 <==
Info.Core CoreIPC CoreMetric 0.1
Info.Inst_Mix Instructions Count 621,945,279.0
Info.Thread IPC Metric 0.1
Info.System CPU_Utilization Metric 1.0
Info.System Time Seconds 3.3
Info.Thread IpTB Metric 6.1
Info.Core CORE_CLKS Count 12,281,209,061.0
Info.Memory L2MPKI Metric 0.1
MUX % 9.2
Rerunning workload
BE Backend_Bound % Slots 95.9
Info.Core CoreIPC CoreMetric 0.0
Info.Inst_Mix Instructions Count 609,719,766.0
BE/Core Backend_Bound.Core_Bound % Slots 95.9 <==
Info.Thread IpTB Metric 6.0
Info.Core CORE_CLKS Count 12,241,658,267.0
Info.Memory L2MPKI Metric 0.0
Info.System CPU_Utilization Metric 1.0
Info.System Time Seconds 3.3
MUX % 18.3
Rerunning workload
BE Backend_Bound % Slots 95.9
Info.Core CoreIPC CoreMetric 0.1
Info.Inst_Mix Instructions Count 611,825,162.0
BE/Core Backend_Bound.Core_Bound % Slots 95.9
BE/Core Backend_Bound.Core_Bound.Ports_Utilization % Clocks 95.2 <==
Info.Thread IpTB Metric 6.1
Info.Core CORE_CLKS Count 12,197,285,099.0
Info.Memory L2MPKI Metric 0.1
Info.System CPU_Utilization Metric 1.0
Info.System Time Seconds 3.3
MUX % 12.2
Rerunning workload
BE Backend_Bound % Slots 95.9
Info.Core CoreIPC CoreMetric 0.1
Info.Inst_Mix Instructions Count 615,664,622.0
BE/Core Backend_Bound.Core_Bound % Slots 95.9
BE/Core Backend_Bound.Core_Bound.Ports_Utilization % Clocks 95.1
BE/Core Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0 % Clocks 91.0 <==
Info.Thread IpTB Metric 6.1
Info.Core CORE_CLKS Count 12,247,617,237.0
Info.Memory L2MPKI Metric 0.1
Info.System CPU_Utilization Metric 1.0
Info.System Time Seconds 3.3
MUX % 12.2
Rerunning workload
BE Backend_Bound % Slots 95.9
Info.Core CoreIPC CoreMetric 0.1
Info.Inst_Mix Instructions Count 613,116,500.0
BE/Core Backend_Bound.Core_Bound % Slots 95.9
BE/Core Backend_Bound.Core_Bound.Ports_Utilization % Clocks 95.5
BE/Core Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0 % Clocks 91.0
BE/Core Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0.Serializing_Operation % Clocks 95.9 <==
Info.Thread IpTB Metric 6.1
Info.Core CORE_CLKS Count 12,216,356,878.0
Info.Memory L2MPKI Metric 0.1
Info.System CPU_Utilization Metric 1.0
Info.System Time Seconds 3.3
MUX % 12.2
Rerunning workload
BE Backend_Bound % Slots 95.9
Info.Core CoreIPC CoreMetric 0.1
Info.Inst_Mix Instructions Count 616,309,280.0
BE/Core Backend_Bound.Core_Bound % Slots 95.9
BE/Core Backend_Bound.Core_Bound.Ports_Utilization % Clocks 94.9
BE/Core Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0 % Clocks 91.0
BE/Core Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0.Serializing_Operation % Clocks 95.9
BE/Core Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_0.Serializing_Operation.Slow_Pause % Clocks 97.7 <==
Info.Thread IpTB Metric 6.1
Info.Core CORE_CLKS Count 12,256,754,613.0
Info.Memory L2MPKI Metric 0.1
Info.System CPU_Utilization Metric 1.0
Info.System Time Seconds 3.3
MUX
Andi Kleen commented
Fixed