Tally issue
Kerilk opened this issue
Device profiling is missing a Min value here (see the zeCommandListAppendBarrier row); no idea why.
Device profiling | 1 Hostnames | 1 Processes | 1 Threads | 1 Devices | 1 Subdevices |
Name | Time | Time(%) | Calls | Average | Min | Max |
zeCommandListAppendBarrier | 37.02us | 54.19% | 61 | 606.95ns | | 2.39us |
addOne | 13.00us | 19.03% | 2 | 6.50us | 2.81us | 10.19us |
zeMemoryCopy(MD) | 10.82us | 15.83% | 3 | 3.61us | 3.43us | 3.95us |
zeMemoryCopy(SM) | 3.43us | 5.02% | 1 | 3.43us | 3.43us | 3.43us |
zeMemoryCopy(DM) | 3.02us | 4.41% | 1 | 3.02us | 3.02us | 3.02us |
zeCommandListAppendWriteGlobalTimestamp | 1.04us | 1.52% | 1 | 1.04us | 1.04us | 1.04us |
Total | 68.33us | 100.00% | 69 |
Tagging @Sarbojit2019
If you can share the trace, it would be highly appreciated! And sorry for this bug.
I think the zeCommandListAppendBarrier returned immediately, so the time was 0, and I didn't display it.
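The failure mode described here, where a legitimate minimum of 0 is treated as "nothing to show", can be illustrated with a small sketch. This is not THAPI's actual tally code: `format_ns_buggy`, `format_ns_fixed`, and the sample durations are hypothetical, and the cause of the real bug is only assumed to be a check of this kind.

```python
from typing import Optional


def format_ns_buggy(value_ns: Optional[int]) -> str:
    # Truthiness check: 0 is falsy, so a real 0ns duration is hidden.
    if not value_ns:
        return ""
    return f"{value_ns}ns"


def format_ns_fixed(value_ns: Optional[int]) -> str:
    # Only a missing value (None) yields an empty cell; 0 is displayed.
    if value_ns is None:
        return ""
    return f"{value_ns}ns"


# Hypothetical barrier durations in nanoseconds, one of them exactly 0.
durations_ns = [0, 416, 2390]

print(repr(format_ns_buggy(min(durations_ns))))
print(repr(format_ns_fixed(min(durations_ns))))
```

With the first formatter the Min cell comes out empty, as in the table above; with the second it shows 0ns.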
Here is the trace file
iprof-20230717-205304.zip
Confirming the fix
Device profiling | 1 Hostnames | 1 Processes | 1 Threads | 1 Devices | 1 Subdevices |
Name | Time | Time(%) | Calls | Average | Min | Max |
zeCommandListAppendBarrier | 37.02us | 54.19% | 61 | 606.95ns | 0ns | 2.39us |
addOne | 13.00us | 19.03% | 2 | 6.50us | 2.81us | 10.19us |
zeMemoryCopy(MD) | 10.82us | 15.83% | 3 | 3.61us | 3.43us | 3.95us |
zeMemoryCopy(SM) | 3.43us | 5.02% | 1 | 3.43us | 3.43us | 3.43us |
zeMemoryCopy(DM) | 3.02us | 4.41% | 1 | 3.02us | 3.02us | 3.02us |
zeCommandListAppendWriteGlobalTimestamp | 1.04us | 1.52% | 1 | 1.04us | 1.04us | 1.04us |
Total | 68.33us | 100.00% | 69 |
Explicit memory traffic (BACKEND_ZE) | 1 Hostnames | 1 Processes | 1 Threads |
Name | Byte | Byte(%) | Calls | Average | Min | Max |
zeMemoryCopy(SM) | 8B | 44.44% | 1 | 8.00B | 8B | 8B |
zeMemoryCopy(MD) | 6B | 33.33% | 3 | 2.00B | 1B | 4B |
zeMemoryCopy(DM) | 4B | 22.22% | 1 | 4.00B | 4B | 4B |
Total | 18B | 100.00% | 5 |
Trying to understand why some barriers report 0ns. This may be another bug :)
[15:23:03.833210296] (+0.000001444) AELAB407 lttng_ust_ze_profiling:event_profiling_results: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0D14E0, status = 0, timestampStatus = 0, globalStart = 0, globalEnd = 0, contextStart = 0, contextEnd = 0 }
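To find every such record in a text dump of the trace, a short script can pull the integer fields out of each line and flag the ones whose four timestamps are all zero. This is only a sketch: `parse_fields` and `has_zero_timestamps` are hypothetical helpers, and the regular expression assumes the `name = value` text layout shown in the line above.

```python
import re

# The trace line quoted above, split across string literals for readability.
LINE = (
    "[15:23:03.833210296] (+0.000001444) AELAB407 "
    "lttng_ust_ze_profiling:event_profiling_results: { cpu_id = 10 }, "
    "{ vpid = 1489727, vtid = 1489727 }, "
    "{ hEvent = 0x5560BE0D14E0, status = 0, timestampStatus = 0, "
    "globalStart = 0, globalEnd = 0, contextStart = 0, contextEnd = 0 }"
)

# Matches `name = 0xHEX` or `name = DECIMAL` pairs.
FIELD = re.compile(r"(\w+) = (0x[0-9A-Fa-f]+|\d+)")


def parse_fields(line: str) -> dict:
    """Return every `name = integer` pair found in a text trace line."""
    return {name: int(value, 0) for name, value in FIELD.findall(line)}


def has_zero_timestamps(fields: dict) -> bool:
    """True when all four profiling timestamps are present and equal to 0."""
    keys = ("globalStart", "globalEnd", "contextStart", "contextEnd")
    return all(fields.get(key) == 0 for key in keys)


fields = parse_fields(LINE)
print(hex(fields["hEvent"]), has_zero_timestamps(fields))
```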
Something is fishy in the trace. Either it is a bug on our side or an L0 bug...
2000:[15:22:27.167908299] (+0.000000243) AELAB407 lttng_ust_ze:zeEventCreate_exit: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { zeResult = 0, phEvent_val = 0x5560BE0CD580 }
2239:[15:22:37.187576393] (+0.000000206) AELAB407 lttng_ust_ze:zeEventHostReset_entry: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580 }
2241:[15:22:37.187577068] (+0.000000245) AELAB407 lttng_ust_ze:zeCommandListAppendBarrier_entry: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hCommandList = 0x5560BE17B8A0, hSignalEvent = 0x5560BE0CD580, numWaitEvents = 0, phWaitEvents = 0x0, _phWaitEvents_vals_length = 0, phWaitEvents_vals = [ ] }
2243:[15:22:37.187580716] (+0.000000392) AELAB407 lttng_ust_ze_profiling:event_profiling: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580 }
2287:[15:22:47.207122085] (+0.000000150) AELAB407 lttng_ust_ze:zeEventQueryStatus_entry: { cpu_id = 14 }, { vpid = 1489727, vtid = 1489731 }, { hEvent = 0x5560BE0CD580 }
2369:[15:22:47.246082957] (+0.000001546) AELAB407 lttng_ust_ze:zeEventHostReset_entry: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580 }
2370:[15:22:47.246083520] (+0.000000563) AELAB407 lttng_ust_ze_profiling:event_profiling_results: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580, status = 0, timestampStatus = 0, globalStart = 2023685110, globalEnd = 2023685120, contextStart = 2023685110, contextEnd = 2023685120 }
2374:[15:22:47.246085177] (+0.000000113) AELAB407 lttng_ust_ze:zeCommandListAppendBarrier_entry: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hCommandList = 0x5560BE8F9C80, hSignalEvent = 0x5560BE0CD580, numWaitEvents = 0, phWaitEvents = 0x0, _phWaitEvents_vals_length = 0, phWaitEvents_vals = [ ] }
2376:[15:22:47.246089623] (+0.000000292) AELAB407 lttng_ust_ze_profiling:event_profiling: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580 }
2557:[15:22:49.247437735] (+0.000000265) AELAB407 lttng_ust_ze:zeEventQueryStatus_entry: { cpu_id = 14 }, { vpid = 1489727, vtid = 1489731 }, { hEvent = 0x5560BE0CD580 }
2627:[15:22:49.247624657] (+0.000000628) AELAB407 lttng_ust_ze:zeEventHostReset_entry: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580 }
2628:[15:22:49.247625006] (+0.000000349) AELAB407 lttng_ust_ze_profiling:event_profiling_results: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580, status = 0, timestampStatus = 0, globalStart = 2024435386, globalEnd = 2024435398, contextStart = 2024435386, contextEnd = 2024435398 }
2655:[15:22:49.247643695] (+0.000000092) AELAB407 lttng_ust_ze:zeCommandListAppendBarrier_entry: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hCommandList = 0x5560BEC9E940, hSignalEvent = 0x5560BE0CD0A0, numWaitEvents = 1, phWaitEvents = 0x5560BDF2E870, _phWaitEvents_vals_length = 1, phWaitEvents_vals = [ [0] = 0x5560BE0CD580 ] }
2712:[15:22:59.267542555] (+10.000002667) AELAB407 lttng_ust_ze:zeEventHostSignal_entry: { cpu_id = 0 }, { vpid = 1489727, vtid = 1489732 }, { hEvent = 0x5560BE0CD580 }
5117:[15:23:03.833171595] (+0.000000616) AELAB407 lttng_ust_ze:zeEventDestroy_entry: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580 }
5118:[15:23:03.833173208] (+0.000001613) AELAB407 lttng_ust_ze_profiling:event_profiling_results: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580, status = 0, timestampStatus = 0, globalStart = 0, globalEnd = 0, contextStart = 0, contextEnd = 0 }
Maybe due to the zeEventHostSignal_entry before getting the event_profiling?
No. As you can see above, the event was used successfully for two barriers (we get the results during the reset). After that it is only used as a dependency of another barrier, so the results emitted at destroy should not have been associated with a barrier.
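One possible way to express that rule is to count a results record only when an event_profiling record for the same handle is still pending, and to drop it otherwise. The sketch below is not the actual THAPI fix: `match_results` is a hypothetical helper, the trace is reduced to the profiling records for event 0x5560BE0CD580 from the excerpt above, and durations are left in raw timestamp ticks because the timer resolution is not given in this thread.

```python
from typing import List, Optional, Tuple

H = 0x5560BE0CD580

# (record name, event handle, globalStart, globalEnd), in trace order.
Record = Tuple[str, int, Optional[int], Optional[int]]
TRACE: List[Record] = [
    ("event_profiling", H, None, None),
    ("event_profiling_results", H, 2023685110, 2023685120),
    ("event_profiling", H, None, None),
    ("event_profiling_results", H, 2024435386, 2024435398),
    # Emitted at zeEventDestroy, with no event_profiling pending.
    ("event_profiling_results", H, 0, 0),
]


def match_results(trace: List[Record]) -> List[Tuple[int, int]]:
    """Pair each results record with a pending event_profiling record."""
    pending = set()
    durations = []
    for name, handle, start, end in trace:
        if name == "event_profiling":
            pending.add(handle)
        elif name == "event_profiling_results":
            if handle in pending:
                pending.discard(handle)
                durations.append((handle, end - start))
            # Otherwise no command is waiting on this handle: drop the record.
    return durations


for handle, ticks in match_results(TRACE):
    print(hex(handle), ticks)
```

Under this rule the all-zero record emitted at destroy is ignored, so it never contributes a 0 to the barrier row.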
@Sarbojit2019 Issue fixed. You can also split the table for hip/lz stuff:
applenco@uan-0002:~/THAPI/build> ./ici/bin/iprof --backend-level "ze:1,hip:0" -r ~/iprof-20230717-205304/
BACKEND_HIP | 1 Hostnames | 1 Processes | 1 Threads |
Name | Time | Time(%) | Calls | Average | Min | Max | Error |
hipStreamSynchronize | 12.04s | 32.83% | 2 | 6.02s | 2.02s | 10.02s | 0 |
hipMemcpy | 10.02s | 27.32% | 1 | 10.02s | 10.02s | 10.02s | 0 |
hipDeviceSynchronize | 10.02s | 27.31% | 5 | 2.00s | 642ns | 10.02s | 0 |
hipStreamDestroy | 2.02s | 5.51% | 7 | 288.57ms | 1.08us | 2.02s | 0 |
hipMalloc | 2.00s | 5.46% | 2 | 1.00s | 278.64us | 2.00s | 0 |
__hipUnregisterFatBinary | 525.43ms | 1.43% | 1 | 525.43ms | 525.43ms | 525.43ms | 0 |
hipLaunchKernel | 39.07ms | 0.11% | 2 | 19.53ms | 141.53us | 38.93ms | 0 |
__hipRegisterFatBinary | 10.67ms | 0.03% | 1 | 10.67ms | 10.67ms | 10.67ms | 0 |
hipStreamAddCallback | 2.45ms | 0.01% | 6 | 409.10us | 98.18us | 1.47ms | 0 |
hipStreamCreate | 656.54us | 0.00% | 6 | 109.42us | 16.38us | 483.16us | 0 |
hipEventRecord | 105.30us | 0.00% | 1 | 105.30us | 105.30us | 105.30us | 0 |
hipMemcpyAsync | 72.30us | 0.00% | 1 | 72.30us | 72.30us | 72.30us | 0 |
hipStreamCreateWithFlags | 29.31us | 0.00% | 1 | 29.31us | 29.31us | 29.31us | 0 |
hipStreamWaitEvent | 18.00us | 0.00% | 1 | 18.00us | 18.00us | 18.00us | 0 |
hipFree | 14.15us | 0.00% | 1 | 14.15us | 14.15us | 14.15us | 0 |
hipEventCreate | 9.06us | 0.00% | 1 | 9.06us | 9.06us | 9.06us | 0 |
__hipPushCallConfiguration | 8.94us | 0.00% | 2 | 4.47us | 2.90us | 6.04us | 0 |
__hipPopCallConfiguration | 4.58us | 0.00% | 2 | 2.29us | 2.00us | 2.59us | 0 |
__hipRegisterFunction | 2.85us | 0.00% | 1 | 2.85us | 2.85us | 2.85us | 0 |
hipStreamQuery | 2.09us | 0.00% | 2 | 2.09us | 2.09us | 2.09us | 1 |
__hipRegisterVar | 1.11us | 0.00% | 1 | 1.11us | 1.11us | 1.11us | 0 |
hipGetLastError | 602ns | 0.00% | 2 | 301.00ns | 103ns | 499ns | 0 |
Total | 36.68s | 100.00% | 49 | 1 |
BACKEND_ZE | 1 Hostnames | 1 Processes | 3 Threads |
Name | Time | Time(%) | Calls | Average | Min | Max |
zeCommandQueueSynchronize | 36.10s | 99.88% | 32 | 1.13s | 97ns | 10.02s |
zeModuleCreate | 38.48ms | 0.11% | 1 | 38.48ms | 38.48ms | 38.48ms |
zeCommandQueueExecuteCommandLists | 1.55ms | 0.00% | 34 | 45.49us | 4.27us | 227.47us |
zeCommandListAppendBarrier | 735.86us | 0.00% | 58 | 12.69us | 3.62us | 250.18us |
zeEventDestroy | 671.46us | 0.00% | 1001 | 670.79ns | 531ns | 9.80us |
zeEventCreate | 625.86us | 0.00% | 1001 | 625.23ns | 231ns | 8.59us |
zeCommandQueueCreate | 366.35us | 0.00% | 8 | 45.79us | 12.01us | 106.05us |
zeCommandListAppendMemoryCopy | 231.98us | 0.00% | 5 | 46.40us | 17.53us | 143.73us |
zeCommandListCreate | 209.42us | 0.00% | 34 | 6.16us | 779ns | 24.70us |
zeCommandListDestroy | 167.72us | 0.00% | 34 | 4.93us | 862ns | 25.36us |
zeEventPoolDestroy | 100.66us | 0.00% | 1 | 100.66us | 100.66us | 100.66us |
zeContextDestroy | 85.15us | 0.00% | 1 | 85.15us | 85.15us | 85.15us |
zeEventQueryStatus | 59.06us | 0.00% | 58 | 1.02us | 225ns | 7.50us |
zeEventHostReset | 45.26us | 0.00% | 75 | 603.41ns | 193ns | 2.86us |
zeEventHostSynchronize | 41.54us | 0.00% | 6 | 6.92us | 6.00us | 7.37us |
zeMemAllocShared | 39.34us | 0.00% | 8 | 4.92us | 1.84us | 8.81us |
zeMemAllocDevice | 31.67us | 0.00% | 2 | 15.84us | 14.65us | 17.02us |
zeCommandListAppendLaunchKernel | 15.11us | 0.00% | 2 | 7.55us | 6.07us | 9.04us |
zeCommandListClose | 13.33us | 0.00% | 34 | 392.15ns | 127ns | 4.12us |
zeMemFree | 10.14us | 0.00% | 1 | 10.14us | 10.14us | 10.14us |
zeCommandQueueDestroy | 7.04us | 0.00% | 8 | 880.38ns | 166ns | 3.18us |
zeEventPoolCreate | 6.98us | 0.00% | 2 | 3.49us | 3.16us | 3.83us |
zeCommandListAppendWriteGlobalTimestamp | 6.12us | 0.00% | 1 | 6.12us | 6.12us | 6.12us |
zeEventHostSignal | 5.32us | 0.00% | 6 | 887.33ns | 756ns | 1.13us |
zeDeviceGet | 4.91us | 0.00% | 2 | 2.45us | 2.10us | 2.81us |
zeCommandListAppendSignalEvent | 4.41us | 0.00% | 6 | 735.17ns | 390ns | 1.08us |
zeKernelCreate | 3.93us | 0.00% | 1 | 3.93us | 3.93us | 3.93us |
zeDeviceGetGlobalTimestamps | 3.51us | 0.00% | 1 | 3.51us | 3.51us | 3.51us |
zeKernelSetGroupSize | 1.60us | 0.00% | 2 | 801.50ns | 751ns | 852ns |
zeContextCreateEx | 1.00us | 0.00% | 1 | 1.00us | 1.00us | 1.00us |
zeKernelSetIndirectAccess | 863ns | 0.00% | 2 | 431.50ns | 386ns | 477ns |
zeDriverGet | 673ns | 0.00% | 2 | 336.50ns | 142ns | 531ns |
zeInit | 532ns | 0.00% | 1 | 532.00ns | 532ns | 532ns |
zeModuleGetKernelNames | 508ns | 0.00% | 2 | 254.00ns | 122ns | 386ns |
Total | 36.14s | 100.00% | 2433 |
Device profiling | 1 Hostnames | 1 Processes | 1 Threads | 1 Devices | 1 Subdevices |
Name | Time | Time(%) | Calls | Average | Min | Max |
zeCommandListAppendBarrier | 37.02us | 54.19% | 58 | 638.34ns | 416ns | 2.39us |
addOne | 13.00us | 19.03% | 2 | 6.50us | 2.81us | 10.19us |
zeMemoryCopy(MD) | 10.82us | 15.83% | 3 | 3.61us | 3.43us | 3.95us |
zeMemoryCopy(SM) | 3.43us | 5.02% | 1 | 3.43us | 3.43us | 3.43us |
zeMemoryCopy(DM) | 3.02us | 4.41% | 1 | 3.02us | 3.02us | 3.02us |
zeCommandListAppendWriteGlobalTimestamp | 1.04us | 1.52% | 1 | 1.04us | 1.04us | 1.04us |
Total | 68.33us | 100.00% | 66 |
Explicit memory traffic (BACKEND_ZE) | 1 Hostnames | 1 Processes | 1 Threads |
Name | Byte | Byte(%) | Calls | Average | Min | Max |
zeMemoryCopy(SM) | 8B | 44.44% | 1 | 8.00B | 8B | 8B |
zeMemoryCopy(MD) | 6B | 33.33% | 3 | 2.00B | 1B | 4B |
zeMemoryCopy(DM) | 4B | 22.22% | 1 | 4.00B | 4B | 4B |
Total | 18B | 100.00% | 5 |
Reinstalling should be as easy as spack uninstall thapi@master followed by spack install thapi@master