argonne-lcf / THAPI

A tracing infrastructure for heterogeneous computing applications.

Tally issue

Kerilk opened this issue

Device profiling is missing a Min value for zeCommandListAppendBarrier here; no idea why.

Device profiling | 1 Hostnames | 1 Processes | 1 Threads | 1 Devices | 1 Subdevices |
 
                                   Name |    Time | Time(%) | Calls |  Average |    Min |     Max |
             zeCommandListAppendBarrier | 37.02us |  54.19% |    61 | 606.95ns |        |  2.39us |
                                 addOne | 13.00us |  19.03% |     2 |   6.50us | 2.81us | 10.19us |
                       zeMemoryCopy(MD) | 10.82us |  15.83% |     3 |   3.61us | 3.43us |  3.95us |
                       zeMemoryCopy(SM) |  3.43us |   5.02% |     1 |   3.43us | 3.43us |  3.43us |
                       zeMemoryCopy(DM) |  3.02us |   4.41% |     1 |   3.02us | 3.02us |  3.02us |
zeCommandListAppendWriteGlobalTimestamp |  1.04us |   1.52% |     1 |   1.04us | 1.04us |  1.04us |
                                  Total | 68.33us | 100.00% |    69 |

1ee2c48

If you can share the trace, it would be highly appreciated! And sorry for this bug.
I think zeCommandListAppendBarrier returned immediately, so the time was 0 and I didn't display it.
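
A minimal sketch of how a 0 duration can vanish from a table like the one above, assuming the formatter tests the value's truthiness instead of checking for a missing value. The function names are hypothetical and this is not THAPI's actual code, which is not shown in this thread; it only illustrates the class of bug.

```python
def format_ns(value_ns):
    """Format a duration in nanoseconds with the units used in the tables."""
    for unit, scale in (("s", 1e9), ("ms", 1e6), ("us", 1e3)):
        if value_ns >= scale:
            return f"{value_ns / scale:.2f}{unit}"
    return f"{value_ns:.0f}ns"


def cell_buggy(value_ns):
    # Bug: 0 is falsy, so a legitimate 0ns minimum becomes an empty cell.
    return format_ns(value_ns) if value_ns else ""


def cell_fixed(value_ns):
    # Only a missing value (None) gives an empty cell; 0 is printed as 0ns.
    return format_ns(value_ns) if value_ns is not None else ""


print(repr(cell_buggy(0)))
print(repr(cell_fixed(0)))
```

With this kind of check, every other column stays correct, because only the minimum can be exactly 0 while the total is non-zero.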

Here is the trace file
iprof-20230717-205304.zip

Confirming the fix

Device profiling | 1 Hostnames | 1 Processes | 1 Threads | 1 Devices | 1 Subdevices |

                                   Name |    Time | Time(%) | Calls |  Average |    Min |     Max |
             zeCommandListAppendBarrier | 37.02us |  54.19% |    61 | 606.95ns |    0ns |  2.39us |
                                 addOne | 13.00us |  19.03% |     2 |   6.50us | 2.81us | 10.19us |
                       zeMemoryCopy(MD) | 10.82us |  15.83% |     3 |   3.61us | 3.43us |  3.95us |
                       zeMemoryCopy(SM) |  3.43us |   5.02% |     1 |   3.43us | 3.43us |  3.43us |
                       zeMemoryCopy(DM) |  3.02us |   4.41% |     1 |   3.02us | 3.02us |  3.02us |
zeCommandListAppendWriteGlobalTimestamp |  1.04us |   1.52% |     1 |   1.04us | 1.04us |  1.04us |
                                  Total | 68.33us | 100.00% |    69 |

Explicit memory traffic (BACKEND_ZE) | 1 Hostnames | 1 Processes | 1 Threads |

            Name | Byte | Byte(%) | Calls | Average | Min | Max |
zeMemoryCopy(SM) |   8B |  44.44% |     1 |   8.00B |  8B |  8B |
zeMemoryCopy(MD) |   6B |  33.33% |     3 |   2.00B |  1B |  4B |
zeMemoryCopy(DM) |   4B |  22.22% |     1 |   4.00B |  4B |  4B |
           Total |  18B | 100.00% |     5 |

Trying to understand why some barriers have 0ns. This may be another bug :)

@Kerilk

[15:23:03.833210296] (+0.000001444) AELAB407 lttng_ust_ze_profiling:event_profiling_results: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0D14E0, status = 0, timestampStatus = 0, globalStart = 0, globalEnd = 0, contextStart = 0, contextEnd = 0 }

Something is fishy in the trace. It is either our bug or an L0 bug...

2000:[15:22:27.167908299] (+0.000000243) AELAB407 lttng_ust_ze:zeEventCreate_exit: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { zeResult = 0, phEvent_val = 0x5560BE0CD580 }
2239:[15:22:37.187576393] (+0.000000206) AELAB407 lttng_ust_ze:zeEventHostReset_entry: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580 }
2241:[15:22:37.187577068] (+0.000000245) AELAB407 lttng_ust_ze:zeCommandListAppendBarrier_entry: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hCommandList = 0x5560BE17B8A0, hSignalEvent = 0x5560BE0CD580, numWaitEvents = 0, phWaitEvents = 0x0, _phWaitEvents_vals_length = 0, phWaitEvents_vals = [ ] }
2243:[15:22:37.187580716] (+0.000000392) AELAB407 lttng_ust_ze_profiling:event_profiling: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580 }
2287:[15:22:47.207122085] (+0.000000150) AELAB407 lttng_ust_ze:zeEventQueryStatus_entry: { cpu_id = 14 }, { vpid = 1489727, vtid = 1489731 }, { hEvent = 0x5560BE0CD580 }
2369:[15:22:47.246082957] (+0.000001546) AELAB407 lttng_ust_ze:zeEventHostReset_entry: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580 }
2370:[15:22:47.246083520] (+0.000000563) AELAB407 lttng_ust_ze_profiling:event_profiling_results: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580, status = 0, timestampStatus = 0, globalStart = 2023685110, globalEnd = 2023685120, contextStart = 2023685110, contextEnd = 2023685120 }
2374:[15:22:47.246085177] (+0.000000113) AELAB407 lttng_ust_ze:zeCommandListAppendBarrier_entry: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hCommandList = 0x5560BE8F9C80, hSignalEvent = 0x5560BE0CD580, numWaitEvents = 0, phWaitEvents = 0x0, _phWaitEvents_vals_length = 0, phWaitEvents_vals = [ ] }
2376:[15:22:47.246089623] (+0.000000292) AELAB407 lttng_ust_ze_profiling:event_profiling: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580 }
2557:[15:22:49.247437735] (+0.000000265) AELAB407 lttng_ust_ze:zeEventQueryStatus_entry: { cpu_id = 14 }, { vpid = 1489727, vtid = 1489731 }, { hEvent = 0x5560BE0CD580 }
2627:[15:22:49.247624657] (+0.000000628) AELAB407 lttng_ust_ze:zeEventHostReset_entry: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580 }
2628:[15:22:49.247625006] (+0.000000349) AELAB407 lttng_ust_ze_profiling:event_profiling_results: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580, status = 0, timestampStatus = 0, globalStart = 2024435386, globalEnd = 2024435398, contextStart = 2024435386, contextEnd = 2024435398 }
2655:[15:22:49.247643695] (+0.000000092) AELAB407 lttng_ust_ze:zeCommandListAppendBarrier_entry: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hCommandList = 0x5560BEC9E940, hSignalEvent = 0x5560BE0CD0A0, numWaitEvents = 1, phWaitEvents = 0x5560BDF2E870, _phWaitEvents_vals_length = 1, phWaitEvents_vals = [ [0] = 0x5560BE0CD580 ] }
2712:[15:22:59.267542555] (+10.000002667) AELAB407 lttng_ust_ze:zeEventHostSignal_entry: { cpu_id = 0 }, { vpid = 1489727, vtid = 1489732 }, { hEvent = 0x5560BE0CD580 }
5117:[15:23:03.833171595] (+0.000000616) AELAB407 lttng_ust_ze:zeEventDestroy_entry: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580 }
5118:[15:23:03.833173208] (+0.000001613) AELAB407 lttng_ust_ze_profiling:event_profiling_results: { cpu_id = 10 }, { vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580, status = 0, timestampStatus = 0, globalStart = 0, globalEnd = 0, contextStart = 0, contextEnd = 0 }
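
Lines like the last one can be picked out of a text dump mechanically. The sketch below is one way to do it, assuming the text format shown above (`name = value` pairs inside braces); the helper names are hypothetical and the two sample lines are abridged to the event name and payload of trace lines 2628 and 5118.

```python
import re

# Matches `name = value` pairs where value is a hex or decimal integer.
FIELD_RE = re.compile(r"(\w+) = (0x[0-9A-Fa-f]+|\d+)")
TIMESTAMP_FIELDS = ("globalStart", "globalEnd", "contextStart", "contextEnd")


def parse_fields(line):
    """Return the integer payload fields of one text-format trace line."""
    fields = {}
    for name, value in FIELD_RE.findall(line):
        fields[name] = int(value, 16) if value.startswith("0x") else int(value)
    return fields


def is_empty_profiling_result(line):
    """True for event_profiling_results lines whose four timestamps are all 0."""
    if "event_profiling_results:" not in line:
        return False
    fields = parse_fields(line)
    return all(fields.get(name) == 0 for name in TIMESTAMP_FIELDS)


good = ("lttng_ust_ze_profiling:event_profiling_results: { cpu_id = 10 }, "
        "{ vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580, "
        "status = 0, timestampStatus = 0, globalStart = 2024435386, "
        "globalEnd = 2024435398, contextStart = 2024435386, "
        "contextEnd = 2024435398 }")
empty = ("lttng_ust_ze_profiling:event_profiling_results: { cpu_id = 10 }, "
         "{ vpid = 1489727, vtid = 1489727 }, { hEvent = 0x5560BE0CD580, "
         "status = 0, timestampStatus = 0, globalStart = 0, globalEnd = 0, "
         "contextStart = 0, contextEnd = 0 }")

print(is_empty_profiling_result(good), is_empty_profiling_result(empty))
```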

Maybe it is due to the zeEventHostSignal_entry that occurs before the event_profiling results are read?

No. As you can see above, the event was used successfully for 2 barriers (we get the results during the reset). It is then used as a dependency of another barrier, not as its signal event, so the final all-zero result should not have been associated with that barrier.
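
The reasoning above can be sketched as a small replay over the trace, assuming each `event_profiling` record registers an event handle for one command and each `event_profiling_results` record consumes that registration. This is an illustration with hypothetical names, not the actual fix that went into THAPI.

```python
def attribute_results(stream):
    """Pair each profiling result with the command that registered the event.

    `stream` is a list of tuples:
      ("profiling", handle, command_name)  the event was attached to a command
      ("results", handle, start, end)      timestamps were read back
    Returns (attributed, orphans).
    """
    pending = {}     # event handle -> command waiting for its timestamps
    attributed = []  # (command_name, end - start) in raw timestamp units
    orphans = []     # results with no registered command; not tallied
    for record in stream:
        if record[0] == "profiling":
            _, handle, command = record
            pending[handle] = command
        elif record[0] == "results":
            _, handle, start, end = record
            # pop() consumes the registration, so a later read on the same
            # handle cannot be charged to a command that already completed.
            command = pending.pop(handle, None)
            if command is None:
                orphans.append((handle, start, end))
            else:
                attributed.append((command, end - start))
    return attributed, orphans


EVENT = 0x5560BE0CD580
stream = [
    ("profiling", EVENT, "zeCommandListAppendBarrier"),  # trace line 2243
    ("results", EVENT, 2023685110, 2023685120),          # trace line 2370
    ("profiling", EVENT, "zeCommandListAppendBarrier"),  # trace line 2376
    ("results", EVENT, 2024435386, 2024435398),          # trace line 2628
    ("results", EVENT, 0, 0),                            # trace line 5118
]
attributed, orphans = attribute_results(stream)
print(attributed, orphans)
```

In this replay the two barriers that signaled the event get their durations, and the all-zero result read at zeEventDestroy ends up in `orphans` because no command registered the event after the last reset.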

@Sarbojit2019 Issue fixed. You can also split the table for hip/lz stuff:

applenco@uan-0002:~/THAPI/build> ./ici/bin/iprof --backend-level "ze:1,hip:0"  -r ~/iprof-20230717-205304/
BACKEND_HIP | 1 Hostnames | 1 Processes | 1 Threads |

                      Name |     Time | Time(%) | Calls |  Average |      Min |      Max | Error |
      hipStreamSynchronize |   12.04s |  32.83% |     2 |    6.02s |    2.02s |   10.02s |     0 |
                 hipMemcpy |   10.02s |  27.32% |     1 |   10.02s |   10.02s |   10.02s |     0 |
      hipDeviceSynchronize |   10.02s |  27.31% |     5 |    2.00s |    642ns |   10.02s |     0 |
          hipStreamDestroy |    2.02s |   5.51% |     7 | 288.57ms |   1.08us |    2.02s |     0 |
                 hipMalloc |    2.00s |   5.46% |     2 |    1.00s | 278.64us |    2.00s |     0 |
  __hipUnregisterFatBinary | 525.43ms |   1.43% |     1 | 525.43ms | 525.43ms | 525.43ms |     0 |
           hipLaunchKernel |  39.07ms |   0.11% |     2 |  19.53ms | 141.53us |  38.93ms |     0 |
    __hipRegisterFatBinary |  10.67ms |   0.03% |     1 |  10.67ms |  10.67ms |  10.67ms |     0 |
      hipStreamAddCallback |   2.45ms |   0.01% |     6 | 409.10us |  98.18us |   1.47ms |     0 |
           hipStreamCreate | 656.54us |   0.00% |     6 | 109.42us |  16.38us | 483.16us |     0 |
            hipEventRecord | 105.30us |   0.00% |     1 | 105.30us | 105.30us | 105.30us |     0 |
            hipMemcpyAsync |  72.30us |   0.00% |     1 |  72.30us |  72.30us |  72.30us |     0 |
  hipStreamCreateWithFlags |  29.31us |   0.00% |     1 |  29.31us |  29.31us |  29.31us |     0 |
        hipStreamWaitEvent |  18.00us |   0.00% |     1 |  18.00us |  18.00us |  18.00us |     0 |
                   hipFree |  14.15us |   0.00% |     1 |  14.15us |  14.15us |  14.15us |     0 |
            hipEventCreate |   9.06us |   0.00% |     1 |   9.06us |   9.06us |   9.06us |     0 |
__hipPushCallConfiguration |   8.94us |   0.00% |     2 |   4.47us |   2.90us |   6.04us |     0 |
 __hipPopCallConfiguration |   4.58us |   0.00% |     2 |   2.29us |   2.00us |   2.59us |     0 |
     __hipRegisterFunction |   2.85us |   0.00% |     1 |   2.85us |   2.85us |   2.85us |     0 |
            hipStreamQuery |   2.09us |   0.00% |     2 |   2.09us |   2.09us |   2.09us |     1 |
          __hipRegisterVar |   1.11us |   0.00% |     1 |   1.11us |   1.11us |   1.11us |     0 |
           hipGetLastError |    602ns |   0.00% |     2 | 301.00ns |    103ns |    499ns |     0 |
                     Total |   36.68s | 100.00% |    49 |                                      1 |

BACKEND_ZE | 1 Hostnames | 1 Processes | 3 Threads |

                                   Name |     Time | Time(%) | Calls |  Average |      Min |      Max |
              zeCommandQueueSynchronize |   36.10s |  99.88% |    32 |    1.13s |     97ns |   10.02s |
                         zeModuleCreate |  38.48ms |   0.11% |     1 |  38.48ms |  38.48ms |  38.48ms |
      zeCommandQueueExecuteCommandLists |   1.55ms |   0.00% |    34 |  45.49us |   4.27us | 227.47us |
             zeCommandListAppendBarrier | 735.86us |   0.00% |    58 |  12.69us |   3.62us | 250.18us |
                         zeEventDestroy | 671.46us |   0.00% |  1001 | 670.79ns |    531ns |   9.80us |
                          zeEventCreate | 625.86us |   0.00% |  1001 | 625.23ns |    231ns |   8.59us |
                   zeCommandQueueCreate | 366.35us |   0.00% |     8 |  45.79us |  12.01us | 106.05us |
          zeCommandListAppendMemoryCopy | 231.98us |   0.00% |     5 |  46.40us |  17.53us | 143.73us |
                    zeCommandListCreate | 209.42us |   0.00% |    34 |   6.16us |    779ns |  24.70us |
                   zeCommandListDestroy | 167.72us |   0.00% |    34 |   4.93us |    862ns |  25.36us |
                     zeEventPoolDestroy | 100.66us |   0.00% |     1 | 100.66us | 100.66us | 100.66us |
                       zeContextDestroy |  85.15us |   0.00% |     1 |  85.15us |  85.15us |  85.15us |
                     zeEventQueryStatus |  59.06us |   0.00% |    58 |   1.02us |    225ns |   7.50us |
                       zeEventHostReset |  45.26us |   0.00% |    75 | 603.41ns |    193ns |   2.86us |
                 zeEventHostSynchronize |  41.54us |   0.00% |     6 |   6.92us |   6.00us |   7.37us |
                       zeMemAllocShared |  39.34us |   0.00% |     8 |   4.92us |   1.84us |   8.81us |
                       zeMemAllocDevice |  31.67us |   0.00% |     2 |  15.84us |  14.65us |  17.02us |
        zeCommandListAppendLaunchKernel |  15.11us |   0.00% |     2 |   7.55us |   6.07us |   9.04us |
                     zeCommandListClose |  13.33us |   0.00% |    34 | 392.15ns |    127ns |   4.12us |
                              zeMemFree |  10.14us |   0.00% |     1 |  10.14us |  10.14us |  10.14us |
                  zeCommandQueueDestroy |   7.04us |   0.00% |     8 | 880.38ns |    166ns |   3.18us |
                      zeEventPoolCreate |   6.98us |   0.00% |     2 |   3.49us |   3.16us |   3.83us |
zeCommandListAppendWriteGlobalTimestamp |   6.12us |   0.00% |     1 |   6.12us |   6.12us |   6.12us |
                      zeEventHostSignal |   5.32us |   0.00% |     6 | 887.33ns |    756ns |   1.13us |
                            zeDeviceGet |   4.91us |   0.00% |     2 |   2.45us |   2.10us |   2.81us |
         zeCommandListAppendSignalEvent |   4.41us |   0.00% |     6 | 735.17ns |    390ns |   1.08us |
                         zeKernelCreate |   3.93us |   0.00% |     1 |   3.93us |   3.93us |   3.93us |
            zeDeviceGetGlobalTimestamps |   3.51us |   0.00% |     1 |   3.51us |   3.51us |   3.51us |
                   zeKernelSetGroupSize |   1.60us |   0.00% |     2 | 801.50ns |    751ns |    852ns |
                      zeContextCreateEx |   1.00us |   0.00% |     1 |   1.00us |   1.00us |   1.00us |
              zeKernelSetIndirectAccess |    863ns |   0.00% |     2 | 431.50ns |    386ns |    477ns |
                            zeDriverGet |    673ns |   0.00% |     2 | 336.50ns |    142ns |    531ns |
                                 zeInit |    532ns |   0.00% |     1 | 532.00ns |    532ns |    532ns |
                 zeModuleGetKernelNames |    508ns |   0.00% |     2 | 254.00ns |    122ns |    386ns |
                                  Total |   36.14s | 100.00% |  2433 |

Device profiling | 1 Hostnames | 1 Processes | 1 Threads | 1 Devices | 1 Subdevices |

                                   Name |    Time | Time(%) | Calls |  Average |    Min |     Max |
             zeCommandListAppendBarrier | 37.02us |  54.19% |    58 | 638.34ns |  416ns |  2.39us |
                                 addOne | 13.00us |  19.03% |     2 |   6.50us | 2.81us | 10.19us |
                       zeMemoryCopy(MD) | 10.82us |  15.83% |     3 |   3.61us | 3.43us |  3.95us |
                       zeMemoryCopy(SM) |  3.43us |   5.02% |     1 |   3.43us | 3.43us |  3.43us |
                       zeMemoryCopy(DM) |  3.02us |   4.41% |     1 |   3.02us | 3.02us |  3.02us |
zeCommandListAppendWriteGlobalTimestamp |  1.04us |   1.52% |     1 |   1.04us | 1.04us |  1.04us |
                                  Total | 68.33us | 100.00% |    66 |

Explicit memory traffic (BACKEND_ZE) | 1 Hostnames | 1 Processes | 1 Threads |

            Name | Byte | Byte(%) | Calls | Average | Min | Max |
zeMemoryCopy(SM) |   8B |  44.44% |     1 |   8.00B |  8B |  8B |
zeMemoryCopy(MD) |   6B |  33.33% |     3 |   2.00B |  1B |  4B |
zeMemoryCopy(DM) |   4B |  22.22% |     1 |   4.00B |  4B |  4B |
           Total |  18B | 100.00% |     5 |
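
As a quick consistency check on the zeCommandListAppendBarrier row of the Device profiling tables: Time stays at 37.02us while Calls drops from 61 to 58, which is what one would expect if the three removed entries each contributed 0ns. The sketch below redoes the averages from the rounded Time value, so the results only match the displayed 606.95ns and 638.34ns to within rounding.

```python
total_ns = 37.02e3  # Time column for zeCommandListAppendBarrier, both tables
calls_before, calls_after = 61, 58

avg_before = total_ns / calls_before  # table shows 606.95ns
avg_after = total_ns / calls_after    # table shows 638.34ns
dropped = calls_before - calls_after  # entries that no longer count

print(f"{avg_before:.2f}ns {avg_after:.2f}ns dropped={dropped}")
```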

Reinstalling should be as easy as spack uninstall thapi@master followed by spack install thapi@master.