pytorch / kineto

A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters.

Distributed view empty and no communication shown

aamijar opened this issue

Hi, I am using the sample script resnet50_ddp_profiler.py from this repository: https://github.com/pytorch/kineto/blob/main/tb_plugin/examples/resnet50_ddp_profiler.py

Using:

Python 3.8
torch==2.0.1
torch-tb-profiler==0.4.3 # built from source

In TensorBoard's Overview view, the reported communication time is 0.
In the Distributed view:

  • no bar charts are shown for the Synchronizing/Communication Overview;
  • the Communication Operation Stats table at the bottom shows 0 in the total latency, avg latency, data transfer time, and avg data transfer time columns.

When I instead use:

Python 3.8
torch==1.11.0
torch-tb-profiler==0.4.3 # built from source

there are no issues, and the views render properly.

However, for torch==1.12 and later, the Communication and Distributed views still do not render properly.

Does anyone have any insight into why this may be the case?

I'm looking at the .json trace files from both of these runs.

One observation: in the torch==2.0.1 trace, for the events named "ncclKernel_AllReduce_RING_LL_Sum_float(ncclDevComm*, unsigned long, ncclWork*)", the "External id" and "correlation" fields hold the same value, whereas in the torch==1.11.0 trace they hold different values.

In the torch==1.11.0 trace, the "External id" of those NCCL kernel events also matches the "External id" of various other events whose name is cudaEventRecord, cudaLaunchKernel, etc.

This is not the case in the torch==2.0.1 trace.
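To check this without reading the raw JSON by hand, a small script can scan a trace for NCCL kernel events and compare their "External id" and "correlation" fields. This is a sketch against the Chrome-trace layout that kineto emits (events under "traceEvents", with those fields inside "args"); the exact field names may vary between torch versions, so treat them as assumptions and adjust as needed.

```python
import json

def scan_nccl_events(trace_path):
    """Return (name, external_id, correlation) for every NCCL kernel event
    in a kineto Chrome-trace JSON file.

    Assumes the layout described above; adjust the field names if your
    trace differs.
    """
    with open(trace_path) as f:
        trace = json.load(f)
    rows = []
    for ev in trace.get("traceEvents", []):
        args = ev.get("args") or {}
        if "ncclKernel" in ev.get("name", "") and "correlation" in args:
            rows.append((ev["name"], args.get("External id"), args["correlation"]))
    return rows

def split_by_match(rows):
    """Partition rows into events whose External id equals their correlation
    id (the symptom seen in the torch==2.0.1 trace) and events where the
    two values differ (as in the torch==1.11.0 trace)."""
    same = [r for r in rows if r[1] == r[2]]
    diff = [r for r in rows if r[1] != r[2]]
    return same, diff
```

Running this over the torch==1.11.0 and torch==2.0.1 traces should make the difference described above immediately visible.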

@aaronenyeshi Do you know of any ways to resolve this, and are you able to replicate the results above?

Unfortunately, we lack the resources to fix tb_plugin bugs at the moment; plans for it are still pending.

However, the OSS community is free to submit fixes for these issues via GitHub PRs.