Distributed view empty and no communication shown
aamijar opened this issue
Hi, I am using the sample script resnet50_ddp_profiler.py from this repository: https://github.com/pytorch/kineto/blob/main/tb_plugin/examples/resnet50_ddp_profiler.py
Using:
- Python 3.8
- torch=2.0.1
- torch-tb-profiler=0.4.3 # built from source
In TensorBoard, the overview view shows the communication as 0.
In the distributed view:
- there are no bar charts shown for Synchronizing/Communication Overview.
- the table at the bottom called Communication Operation stats has 0 values in columns total latency, avg latency, data transfer time, avg data transfer time.
When I try using:
- Python 3.8
- torch=1.11.0
- torch-tb-profiler=0.4.3 # built from source

there are no issues and the views show up properly.
However, with torch=1.12+ the same issues appear: the communication data and the distributed view do not show up properly.
Does anyone have any insight into why this may be the case?
I'm looking at the .json logs for both of these runs.
One observation: in the .json generated by torch=2.0.1, for the objects whose name is "ncclKernel_AllReduce_RING_LL_Sum_float(ncclDevComm*, unsigned long, ncclWork*)", the External id and correlation fields have the same value, whereas in torch=1.11.0 the External id and correlation fields have different values.
In torch=1.11.0 the External id also matches various other .json objects whose name can be cudaEventRecord, cudaLaunchKernel, etc.
This is not the case in the .json generated by torch=2.0.1.
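For anyone who wants to check their own traces for the same pattern, here is a minimal sketch of the comparison described above. It assumes the events sit in a top-level "traceEvents" list and that External id and correlation are stored under each event's "args" object. The function names `load_events` and `summarize_nccl_links` and the two sample event lists are hypothetical and only mimic the two behaviors described; they are not real profiler output.

```python
import json
from collections import defaultdict


def load_events(path):
    # Assumption: the trace is a JSON object with a top-level "traceEvents" list.
    with open(path) as f:
        return json.load(f)["traceEvents"]


def summarize_nccl_links(events, kernel_prefix="ncclKernel"):
    # Index event names by "External id" so kernels can be matched to other events.
    names_by_ext_id = defaultdict(set)
    for ev in events:
        ext_id = ev.get("args", {}).get("External id")
        if ext_id is not None:
            names_by_ext_id[ext_id].add(ev.get("name", ""))

    summary = {"kernels": 0, "same_id": 0, "linked": 0}
    for ev in events:
        name = ev.get("name", "")
        if not name.startswith(kernel_prefix):
            continue
        args = ev.get("args", {})
        ext_id, corr = args.get("External id"), args.get("correlation")
        summary["kernels"] += 1
        # Count kernels whose External id equals their correlation id.
        if ext_id is not None and ext_id == corr:
            summary["same_id"] += 1
        # Count kernels whose External id is shared by a differently named event.
        if names_by_ext_id.get(ext_id, set()) - {name}:
            summary["linked"] += 1
    return summary


NCCL = "ncclKernel_AllReduce_RING_LL_Sum_float(ncclDevComm*, unsigned long, ncclWork*)"

# Made-up events imitating the torch=1.11.0 pattern: ids differ, External id is shared.
old_style = [
    {"name": "cudaLaunchKernel", "args": {"External id": 10, "correlation": 501}},
    {"name": NCCL, "args": {"External id": 10, "correlation": 501}},
]
# Made-up events imitating the torch=2.0.1 pattern: ids equal, no other event shares them.
new_style = [
    {"name": "cudaLaunchKernel", "args": {"External id": 10, "correlation": 501}},
    {"name": NCCL, "args": {"External id": 501, "correlation": 501}},
]

print(summarize_nccl_links(old_style))  # {'kernels': 1, 'same_id': 0, 'linked': 1}
print(summarize_nccl_links(new_style))  # {'kernels': 1, 'same_id': 1, 'linked': 0}
```

On a real trace, something like `summarize_nccl_links(load_events("trace.json"))` (the path is a placeholder) would show whether "same_id" is high and "linked" is 0. This only detects the symptom; it does not confirm that it is what causes the empty distributed view.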
@aaronenyeshi Do you know of any way to resolve this, and are you able to reproduce the results above?
Unfortunately, we are lacking resources to fix tb_plugin bugs. Plans for it are still pending.
However, the OSS community is free to submit fixes for these issues via GitHub PRs.