The results captured in "DIFF" view are incomplete compared to those in "NORMAL" view
wizzniu opened this issue · comments
Description
When profiling network training that includes a backward pass and then viewing the results in TensorBoard, the number of operator calls shown in the DIFF view is noticeably lower than in the NORMAL view. The execution times show the same discrepancy. It seems that the DIFF tool only captures the forward thread and does not create the forward-backward association as expected.
Environment
- Python 3.8.10
- torch 2.1.0
- tensorboard 2.14.0
Screenshots
NORMAL view
DIFF view
Possible cause
This may be because, when the base run is compared with the exp run, only the main thread is selected for the comparison and the other threads (e.g., the backward thread) are ignored. See the code at https://github.com/pytorch/kineto/blob/main/tb_plugin/torch_tb_profiler/run.py#L469
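To make the suspected effect concrete, here is a minimal, self-contained sketch. It does not use the plugin's code: `OpEvent`, `count_ops`, `total_time_us`, the thread ids, and the event names and durations are all hypothetical, chosen only to show how restricting a comparison to the main thread could drop every backward operator and its time.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class OpEvent:
    """Hypothetical stand-in for one profiled operator call."""
    name: str
    tid: int          # id of the thread the operator ran on
    duration_us: int  # illustrative duration, not measured data


def count_ops(events, tids=None):
    """Count operator calls, optionally restricted to the given thread ids."""
    return Counter(e.name for e in events if tids is None or e.tid in tids)


def total_time_us(events, tids=None):
    """Sum durations, optionally restricted to the given thread ids."""
    return sum(e.duration_us for e in events if tids is None or e.tid in tids)


# Assumption: forward ops run on the main thread, backward ops on another one.
MAIN_TID, BACKWARD_TID = 1, 2
events = [
    OpEvent("aten::linear", MAIN_TID, 120),
    OpEvent("aten::relu", MAIN_TID, 30),
    OpEvent("aten::mse_loss", MAIN_TID, 40),
    OpEvent("MseLossBackward0", BACKWARD_TID, 50),
    OpEvent("ReluBackward0", BACKWARD_TID, 35),
    OpEvent("AddmmBackward0", BACKWARD_TID, 150),
]

all_threads = count_ops(events)                 # what a NORMAL-style view would count
main_only = count_ops(events, tids={MAIN_TID})  # what a main-thread-only comparison sees
missing = sorted(set(all_threads) - set(main_only))

print("calls (all threads):", sum(all_threads.values()))
print("calls (main thread only):", sum(main_only.values()))
print("ops missing from main-thread-only view:", missing)
print("time (all threads, us):", total_time_us(events))
print("time (main thread only, us):", total_time_us(events, tids={MAIN_TID}))
```

If the DIFF path really filters this way, both the call counts and the execution times would be underreported by exactly the backward thread's share, which matches what the screenshots show.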