pytorch / kineto

A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters.

The results captured in "DIFF" view are incomplete compared to those in "NORMAL" view

wizzniu opened this issue · comments

Description

When profiling network training that includes a backward pass and viewing the results in TensorBoard, the DIFF view shows noticeably fewer operator calls than the NORMAL view. Execution time is underreported in the same way. It seems that the DIFF tool only captures the forward thread and does not create the forward-backward association as expected.

Environment

  • Python 3.8.10
  • torch 2.1.0
  • tensorboard 2.14.0

Screenshots

NORMAL view
DIFF view

Reasons

A likely cause is that, when the base run is compared with the exp run, only the main thread is selected for processing and the other threads (e.g., the backward thread) are ignored. Code reference: https://github.com/pytorch/kineto/blob/main/tb_plugin/torch_tb_profiler/run.py#L469
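
To illustrate the suspected effect, here is a minimal, self-contained sketch. It is not the plugin's code: `OpEvent`, `count_calls`, the thread ids, and the sample events are all hypothetical, chosen only to show how counting ops on a single thread drops everything that ran on the backward thread.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass
class OpEvent:
    """Hypothetical stand-in for one profiled operator call."""
    name: str
    tid: int          # id of the thread the op ran on
    duration_us: int  # execution time in microseconds


def count_calls(events: Iterable[OpEvent],
                tids: Optional[Set[int]] = None) -> Dict[str, int]:
    """Count calls per operator, optionally restricted to the given threads."""
    counts: Dict[str, int] = defaultdict(int)
    for ev in events:
        if tids is None or ev.tid in tids:
            counts[ev.name] += 1
    return dict(counts)


# Assumed thread ids: forward ops on the main thread, backward ops on another.
MAIN_TID = 1
BACKWARD_TID = 2

events = [
    OpEvent("aten::linear", MAIN_TID, 120),
    OpEvent("aten::relu", MAIN_TID, 30),
    OpEvent("ReluBackward0", BACKWARD_TID, 40),
    OpEvent("AddmmBackward0", BACKWARD_TID, 150),
]

# Main thread only: the backward ops are missing from the result.
main_only = count_calls(events, {MAIN_TID})

# All threads: forward and backward ops are both counted.
all_threads = count_calls(events)

print("main thread only:", main_only)
print("all threads:     ", all_threads)
```

In this toy data the main-thread-only count contains two operators and the all-thread count contains four, which mirrors the gap between the DIFF and NORMAL views described above.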

After the forward-backward association is created correctly, the backward ops are captured and the DIFF view is shown as expected: