pytorch / kineto

A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters.

How to trace torch cuda time in C++ using kineto?

TianShaoqing opened this issue · comments

The problem
Hi, I am using the PyTorch profiler to trace the GPU performance of models, and it works well in Python.
For example:

import torch
from torch.autograd.profiler import profile, record_function

with profile(record_shapes=True, use_cuda=True, use_kineto=True, with_stack=False) as prof:
    with record_function("model_inference"):
        a = torch.randn(128, 128, device=torch.device('cuda:0'))
        b = torch.randn(128, 128, device=torch.device('cuda:0'))
        c = a + b

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=50))

Now I want to implement the above code in C++ and get each operator's CUDA (kernel) time, but I found very few relevant examples. So I wrote a C++ program that mirrors the Python interface.

#include <torch/csrc/autograd/profiler_kineto.h>
...
...
// Profile both CPU ops and CUDA kernels.
const std::set<torch::autograd::profiler::ActivityType> activities(
      {torch::autograd::profiler::ActivityType::CPU,
       torch::autograd::profiler::ActivityType::CUDA});

torch::autograd::profiler::prepareProfiler(
      torch::autograd::profiler::ProfilerConfig(
        torch::autograd::profiler::ProfilerState::KINETO, false, false),
      activities);

torch::autograd::profiler::enableProfiler(
      torch::autograd::profiler::ProfilerConfig(
        torch::autograd::profiler::ProfilerState::KINETO, false, false),
      activities);

// Workload to profile.
auto a = torch::rand({128, 128}, {at::kCUDA});
auto b = torch::rand({128, 128}, {at::kCUDA});
auto c = a + b;

auto profiler_results_ptr = torch::autograd::profiler::disableProfiler();
const auto& kineto_events = profiler_results_ptr->events();

for (const auto& e : kineto_events) {
    std::cout << e.name() << " " << e.cudaElapsedUs() << " " << e.durationUs() << std::endl;
}

But the printed CUDA time is always -1, like this:

aten::empty -1 847
aten::uniform_ -1 3005641
aten::rand -1 3006600
aten::empty -1 21
aten::uniform_ -1 53
aten::rand -1 82
aten::add -1 156
cudaStreamIsCapturing -1 8
_ZN2at6native90_GLOBAL__N__66_tmpxft_000055e0_00000000_13_DistributionUniform_compute_86_cpp1_ii_f2fea07d43distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda21uniform_and_transformIffLm4EPNS_17CUDAGeneratorImplEZZZNS4_14uniform_kernelIS7_EEvRNS_18TensorIteratorBaseEddT_ENKUlvE_clEvENKUlvE2_clEvEUlfE_EEvSA_T2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIffLi4ES7_SJ_SE_EEvSA_SF_RKSG_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SF_ -1 2
cudaLaunchKernel -1 3005499
cudaStreamIsCapturing -1 4
_ZN2at6native90_GLOBAL__N__66_tmpxft_000055e0_00000000_13_DistributionUniform_compute_86_cpp1_ii_f2fea07d43distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda21uniform_and_transformIffLm4EPNS_17CUDAGeneratorImplEZZZNS4_14uniform_kernelIS7_EEvRNS_18TensorIteratorBaseEddT_ENKUlvE_clEvENKUlvE2_clEvEUlfE_EEvSA_T2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIffLi4ES7_SJ_SE_EEvSA_SF_RKSG_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SF_ -1 1
cudaLaunchKernel -1 14
void at::native::vectorized_elementwise_kernel<4, at::native::AddFunctor<float>, at::detail::Array<char*, 3> >(int, at::native::AddFunctor<float>, at::detail::Array<char*, 3>) -1 1
cudaLaunchKernel -1 16

I carefully compared the two programs (Python and C++) but could not find the cause of the problem. I also tried other parameter combinations and still could not get the real CUDA time.

Expected behavior
The C++ program should output the CUDA time of each operator, just like the Python version.

Environment
OS: CentOS release 7.5 (Final)
NVIDIA driver version: 460.32.03
CUDA version: 11.2
PyTorch version: 1.9.0+cu111
Python version: 3.6.5
GPU: A10

Hi @TianShaoqing, I took a look at your C++ example. It looks like you are calling the correct profiler APIs in the right order. Looking at the cudaElapsedUs() function, I saw that it returns -1 when the start or end event falls outside the recorded range. So for this case, can you please try adding a synchronize right after auto c = a + b;?
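Something along these lines, slotted into your earlier example (a sketch; I'm assuming torch::cuda::synchronize() from torch/cuda.h is available in your libtorch build):

#include <torch/cuda.h>  // assumption: provides torch::cuda::synchronize()

auto c = a + b;
// Wait for all queued CUDA work to finish so the kernel start/end timestamps
// land inside the profiling window before disableProfiler() is called.
torch::cuda::synchronize();

auto profiler_results_ptr = torch::autograd::profiler::disableProfiler();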

Another way to debug this: can you please save the trace file using profiler_results_ptr->save("./trace.json"); after disableProfiler? We can then take a look at the trace file in chrome://tracing. My guess is that the GPU kernels are queued up, but the profiler is disabled before they finish running.
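For example (sketch, continuing the earlier program):

auto profiler_results_ptr = torch::autograd::profiler::disableProfiler();
// Write a Chrome trace JSON that can be opened in chrome://tracing to check
// whether the GPU kernel activities were actually captured.
profiler_results_ptr->save("./trace.json");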

Hi @aaronenyeshi, thank you for your reply. 😄

Over the last two weeks I have carefully investigated the cause of the error. I found that KinetoEvent::cudaElapsedUs() returns a time > 0 only when the ProfilerState is KINETO_GPU_FALLBACK. But the time it reports is much longer than the actual execution time of the CUDA kernel, so that result is not what I expected either.
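A sketch of the configuration I mean; the only change versus my earlier program is the ProfilerState:

torch::autograd::profiler::prepareProfiler(
      torch::autograd::profiler::ProfilerConfig(
        torch::autograd::profiler::ProfilerState::KINETO_GPU_FALLBACK, false, false),
      activities);

torch::autograd::profiler::enableProfiler(
      torch::autograd::profiler::ProfilerConfig(
        torch::autograd::profiler::ProfilerState::KINETO_GPU_FALLBACK, false, false),
      activities);
// In this state cudaElapsedUs() is populated, but it brackets the whole op
// with CUDA events, which likely explains why it is much longer than the
// kernel execution time alone.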

So I simply followed the steps of parse_kineto_results and EventList._build_tree() from torch.autograd.profiler in Python to implement a C++ version of the event parsing. Finally, the CUDA time output is correct and matches what I expected.

It was my mistake; I thought the problem was simpler than it is. To get the real CUDA time of a torch operator with Kineto, you need to associate each operator with the time of its child functions and correlated kernels, as the Python code does (see the sketch below).
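For anyone landing here later, a rough sketch of just the correlation step, under the same assumptions as my earlier program (PyTorch 1.9's profiler_kineto.h, where KinetoEvent exposes deviceType(), correlationId(), linkedCorrelationId(), and durationUs()); it only sums kernel time per launching CPU-side event and does not build the parent/child tree that EventList._build_tree() builds:

#include <torch/csrc/autograd/profiler_kineto.h>
#include <c10/core/DeviceType.h>
#include <iostream>
#include <unordered_map>
#include <vector>

// Attribute GPU kernel durations back to the CPU-side events that launched
// them, keyed by correlation id (a simplified version of what
// torch.autograd.profiler does when parsing Kineto results).
void print_cuda_time_per_event(
    const std::vector<torch::autograd::profiler::KinetoEvent>& events) {
  std::unordered_map<uint64_t, uint64_t> cuda_us_by_corr;
  for (const auto& e : events) {
    if (e.deviceType() == c10::DeviceType::CUDA) {
      // GPU-side activity: its linked correlation id points back to the
      // CPU-side launch that produced it.
      cuda_us_by_corr[e.linkedCorrelationId()] += e.durationUs();
    }
  }
  for (const auto& e : events) {
    if (e.deviceType() == c10::DeviceType::CPU) {
      auto it = cuda_us_by_corr.find(e.correlationId());
      const uint64_t cuda_us = (it == cuda_us_by_corr.end()) ? 0 : it->second;
      std::cout << e.name() << " cpu=" << e.durationUs() << "us"
                << " cuda=" << cuda_us << "us" << std::endl;
    }
  }
}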

Kineto is still a very powerful profiling tool to me. 👍 Thank you again for your reply. I hope to have the opportunity to collaborate in the future.