pytorch / kineto

A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters.

On demand profiling example / code changes

shradhasehgal opened this issue · comments

Hi, is there an example of how we can enable on-demand profiling with kineto?
The libkineto README mentions that we can send a 'signal' to 'trigger' on-demand profiling, but I am unclear on how to do so from outside the PyTorch script.

I would highly appreciate it if somebody could provide an example or point me to the relevant APIs / source files. Thank you!

I also tried using the third-party library Dynolog but ran into dependency issues during installation.
Since it uses Kineto under the hood, I was wondering whether there is a way to implement on-demand profiling with Kineto directly, rather than using the Dynolog library on top of it.

@shradhasehgal Yes, the README needs a lot more details and updates.
To enable profiling via a signal, you first need to set the config for it.
When kineto starts up, it will read this file:

echo "ENABLE_SIGUSR2=YES" >  /etc/libkineto.conf

When you run your application, set the env variable: export KINETO_USE_DAEMON=1
Then, while the application is running, you can send it a SIGUSR2:

kill -USR2 <pid>

This should dump the trace to /tmp/libkineto.<>_.json.
PS: you can pass additional config options for your on-demand run by populating /tmp/libkineto.conf.
Please let us know if this works.
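Putting those steps together, here is a minimal shell sketch of the flow. The conf path under /tmp and the sleep workload are stand-ins so the sketch can run anywhere; in real use you would write ENABLE_SIGUSR2=YES to /etc/libkineto.conf and start your actual kineto-linked PyTorch script.

```shell
# Sketch of the signal-based flow; `sleep` stands in for a kineto-linked
# PyTorch script (a plain sleep simply exits on the unhandled signal).
CONF=/tmp/libkineto_demo.conf          # real flow: /etc/libkineto.conf
echo "ENABLE_SIGUSR2=YES" > "$CONF"
export KINETO_USE_DAEMON=1

sleep 300 &                            # placeholder workload
APP_PID=$!
kill -USR2 "$APP_PID"                  # with kineto, this starts the trace
wait "$APP_PID" 2>/dev/null            # reap the placeholder

cat "$CONF"                            # prints: ENABLE_SIGUSR2=YES
```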

@shradhasehgal Could you also explain the issue you saw with dynolog?
The flow for collecting traces is much simpler with dynolog (assuming you are using a Linux-based OS):
https://github.com/facebookincubator/dynolog/blob/main/docs/pytorch_profiler.md#triggering-an-on-demand-trace

cc @anupambhatnagar

You don't need to set up all these files, and it will also let you configure the trace duration, which process to target, and so on.

Hi @briancoutinho thanks for your reply! I tried building Dynolog from source. Although I am able to start the Dynolog server, I am not able to capture the traces.

Here's what all I tried:
I explicitly set the environment variable export KINETO_USE_DAEMON=1 and then ran the python script. I also tried running ./build/bin/dyno gputrace --pids <pid>, but I get the 'no processes matched' error.

Hey Brian, thanks for your answer. I wish to enable on-demand profiling so that I can trigger the start and stop of the profiler from outside a Python job.

Along with the file setup you mentioned above, should I declare a sigusr2_handler in the python script I wish to profile? So we would initialize the profiler in the handler the first time we receive a SIGUSR2 signal and stop the profiler the next time we receive it - is my understanding correct?

Hi @shradhasehgal,
On dynolog:

I explicitly set the environment variable export KINETO_USE_DAEMON=1 and then ran the python script. I also tried running ./build/bin/dyno gputrace --pids <pid>, but I get the 'no processes matched' error.

When you run with the environment variable set, do you see the line 'Registering daemon config loader'? This does require a PyTorch version from the 2.0 release onward.

If so, running the command ./build/bin/dyno gputrace in another terminal/shell should collect the GPU trace for you. This assumes the dynolog server is already running, for example via systemctl.
Let me know if you have any questions.

PS: You can always download the binaries from a dynolog release instead of building from source:
- https://github.com/facebookincubator/dynolog/tree/main#installation
- https://github.com/facebookincubator/dynolog/tree/main#no-sudo-access
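Condensed into a shell sketch (commands as given in this thread; the workload and the dyno invocation are shown as comments because they need a GPU box with a running dynolog server, and your_script.py is a placeholder name):

```shell
# Dynolog-based flow; assumes Linux, PyTorch >= 2.0, and a dynolog server
# already running (e.g. via systemctl).
export KINETO_USE_DAEMON=1
# Terminal 1 - run the workload and check its output for the line
# "Registering daemon config loader":
#   python your_script.py
# Terminal 2 - request a GPU trace:
#   ./build/bin/dyno gputrace --log-file /tmp/libkineto_trace.json
echo "daemon flag: $KINETO_USE_DAEMON"
```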

On your question about the SIGUSR2 approach (hoping you are able to get dynolog to work :)):

Along with the set up of the files you mentioned above, should I declare a sigusr2_handler in the python script I wish to profile?

You need not set up a SIGUSR2 handler; Kineto will do that when it sees ENABLE_SIGUSR2=YES in the config file.

So we would initialize the profiler in the handler the first time we receive a sigusr2 signal and stop the profiler the next time we receive it - is my understanding correct?

Yes, that is correct. Kineto/PyTorch will start the trace when you send it a sigusr2. Please let me know if either approach above works.

Hey Brian, thank you for your reply! You're right, I was previously on PyTorch version 1.13 and porting to 2.0 made it work.

However, due to the application that I am building, I do require torch version < 2.0.
Would there be any workarounds to make Dynolog work in such a case? Wondering what functionality of torch 2.0 enabled the registration of the daemon config loader.

Does this method also require torch version >= 2.0?
On torch 1.13, if I send a SIGUSR2 signal to the PyTorch job, it kills the job and outputs 'User defined signal 2'.

Would there be any other way to enable on-demand profiling using an older torch version? I was thinking we could set up a SIGUSR2 handler that toggles the state of the torch profiler (calling profiler.start() and profiler.stop() accordingly).
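For what it's worth, that handler-toggle idea can be sketched with the standard signal module. DummyProfiler below is a hypothetical stand-in for torch.profiler.profile (a real profiler would also export the trace on stop); the toggle logic itself does not depend on torch.

```python
import signal

class DummyProfiler:
    """Stand-in for torch.profiler.profile; records start/stop calls."""
    def __init__(self):
        self.running = False
    def start(self):
        self.running = True
    def stop(self):
        # A real torch profiler would export the trace here, e.g. via
        # export_chrome_trace("/tmp/trace.json").
        self.running = False

profiler = DummyProfiler()

def toggle_profiler(signum, frame):
    # First SIGUSR2 starts the profiler, the next one stops it.
    if profiler.running:
        profiler.stop()
    else:
        profiler.start()

signal.signal(signal.SIGUSR2, toggle_profiler)

# Simulate two external `kill -USR2 <pid>` deliveries:
signal.raise_signal(signal.SIGUSR2)
print(profiler.running)   # True
signal.raise_signal(signal.SIGUSR2)
print(profiler.running)   # False
```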

@shradhasehgal

Would there be any workarounds to make Dynolog work in such a case? Wondering what functionality of torch 2.0 enabled the registration of the daemon config loader.

Oh, it has nothing to do with 2.0 as such; unfortunately that is just when our commits landed in PyTorch, so versions from 2.0 onward make it easier to run this flow with both dynolog and SIGUSR2.
pytorch/pytorch@f4b804e
The reason we need these commits is to initialize the kineto library, which otherwise happens lazily.

Here is one trick you could try. Somewhere near the start of your PyTorch program, just call the profiler once. See this example code: https://github.com/facebookresearch/param/blob/bbd06456832b188777ca1d91cfe0bad751f93fdc/train/compute/python/pytorch/run_benchmark.py#L284-L290
Now kineto should be initialized.

You can then try sending SIGUSR2 to the program as discussed above.

[Screenshot, 2023-06-22: Kineto 'Failed to parse config' error output]

Hi @briancoutinho, I ran the code with the function you shared above. However, when I send the SIGUSR2 signal, it leads to the error "Failed to parse config: Invalid PROFILE_START_TIME: 00:00:00 - start time is more than 10s in the past ; line: PROFILE_START_TIME=0" (see screenshot) and does not generate traces.

I ensured that I let the script run for sufficient time and then sent the SIGUSR2 signal, so Kineto had sufficient time for initialization.

Here is the code I ran, in case it's needed:

import math
import os
import time

import torch
import torch.profiler

with torch.autograd.profiler.profile(
    enabled=True,
    use_cuda=True,
    use_kineto=True,
) as _:
    print("Running dummy profiler warmup for CUPTI.")

print(os.getpid())

dtype = torch.float
device = torch.device("cuda:0")  # Uncomment this to run on GPU

x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

p = torch.tensor([1, 2, 3], device=device)
xx = x.unsqueeze(-1).pow(p)
print(xx.device)
model = torch.nn.Sequential(torch.nn.Linear(3, 1).to(device), torch.nn.Flatten(0, 1))
loss_fn = torch.nn.MSELoss(reduction="sum")

learning_rate = 1e-6
for t in range(20000000):
    y_pred = model(xx)
    loss = loss_fn(y_pred, y)
    if t % 10000 == 99:
        time.sleep(200000)

    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
linear_layer = model[0]

@shradhasehgal I could not reproduce the error using your test code on trunk, i.e. I could collect traces using SIGUSR2.
Could you try this out: you can add options for the signal-based trace by putting this in /tmp/libkineto.conf:

PROFILE_START_TIME=0
ACTIVITIES_DURATION_MSECS=1000

You can change the trace duration if you like; it is set to 1 second for now. PROFILE_START_TIME=0 makes kineto automatically fill in a reasonable start time. Please let us know if this works.

I'll take a stab at using the torch 1.13 release to see if the issue pops up there. Just to confirm, which of these versions are you using?
https://github.com/pytorch/pytorch/releases/tag/v1.13.1 or https://github.com/pytorch/pytorch/releases/tag/v1.13.0
I found a bugfix from March 2022 in this area (#554), but this should be included in torch 1.13.

@briancoutinho I'm having trouble enabling stack, memory, and module tracing through the on-demand config file in /tmp/libkineto.conf:

# /tmp/libkineto.conf
PROFILE_REPORT_INPUT_SHAPES=true
PROFILE_PROFILE_MEMORY=true
PROFILE_WITH_STACK=true
PROFILE_WITH_FLOPS=true
PROFILE_WITH_MODULES=true

I run scripts/pytorch/linear_model_example.py from dynolog after doing export KINETO_USE_DAEMON=1, which gives me 'Registering daemon config loader'.

Running in a separate terminal on the same machine:

dyno gputrace --log-file /tmp/libkineto_trace.json

which gives me

Kineto config = 
ACTIVITIES_LOG_FILE=/tmp/libkineto_trace.json\nPROFILE_START_TIME=0\nACTIVITIES_DURATION_MSECS=500
response length = 143
response = {"activityProfilersBusy":0,"activityProfilersTriggered":[19837],"eventProfilersBusy":0,"eventProfilersTriggered":[],"processesMatched":[19837]}
Matched 1 processes
Trace output files will be written to:
    /tmp/libkineto_trace_19837.json

but the output JSON contains no shape, memory, or stack information.

Hi @briancoutinho, I am using torch 1.12: https://pytorch.org/blog/pytorch-1.12-released/

Is the on-demand tracing feature not available with it?

@shradhasehgal Yes, please try torch 1.13; you are likely hitting the bug, and it is fixed in 1.13.