javierhonduco / rbperf

Low-overhead sampling profiler and tracer for Ruby for Linux


setup_perf_event fails with ENODEV (`perf` works)

shaver opened this issue

Hi there,

When I run the suggested steps in `tests/programs/Dockerfile.server`, rbperf errors out with:

pid: 38686
libruby: /usr/local/lib/libruby.so.3.0.0 @ 0x7f54251d7000
ruby main thread address: 0x7f5425598138
process base address: 0x55b26e96f000
ruby version: "3.0.0"

Error: setup_perf_event failed with errno No such device

According to the perf_event_open(2) docs, ENODEV indicates that my CPU is missing a feature, but standard Linux `perf` can sample that ruby process without any problem.

My cpuinfo is attached in case it's helpful (AMD 5800X3D): cpuinfo.txt

I'm running kernel Linux CRAGNOR 6.0.9-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 16 Nov 2022 17:01:17 +0000 x86_64 GNU/Linux

I'm not sure if it's a configuration issue (in which case I'll submit a PR for README.md) or some limitation of the way that rbperf is configuring the perf events.

I finally thought of stracing it: `perf_event_open` succeeds a number of times, but then fails when it gets to CPU #16. I have 16 logical processors (0-15), but `/sys/devices/system/cpu/possible`, which underlies `num_possible_cpus`, reports 0-31. Should `setup_perf_event` or `start` handle this more gracefully, or do I have a system misconfiguration here?

strace.txt
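To make the mismatch concrete, here's a minimal Rust sketch (not rbperf's actual code; the `cpu_count` helper is made up) that parses the two sysfs files and shows where the extra CPUs come from:

```rust
use std::fs;
use std::io;

// Hypothetical helper: parse a sysfs CPU list such as "0-15" or "0-3,8-11"
// into a count of CPUs. Panics on malformed input for brevity.
fn cpu_count(path: &str) -> io::Result<usize> {
    let raw = fs::read_to_string(path)?;
    let count = raw
        .trim()
        .split(',')
        .map(|range| {
            let mut bounds = range.splitn(2, '-');
            let start: usize = bounds.next().unwrap().parse().unwrap();
            let end: usize = bounds.next().map_or(start, |e| e.parse().unwrap());
            end - start + 1
        })
        .sum();
    Ok(count)
}

fn main() -> io::Result<()> {
    // On this machine: possible reports 0-31 (32 CPUs) while present reports
    // 0-15 (16 CPUs). Opening perf events on CPUs 16..31 fails with ENODEV.
    let possible = cpu_count("/sys/devices/system/cpu/possible")?;
    let present = cpu_count("/sys/devices/system/cpu/present")?;
    println!("possible: {possible}, present: {present}");
    Ok(())
}
```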

Edit: it looks like perf uses `/sys/devices/system/cpu/present` to determine how many CPUs to call `perf_event_open` on, but I don't see a handy libbpf function to parse that. I jammed in the [num_cpus](https://crates.io/crates/num_cpus) crate and it seems to be working now! I'll submit a PR when I get a minute, though it's pretty simple.
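For reference, the change is roughly this shape (a sketch under my assumptions, not the actual diff; the loop body is paraphrased):

```rust
// Cargo.toml: num_cpus = "1" (the crate linked above)

fn main() {
    // Before (sketch): libbpf's helper counts *possible* CPUs, i.e. it parses
    // /sys/devices/system/cpu/possible, which reports 0-31 on this machine:
    //   let nprocs = libbpf_rs::num_possible_cpus()?;

    // After (sketch): num_cpus::get() counts CPUs that are actually usable,
    // so a per-CPU loop never reaches a non-present CPU.
    let nprocs = num_cpus::get();
    for cpu in 0..nprocs {
        // rbperf opens one perf event per CPU around here (paraphrased);
        // with the possible-CPU count, that open hit ENODEV at cpu == 16.
        println!("would perf_event_open on CPU {cpu}");
    }
}
```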

Oh, this is an interesting case! I expected libbpf's helper to use /sys/devices/system/cpu/present instead. Will bring this up with its maintainers just in case.

Either way, I've never seen a mismatch between possible and present before 😮. There's definitely a bug in how this error is handled, either in libbpf or in rbperf (probably in rbperf, as so many people use libbpf!). This made me wonder whether hyperthreading (HT) might be disabled on your system.

I tried disabling hyperthreading on my box:

# hyperthreading is enabled in my machine
[javierhonduco@fedora parca-agent]$ cat /sys/devices/system/cpu/smt/active
1
[javierhonduco@fedora parca-agent]$ cat /sys/devices/system/cpu/possible
0-11
[javierhonduco@fedora parca-agent]$ cat /sys/devices/system/cpu/online
0-11
# let's disable hyperthreading
[javierhonduco@fedora parca-agent]$ echo off | sudo tee /sys/devices/system/cpu/smt/control
off
# confirming that's disabled
[javierhonduco@fedora parca-agent]$ cat /sys/devices/system/cpu/smt/active
0
[javierhonduco@fedora parca-agent]$ cat /sys/devices/system/cpu/possible
0-11
[javierhonduco@fedora parca-agent]$ cat /sys/devices/system/cpu/online
0-5

And I could reproduce it!

$ rbperf record -p `pidof ruby` cpu
Error: setup_perf_event failed with errno No such device

Would you mind checking whether hyperthreading might be disabled on your box? The output of both `lscpu | grep "Thread(s) per core"` and `cat /sys/devices/system/cpu/smt/active` should help us see whether this theory is correct.

Thanks so much for the detailed issue, BTW :). I'm happy to merge your fix either way, as it doesn't seem like it would cause any regression, but let's check the state of HT on your system first, if you don't mind 😄

I have SMT on; the 5800X3D is 8C/16T. I don't know where the possible range of 0-31 comes from, but it survives disabling hyperthreading as well:

[shaver@CRAGNOR ~]$ cat /sys/devices/system/cpu/smt/active 
1
[shaver@CRAGNOR ~]$ cat /sys/devices/system/cpu/possible
0-31
[shaver@CRAGNOR ~]$ cat /sys/devices/system/cpu/online
0-15
[shaver@CRAGNOR ~]$ echo off | sudo tee /sys/devices/system/cpu/smt/control
off
[shaver@CRAGNOR ~]$ cat /sys/devices/system/cpu/smt/active 
0
[shaver@CRAGNOR ~]$ cat /sys/devices/system/cpu/possible
0-31
[shaver@CRAGNOR ~]$ cat /sys/devices/system/cpu/online
0-7

I found this SO post; it could be related to what you are seeing.

(closing as your PR fixes the issue)