Why do these Nexus core files not show some function names in their stack traces?
sunshowers opened this issue
During today's mupdate, we saw a couple of Nexus cores on gc08, which have been rsynced over to /staff/dock/rack2/mupdate-20240329/cores/sled-08. With both of those cores, whether you use pstack or mdb with $c, you get something like:
libc.so.1`_lwp_kill+0xa()
libc.so.1`raise+0x22(6)
libc.so.1`abort+0x58()
0x5a207b9()
0x5a207a9()
rust_panic+0xd()
_ZN3std9panicking20rust_panic_with_hook17h7d19ef586749da2fE+0x2ab()
_ZN3std9panicking19begin_panic_handler28_$u7b$$u7b$closure$u7d$$u7d$17h95e3e1ca24551a68E+0xa4()
0x5a05ef9()
0x5a08802()
0x5a4f6c5()
0x5a4fd55()
_ZN95_$LT$nexus_db_queries..db..sec_store..CockroachDbSecStore$u20$as$u20$steno..store..SecStore$GT$12record_event28_$u7b$$u7b$closure$u7d$$u7d$17hc33ebb1cf636f348E+0xcec()
...
Why do the functions in the middle (0x5a05ef9 etc.) not have their names in the stack traces? Maybe they're in std, but if so, why does std::panicking::begin_panic_handler get shown?
I'm concerned that this issue may be more widespread somehow.
The compiler is eliding epilogues for functions that never return, like __rust_start_panic. See also https://www.illumos.org/issues/8357
> 5a207a0::dis
__rust_start_panic: pushq %rbp
__rust_start_panic+1: movq %rsp,%rbp
__rust_start_panic+4: call +0x7 <panic_abort::__rust_start_panic::abort::h09fa6304fad614af>
> ::dis
0x5a207a9: nop
In this case the return address on the stack is 0x5a207a9, which is what the backtrace shows, but it's outside __rust_start_panic, so it is not resolved by $c or pstack.
A thing I commonly do here is to subtract one from the address and use ::whatis:
> 5a207a8::whatis
5a207a8 is __rust_start_panic+8, in /opt/oxide/omicron-nexus/bin/nexus [400000,5f56000)
The same is true of 0x5a207b9: it's the address directly after the end of the abort function nested inside __rust_start_panic.
> 5a207b8::whatis
5a207b8 is panic_abort::__rust_start_panic::abort::h09fa6304fad614af+8, in /opt/oxide/omicron-nexus/bin/nexus [400000,5f56000)
The others are probably the same thing:
> 0x5a05ef8::whatis
5a05ef8 is std::sys_common::backtrace::__rust_end_short_backtrace::hf4ffec223b5c52aa+8, in /opt/oxide/omicron-nexus/bin/nexus [400000,5f56000)
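The subtract-one trick can be sketched outside the debugger as well. This is a rough Python illustration, not how mdb or pstack actually resolve symbols. The start addresses and 9-byte sizes in the table are inferred from the ::dis and ::whatis output above (each function resolves at +8 and fails to resolve at +9), with the hash suffixes dropped for brevity. The function names resolve and resolve_return_address are made up for this example.

```python
import bisect

# (start, size, name), sorted by start address. Starts and sizes are
# inferred from the ::dis / ::whatis output in this thread, not read
# from the nexus binary itself.
SYMBOLS = [
    (0x5A05EF0, 9, "std::sys_common::backtrace::__rust_end_short_backtrace"),
    (0x5A207A0, 9, "__rust_start_panic"),
    (0x5A207B0, 9, "panic_abort::__rust_start_panic::abort"),
]
STARTS = [start for start, _, _ in SYMBOLS]


def resolve(addr):
    """Return 'name+0xoff' if addr lies inside a known symbol, else None."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i < 0:
        return None
    start, size, name = SYMBOLS[i]
    if addr - start < size:
        return f"{name}+{addr - start:#x}"
    return None


def resolve_return_address(addr):
    """Look up the byte before the return address.

    That byte is the last byte of the call instruction, so it is always
    inside the calling function, even when the call is the function's
    final instruction.
    """
    return resolve(addr - 1)


if __name__ == "__main__":
    for ret in (0x5A207B9, 0x5A207A9, 0x5A05EF9):
        print(f"{ret:#x}: direct={resolve(ret)} "
              f"minus_one={resolve_return_address(ret)}")
```

A direct lookup of each return address finds nothing, while the lookup at one byte lower lands in the caller, matching what ::whatis reports.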
Thanks @citrus-it! Is there anything that can be done here that will help? It's at least good to know that this only occurs for functions that never return.
I suspect we could actually enhance the debugger to understand that this is occurring, as it seems like it would be relatively predictable.
Yes, some heuristics should be possible. A return address which is directly after the end of a function (and optionally right at the start of another) is a good clue. In this Rust case the addresses show up unresolved because there is a gap of NOPs between the functions. In the example from https://www.illumos.org/issues/8357 there is no such gap, so the stack trace has resolved symbols, but some of them are wrong.
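A minimal sketch of what such a heuristic might look like, in Python rather than in the debugger itself. Everything here is hypothetical: the symbol table, the function names, and the classification labels are invented to illustrate the two cases described above (a gap of padding after the function, and no gap at all), and none of this reflects how mdb is implemented.

```python
import bisect

# Hypothetical symbol table: (start, size, name), sorted by start.
SYMBOLS = [
    (0x1000, 0x09, "noreturn_a"),  # ends at 0x1009, NOP padding until 0x1010
    (0x1010, 0x10, "noreturn_b"),  # ends at 0x1020, no padding afterwards
    (0x1020, 0x20, "unrelated"),   # starts exactly where noreturn_b ends
]
STARTS = [start for start, _, _ in SYMBOLS]


def containing(addr):
    """Return the (start, size, name) entry that contains addr, or None."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i >= 0:
        start, size, _ = SYMBOLS[i]
        if start <= addr < start + size:
            return SYMBOLS[i]
    return None


def classify(ret_addr):
    """Classify a return address relative to function boundaries.

    Returns (kind, name), where name is the likely calling function.
    """
    here = containing(ret_addr)
    prev = containing(ret_addr - 1)
    # Unresolved address that sits directly after the end of a function.
    if here is None and prev is not None and prev[0] + prev[1] == ret_addr:
        return ("past-end", prev[2])
    # Resolves, but only to the first byte of the *next* function, so the
    # naive symbol is probably wrong and the previous function is the caller.
    if (here is not None and here[0] == ret_addr
            and prev is not None and prev != here):
        return ("ambiguous", prev[2])
    if here is not None:
        return ("inside", here[2])
    return ("unknown", None)


if __name__ == "__main__":
    for addr in (0x1009, 0x1020, 0x1015, 0x100C):
        print(f"{addr:#x}: {classify(addr)}")
```

The "past-end" case corresponds to the unresolved addresses in this issue, and the "ambiguous" case corresponds to the situation in illumos issue 8357, where the naive lookup succeeds but names the wrong function.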