oxidecomputer / omicron

Omicron: Oxide control plane

Why do these Nexus core files not show some function names in their stack traces?

sunshowers opened this issue and commented:

During today's mupdate, we saw a couple of Nexus cores on gc08 which have been rsynced over to /staff/dock/rack2/mupdate-20240329/cores/sled-08. With both those cores, whether you do pstack or mdb with $c, you get something like:

libc.so.1`_lwp_kill+0xa()
libc.so.1`raise+0x22(6)
libc.so.1`abort+0x58()
0x5a207b9()
0x5a207a9()
rust_panic+0xd()
_ZN3std9panicking20rust_panic_with_hook17h7d19ef586749da2fE+0x2ab()
_ZN3std9panicking19begin_panic_handler28_$u7b$$u7b$closure$u7d$$u7d$17h95e3e1ca24551a68E+0xa4()
0x5a05ef9()
0x5a08802()
0x5a4f6c5()
0x5a4fd55()
_ZN95_$LT$nexus_db_queries..db..sec_store..CockroachDbSecStore$u20$as$u20$steno..store..SecStore$GT$12record_event28_$u7b$$u7b$closure$u7d$$u7d$17hc33ebb1cf636f348E+0xcec()
...

Why do the functions in the middle (0x5a05ef9 etc) not have their names in the stack traces? Maybe they're in std -- but if so, why does std::panicking::begin_panic_handler get shown?

I'm concerned that this issue may be more widespread somehow.

The compiler elides the epilogue of functions that never return, such as __rust_start_panic, so the final call is the last instruction in the function. See also https://www.illumos.org/issues/8357

> 5a207a0::dis
__rust_start_panic:             pushq  %rbp
__rust_start_panic+1:           movq   %rsp,%rbp
__rust_start_panic+4:           call   +0x7     <panic_abort::__rust_start_panic::abort::h09fa6304fad614af>
> ::dis
0x5a207a9:                      nop

In this case the return address on the stack is 0x5a207a9, which is what the backtrace shows, but it lies just past the end of __rust_start_panic, so $c and pstack cannot resolve it to a symbol.
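To make the arithmetic concrete, here is a small sketch that recomputes the return address from the ::dis output above. It assumes the call is a 5-byte x86 near call (opcode plus 32-bit displacement) and that the +0x7 shown by ::dis is relative to the next instruction; the variable names are only for illustration.

```python
# Values read from the ::dis output above.
FUNC_START = 0x5A207A0    # address of __rust_start_panic
CALL_OFFSET = 4           # the call sits at __rust_start_panic+4
REL_DISPLACEMENT = 0x7    # the "+0x7" printed next to the call

# Assumption: a near call is 1 opcode byte plus a 4-byte displacement.
CALL_LEN = 5

call_addr = FUNC_START + CALL_OFFSET
# The call pushes the address of the instruction that would follow it.
return_addr = call_addr + CALL_LEN
# Assumption: the displacement is relative to that next instruction.
call_target = return_addr + REL_DISPLACEMENT

print(hex(return_addr))   # address that ends up in the backtrace
print(hex(call_target))   # start of the callee
```

Under those assumptions the return address is the first byte after __rust_start_panic, which matches the nop that ::dis shows there, and the call target is 8 bytes below the 5a207b8 address that ::whatis reports as abort+8 further down.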

A thing I commonly do here is subtract one from the address and use ::whatis:

> 5a207a8::whatis
5a207a8 is __rust_start_panic+8, in /opt/oxide/omicron-nexus/bin/nexus [400000,5f56000)

The same is true of 0x5a207b9: it's the address directly after the end of panic_abort::__rust_start_panic::abort, as ::whatis on 5a207b8 shows:

5a207b8 is panic_abort::__rust_start_panic::abort::h09fa6304fad614af+8, in /opt/oxide/omicron-nexus/bin/nexus [400000,5f56000)

The others are probably the same thing:

> 0x5a05ef8::whatis
5a05ef8 is std::sys_common::backtrace::__rust_end_short_backtrace::hf4ffec223b5c52aa+8, in /opt/oxide/omicron-nexus/bin/nexus [400000,5f56000)
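The subtract-one trick can be sketched as a lookup that retries at the preceding byte when the address itself misses. This is only an illustration, not how mdb or pstack is implemented: the symbol table is a hand-built list, the start addresses are derived from the +8 offsets above, and the sizes (9 bytes each) are inferred from this thread rather than read from the binary.

```python
from bisect import bisect_right

# (start, size, name). Sizes are an assumption inferred from the output
# in this thread, not taken from the real symbol table.
SYMBOLS = sorted([
    (0x5A05EF0, 9, "std::sys_common::backtrace::__rust_end_short_backtrace"),
    (0x5A207A0, 9, "__rust_start_panic"),
    (0x5A207B0, 9, "panic_abort::__rust_start_panic::abort"),
])
STARTS = [start for start, _, _ in SYMBOLS]


def lookup(addr):
    """Return 'name+0xoff' if addr falls inside a known symbol, else None."""
    i = bisect_right(STARTS, addr) - 1
    if i < 0:
        return None
    start, size, name = SYMBOLS[i]
    if addr < start + size:
        return f"{name}+{addr - start:#x}"
    return None


def resolve_return_address(addr):
    """Resolve a return address, retrying at addr - 1 if addr itself misses."""
    return lookup(addr) or lookup(addr - 1)


for ra in (0x5A207B9, 0x5A207A9, 0x5A05EF9):
    print(hex(ra), "->", resolve_return_address(ra))
```

Retrying at addr - 1 works because a return address always follows a call instruction, so the byte before it belongs to the calling function.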

Thanks @citrus-it! Is there anything that can be done here that will help? It's at least good to know that this only occurs for functions that never return.

I suspect we could actually enhance the debugger to understand that this is occurring, as it seems like it would be relatively predictable.

Yes, some heuristics should be possible. A return address which is directly after the end of a function (and optionally right at the start of another) is a good clue. In this Rust case the unresolved addresses show up because there is a gap of NOPs between the functions. In the example from https://www.illumos.org/issues/8357 there is no such gap, so the stack trace has resolved symbols, but some of them are wrong.
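A minimal sketch of that heuristic, assuming a symbol table is available as (start, size, name) tuples. The function name likely_caller and the no-gap example symbols are hypothetical; the first table reuses addresses from this thread with inferred sizes.

```python
def likely_caller(addr, symbols):
    """Guess which function a return address belongs to.

    symbols is a list of (start, size, name) tuples.
    """
    # A return address exactly at the end of a function suggests that the
    # function's last instruction was a call that never returns. Check this
    # first, so a neighbor starting at the same address does not win.
    for start, size, name in symbols:
        if start + size == addr:
            return name
    # Otherwise fall back to ordinary containment.
    for start, size, name in symbols:
        if start <= addr < start + size:
            return name
    return None


# Gap case, as in this thread: NOP padding separates the functions.
with_gap = [
    (0x5A207A0, 9, "__rust_start_panic"),
    (0x5A207B0, 9, "panic_abort::__rust_start_panic::abort"),
]
print(likely_caller(0x5A207A9, with_gap))

# No-gap case (hypothetical symbols): plain containment would blame the
# neighbor, while the end-of-function check picks the real caller.
no_gap = [
    (0x1000, 0x10, "calls_noreturn"),
    (0x1010, 0x20, "unrelated_neighbor"),
]
print(likely_caller(0x1010, no_gap))
```

Preferring the function that ends at the address over the one that starts there relies on the observation that a return address at offset 0 of a function can only arise when the preceding code ended in a call.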