Deal with off-by-one addresses falling right outside of a symbol

Question

gabrielesvelto opened this issue 2 years ago · comments

We found crashes in the wild (like this one in WebRTC code) where the crash address in the top frame falls right after the end of a symbol. The resulting stacks are garbage as the stack walker starts scanning and finds completely unrelated stuff.

Adjusting the first frame to look up frame.instruction_address - 1 yields the right symbol, unwinding information and stack, so I'm tempted to pile this hack^H^H^H^H heuristic on top of the existing stack-walking code.
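For readers following along, the heuristic can be sketched roughly as follows. This is a minimal illustration, not the actual stack-walking code: the `Symbol` type, the function names and the addresses in the table are all hypothetical, and symbol ranges are assumed to be half-open (start inclusive, end exclusive) and sorted by start address.

```python
import bisect
from typing import List, NamedTuple, Optional


class Symbol(NamedTuple):
    name: str
    start: int  # inclusive
    end: int    # exclusive


def find_symbol(symbols: List[Symbol], address: int) -> Optional[Symbol]:
    """Return the symbol whose [start, end) range contains address, if any."""
    starts = [s.start for s in symbols]
    # Index of the last symbol starting at or before the address.
    i = bisect.bisect_right(starts, address) - 1
    if i >= 0 and symbols[i].start <= address < symbols[i].end:
        return symbols[i]
    return None


def find_symbol_for_top_frame(symbols: List[Symbol],
                              instruction_address: int) -> Optional[Symbol]:
    """Exact lookup first; fall back to address - 1 if nothing matches."""
    hit = find_symbol(symbols, instruction_address)
    if hit is None and instruction_address > 0:
        hit = find_symbol(symbols, instruction_address - 1)
    return hit


# Hypothetical table: "caller" ends at 0x1040, followed by padding
# up to the next symbol at 0x1050.
table = [Symbol("caller", 0x1000, 0x1040), Symbol("next", 0x1050, 0x1080)]
```

With this table, `find_symbol(table, 0x1040)` finds nothing because the end address is exclusive, while `find_symbol_for_top_frame(table, 0x1040)` resolves to `caller`. An address that starts a real symbol, such as 0x1050, is still matched exactly and never reaches the fallback.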

Gabriele Svelto · Answer 1 · Thu Jan 12 2023 20:08:24 GMT+0800 (China Standard Time)

Note that this issue could be caused by dump_syms miscalculating the size of the symbol, or by wrong debug information. Since it's happening in non-Mozilla builds (Ubuntu, Arch), I don't have the time to trace it all the way back to the debug information.

Arpad Borsos · Answer 2 · Thu Jan 12 2023 22:26:54 GMT+0800 (China Standard Time)

I assume the .sym files assume exclusive symbol end addresses? Meaning the next higher symbol starts at the same addr (start is inclusive)? Is this well documented somewhere? Does dump_syms (and symbolic-debuginfo for that matter) behave correctly here?

Also, have you looked at a snippet of disassembly around that instruction addr? Is it an invalid / trap instruction following the normal fn body?

Gabriele Svelto · Answer 3 · Fri Jan 13 2023 04:17:27 GMT+0800 (China Standard Time)

> I assume the .sym files assume exclusive symbol end addresses? Meaning the next higher symbol starts at the same addr (start is inclusive)? Is this well documented somewhere? Does dump_syms (and symbolic-debuginfo for that matter) behave correctly here?

Yes, it's exclusive and they're both consistent with this.

> Also, have you looked at a snippet of disassembly around that instruction addr? Is it an invalid / trap instruction following the normal fn body?

Yeah, it looks like this:

 36e8da4:       e8 b7 cf fa ff          call   3695d60 <_ZNKSt4lessISt17basic_string_viewIcSt11char_traitsIcEEEclERKS3_S6_@@xul108+0x2f90>
 36e8da9:       00 00                   add    %al,(%rax)

Note the zeroes: it's zeroes all the way to the following symbol, which is clearly padding. The address of the call instruction corresponds to this line, so my guess is that this is a tail call and the callee function will return right back to the caller two frames up.
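As a worked check of the disassembly above, the sketch below decodes the five bytes of the call by hand. It assumes the standard x86-64 encoding of a near relative call (opcode E8 followed by a little-endian signed 32-bit displacement, relative to the next instruction); the function name is made up for this example.

```python
import struct

CALL_REL32_OPCODE = 0xE8


def decode_call_rel32(address: int, raw: bytes):
    """Return (return_address, target) for a 5-byte `call rel32`."""
    if len(raw) != 5 or raw[0] != CALL_REL32_OPCODE:
        raise ValueError("not a call rel32 instruction")
    # The displacement is a little-endian signed 32-bit integer.
    (displacement,) = struct.unpack("<i", raw[1:5])
    # The CPU pushes the address of the instruction after the call.
    return_address = address + len(raw)
    # The target is relative to that same next-instruction address.
    target = return_address + displacement
    return return_address, target


return_address, target = decode_call_rel32(0x36E8DA4, bytes.fromhex("e8b7cffaff"))
print(hex(return_address), hex(target))  # 0x36e8da9 0x3695d60
```

The computed target matches the 3695d60 shown by the disassembler, and the return address 0x36e8da9 is the first of the zero bytes, so a callee that returns normally resumes execution in the padding.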

IIRC we already have some heuristics & hacks around this kind of scenario WRT tail calls.

Arpad Borsos · Answer 4 · Fri Jan 13 2023 05:19:19 GMT+0800 (China Standard Time)

I’m by no means an expert on compiler codegen, but shouldn’t tail calls use jmp rather than call?

> the callee function will return right back to the caller two frames up.

That assumes the callee function knows that it is being tail-called, and actually pops the bogus return addr from the stack before doing a ret. Does the callee actually do that?

Gabriele Svelto · Answer 5 · Fri Jan 13 2023 17:49:45 GMT+0800 (China Standard Time)

> I’m by no means an expert on compiler codegen, but shouldn’t tail calls use jmp rather than call?

Yes, I misread the code. There's heavy inlining taking place in that piece of code and the function where the crash is happening is not a tail call.

Arpad Borsos · Answer 6 · Fri Jan 13 2023 23:23:34 GMT+0800 (China Standard Time)

> the function where the crash is happening is not a tail call.

I believe a call followed by padding is a red flag. What does the callee epilogue look like?

Markus Stange · Answer 7 · Sat Jan 14 2023 00:54:10 GMT+0800 (China Standard Time)

There is one legitimate use case for call-followed-by-padding: If the call is to a noreturn function, i.e. if the caller knows that the callee will either hit a call to abort() or will loop indefinitely.

But this doesn't appear to be the case here: The call is to rtc::webrtc_logging_impl::Log(rtc::webrtc_logging_impl::LogArgType const*, ...) which does not seem to be a noreturn function... unless the caller knows that this call will hit one of the two RTC_DCHECK_NOTREACHED() statements. Is that what's happening?

@gabrielesvelto, where did you get the disassembly? Do you have a link to the binary?

Jed Davis · Answer 8 · Sat Jan 14 2023 01:43:47 GMT+0800 (China Standard Time)

I was looking at this over on Bugzilla before I saw this thread, and mostly duplicated what's already been said here. I speculated that the call to Log might have been inside a noreturn call that was inlined into the function whose machine code ends with a call instruction, but I haven't looked at the source code / symbol info closely enough to know if that's possible here.

As for RTC_DCHECK_NOTREACHED, normally anything named DCHECK is only in debug builds, but it's possible to define DCHECK_ALWAYS_ON to override that, and it's possible we're doing that.

Gabriele Svelto · Answer 9 · Sat Jan 14 2023 05:17:44 GMT+0800 (China Standard Time)

> @gabrielesvelto, where did you get the disassembly? Do you have a link to the binary?

It's an Ubuntu snap build. I had to extract it manually; I can put the binary and symbols somewhere to share them.

Gabriele Svelto · Answer 10 · Sat Jan 14 2023 05:48:14 GMT+0800 (China Standard Time)

Here's everything from the snap.

Markus Stange · Answer 11 · Sun Jan 15 2023 08:18:18 GMT+0800 (China Standard Time)

Thanks!

I found the same issue in the official Mozilla build of Linux Firefox 108.0.2, at 0x380675f. So it's not a distro build issue. But I still don't understand why the compiler would generate this code.

Firefox 108.0.2, libxul.so.dbg.gz, libxul.so.sym

% addr2line -pCfi -e /Users/mstange/Downloads/libxul.so.dbg 0x380675f
void rtc::webrtc_logging_impl::LogStreamer<>::Call<rtc::webrtc_logging_impl::Val<(rtc::webrtc_logging_impl::LogArgType)13, rtc::webrtc_logging_impl::LogMetadata>, rtc::webrtc_logging_impl::Val<(rtc::webrtc_logging_impl::LogArgType)9, char const*> >(rtc::webrtc_logging_impl::Val<(rtc::webrtc_logging_impl::LogArgType)13, rtc::webrtc_logging_impl::LogMetadata> const, rtc::webrtc_logging_impl::Val<(rtc::webrtc_logging_impl::LogArgType)9, char const*>&) at /builds/worker/checkouts/gecko/third_party/libwebrtc/rtc_base/logging.h:354
 (inlined by) void rtc::webrtc_logging_impl::LogStreamer<rtc::webrtc_logging_impl::Val<(rtc::webrtc_logging_impl::LogArgType)13, rtc::webrtc_logging_impl::LogMetadata> >::Call<rtc::webrtc_logging_impl::Val<(rtc::webrtc_logging_impl::LogArgType)9, char const*> >(rtc::webrtc_logging_impl::Val<(rtc::webrtc_logging_impl::LogArgType)9, char const*> const&) const at /builds/worker/checkouts/gecko/third_party/libwebrtc/rtc_base/logging.h:384
 (inlined by) void rtc::webrtc_logging_impl::LogStreamer<rtc::webrtc_logging_impl::Val<(rtc::webrtc_logging_impl::LogArgType)9, char const*>, rtc::webrtc_logging_impl::Val<(rtc::webrtc_logging_impl::LogArgType)13, rtc::webrtc_logging_impl::LogMetadata> >::Call<>( const&) const at /builds/worker/checkouts/gecko/third_party/libwebrtc/rtc_base/logging.h:384
 (inlined by) bool rtc::webrtc_logging_impl::LogCall::operator&<rtc::webrtc_logging_impl::Val<(rtc::webrtc_logging_impl::LogArgType)9, char const*>, rtc::webrtc_logging_impl::Val<(rtc::webrtc_logging_impl::LogArgType)13, rtc::webrtc_logging_impl::LogMetadata> >(rtc::webrtc_logging_impl::LogStreamer<rtc::webrtc_logging_impl::Val<(rtc::webrtc_logging_impl::LogArgType)9, char const*>, rtc::webrtc_logging_impl::Val<(rtc::webrtc_logging_impl::LogArgType)13, rtc::webrtc_logging_impl::LogMetadata> > const&) at /builds/worker/checkouts/gecko/third_party/libwebrtc/rtc_base/logging.h:402
 (inlined by) webrtc::BaseCapturerPipeWire::HandleBuffer(pw_buffer*) at /builds/worker/checkouts/gecko/third_party/libwebrtc/modules/desktop_capture/linux/wayland/moz_base_capturer_pipewire.cc:474

moz_base_capturer_pipewire.cc, logging.h

I've also asked Hopper for its opinion on the control flow in this function and it is similarly stumped:

libxul.so_sub_38062d0.pdf

My last hypothesis that's not "compiler bug" is undefined behavior: Maybe the compiler has proven that control flow would hit undefined behavior once the logging function returns, so it deleted the rest of the function, including the return. But, looking at the source code, I haven't been able to find any evidence of undefined behavior.

Gabriele Svelto · Answer 12 · Mon Jan 16 2023 17:28:18 GMT+0800 (China Standard Time)

I think this smells more like a compiler bug than anything else. The issue has been staring right at me in all the minidumps, yet I failed to see it: the crash is caused by executing the padding zeros at the end of the function. That is, the called function executes a ret and jumps back to the address that was left on the stack by the call instruction. The padding zeros are interpreted as add byte [rax], al, and the memory pointed to by rax is accessed. Since rax contains a non-canonical address, this triggers a general protection fault, which is why we get a SIGSEGV / SI_KERNEL signal instead of the usual SIGSEGV / SEGV_MAPERR.
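To make the "non-canonical address" part concrete, here is a small sketch of the canonicality rule on x86-64. It assumes 48-bit virtual addresses (4-level paging), where bits 63 through 47 must all be equal; the function name is hypothetical, and the actual value of rax in the minidumps is not given in this thread, so the example addresses are purely illustrative.

```python
def is_canonical(address: int, virtual_bits: int = 48) -> bool:
    """True if bits 63 down to (virtual_bits - 1) are all 0 or all 1."""
    if not 0 <= address < (1 << 64):
        raise ValueError("address must fit in 64 bits")
    # Keep only the bits that must be copies of each other.
    upper = address >> (virtual_bits - 1)
    all_ones = (1 << (64 - virtual_bits + 1)) - 1
    return upper == 0 or upper == all_ones


print(is_canonical(0x00007FFFFFFFFFFF))  # top of the lower half
print(is_canonical(0xFFFF800000000000))  # bottom of the upper half
print(is_canonical(0x4141414141414141))  # illustrative garbage value
```

The first two addresses are canonical and the third is not. An access through a non-canonical pointer raises a general protection fault rather than a page fault, which fits the SI_KERNEL code described above.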

That being said, the preprocessed lines of code that lead to this are the following:

  if (spaBuffer->datas[0].chunk->size == 0) {
    !rtc::LogMessage::IsNoop<::rtc::LS_ERROR>() && ::rtc::webrtc_logging_impl::LogCall() & ::rtc::webrtc_logging_impl::LogStreamer<>() << ::rtc::webrtc_logging_impl::LogMetadata("/home/gsvelto/projects/mozilla-central/third_party/libwebrtc/modules/desktop_capture/linux/wayland/moz_base_capturer_pipewire.cc", 474, ::rtc::LS_ERROR) << "Failed to get video stream: Zero size.";
    return;
  }

The logging code doesn't assert, abort or crash. In the worst case it ignores the message and returns, so execution should resume afterwards no matter what. I'm closing this and will be updating the bug on our side.