ianlancetaylor / cgosymbolizer

Experimental symbolizer for cgo backtraces

Deadlocks when using cgosymbolizer, cpp exceptions and a trace agent with continuous profiling

dany74q opened this issue

Hey!

We've recently tackled a recurring deadlock in one of our Go applications. The app uses the following:

  • A continuous profiling agent (which calls pprof.StartCPUProfile under the hood every minute, so SIGPROF is delivered while profiling is active)
  • cgo code that wraps a third-party C++ library, which frequently throws and catches various exceptions
  • cgosymbolizer

We've noticed that, at least when compiling with gcc 8.3.0 and glibc 2.28-10, calling _Unwind_Backtrace within the symbolizer leads to a call to dl_iterate_phdr, which acquires a process-wide lock using pthread_mutex_lock.
The lock is not acquired in a reentrancy-safe way, so this code path is not async-signal-safe and is prone to deadlocks.

The combination of C++ exception stack unwinding (which also goes through dl_iterate_phdr) and our continuous SIGPROF delivery, which invokes cgosymbolizer, seems to be the culprit of the deadlocks we've had. For now, we've disabled cgosymbolizer, as our profiling data is used for monitoring.

Here's a sample gdb session in a dummy reproducer:

(gdb) info threads
  Id   Target Id                                        Frame
* 1    Thread 0x7ff536e39740 (LWP 8) "cgosymbolizer-d"  runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:520
  2    Thread 0x7ff5101a7700 (LWP 9) "cgosymbolizer-d"  runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:520
  3    Thread 0x7ff50f806700 (LWP 10) "cgosymbolizer-d" runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:520
  4    Thread 0x7ff50f005700 (LWP 11) "cgosymbolizer-d" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
  5    Thread 0x7ff50e7c4700 (LWP 12) "cgosymbolizer-d" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
  6    Thread 0x7ff50dfc3700 (LWP 13) "cgosymbolizer-d" runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:520


(gdb) thread 4
[Switching to thread 4 (Thread 0x7ff50f005700 (LWP 11))]
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
103     ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1  0x00007ff53732a7d1 in __GI___pthread_mutex_lock (mutex=0x7ff537370990 <_rtld_global+2352>) at ../nptl/pthread_mutex_lock.c:115
#2  0x00007ff536f71a7f in __GI___dl_iterate_phdr (callback=callback@entry=0x4bd620 <phdr_callback>, data=data@entry=0x7ff50f004880) at dl-iteratephdr.c:40
#3  0x00000000004bd8ac in backtrace_initialize (state=state@entry=0x7ff537347000, filename=filename@entry=0x1976430 "./cgosymbolizer-deadlock",
    descriptor=<optimized out>, error_callback=error_callback@entry=0x4be5d0 <errorCallback>, data=data@entry=0x7ff50f004b10,
    fileline_fn=fileline_fn@entry=0x7ff50f004918) at elf.c:4894
#4  0x00000000004bdaaa in fileline_initialize (state=state@entry=0x7ff537347000, error_callback=error_callback@entry=0x4be5d0 <errorCallback>,
    data=data@entry=0x7ff50f004b10) at fileline.c:261
#5  0x00000000004bdb92 in backtrace_pcinfo (state=0x7ff537347000, pc=140691166044593, callback=0x4be510 <callback>, error_callback=0x4be5d0 <errorCallback>,
    data=0x7ff50f004b10) at fileline.c:295
#6  0x00000000004be66d in cgoSymbolizer (parg=0x7ff50f004b10) at symbolizer.c:106
#7  0x000000000046200d in runtime.asmcgocall () at /usr/local/go/src/runtime/asm_amd64.s:795
#8  0x0000000000000000 in ?? ()
(gdb) frame 1
#1  0x00007ff53732a7d1 in __GI___pthread_mutex_lock (mutex=0x7ff537370990 <_rtld_global+2352>) at ../nptl/pthread_mutex_lock.c:115
115     ../nptl/pthread_mutex_lock.c: No such file or directory.
(gdb) p mutex.__data
$1 = {__lock = 2, __count = 0, __owner = 0, __nusers = 0, __kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}

(gdb) thread 5
[Switching to thread 5 (Thread 0x7ff50e7c4700 (LWP 12))]
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
103     ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1  0x00007ff53732a7d1 in __GI___pthread_mutex_lock (mutex=0x7ff537370990 <_rtld_global+2352>) at ../nptl/pthread_mutex_lock.c:115
#2  0x00007ff536f71a7f in __GI___dl_iterate_phdr (callback=0x7ff5370110b0, data=0xc00008b4f0) at dl-iteratephdr.c:40
#3  0x00007ff537012361 in _Unwind_Find_FDE () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#4  0x00007ff53700ea43 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#5  0x00007ff53700fc20 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#6  0x00007ff537010928 in _Unwind_Backtrace () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#7  0x00000000004be7ec in cgoTraceback (parg=0xc00008ba70, parg@entry=<error reading variable: value has been optimized out>) at traceback.c:82
#8  0x00000000004b2e06 in x_cgo_callers (sig=27, info=0xc00008bbf0, context=0xc00008bac0, cgoTraceback=<optimized out>, cgoCallers=<optimized out>,
    sigtramp=0x463de0 <runtime.sigtramp>) at gcc_traceback.c:42
#9  <signal handler called>
#10 0x00007ff53732a7c0 in __GI___pthread_mutex_lock (mutex=0x7ff537370990 <_rtld_global+2352>) at ../nptl/pthread_mutex_lock.c:115
#11 0x00007ff536f71a7f in __GI___dl_iterate_phdr (callback=0x7ff5370110b0, data=0x7ff50e7c3770) at dl-iteratephdr.c:40
#12 0x00007ff537012361 in _Unwind_Find_FDE () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#13 0x00007ff53700ea43 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#14 0x00007ff53700fe5d in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#15 0x00007ff537010391 in _Unwind_RaiseException () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#16 0x00007ff53722eb27 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#17 0x0000000000403456 in throwAndCatch ()
#18 0x000000c00003a7d0 in ?? ()
#19 0x000000c00003a798 in ?? ()
#20 0x0000000000000000 in ?? ()

One can note that the lock is in an inconsistent state (__owner isn't set), and that the symbolizer was called from x_cgo_callers with sig=27 (SIGPROF).
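
One plausible reading of the thread 5 backtrace, offered as an assumption rather than a confirmed analysis of glibc, is that the signal arrived after the low-level lock was taken but before the owner was recorded, so the handler's second call could not recognise the lock as its own. The toy model below illustrates that window. The recursiveMutex type and the thread id are invented for the example, and TryLock stands in for the blocking acquire so the program can report the outcome instead of hanging.

```go
package main

import (
	"fmt"
	"sync"
)

// recursiveMutex is a toy model, not glibc's implementation: a raw lock
// plus an owner field that is written only after the raw lock is held.
type recursiveMutex struct {
	raw   sync.Mutex
	owner int // 0 means "no owner recorded"
	count int
}

// tryLock reports whether caller id could take the lock without blocking.
func (m *recursiveMutex) tryLock(id int) bool {
	if m.owner == id { // re-entry by the recorded owner
		m.count++
		return true
	}
	if !m.raw.TryLock() {
		return false // a blocking acquire would wait here
	}
	m.owner, m.count = id, 1
	return true
}

// simulate models a "signal handler" re-entering the lock on the same
// thread. With interrupted set, the handler runs after the raw lock is
// taken but before the owner is recorded.
func simulate(interrupted bool) (handlerWouldBlock bool) {
	var m recursiveMutex
	const tid = 12 // hypothetical thread id
	if interrupted {
		m.raw.Lock() // raw lock held, owner still 0
	} else {
		m.tryLock(tid) // lock fully acquired, owner recorded
	}
	return !m.tryLock(tid)
}

func main() {
	fmt.Println("handler blocks when interrupted mid-lock:", simulate(true))
	fmt.Println("handler blocks after a completed lock:", simulate(false))
}
```

In the interrupted case the handler would wait on a lock that only the interrupted code underneath it can release, which matches the shape of the hang in the session above.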

I've written a toy reproducer here: https://github.com/dany74q/cgosymbolizer-deadlock

I don't really have a concrete suggestion. I've seen that both dl_iterate_phdr and pthread_mutex_lock have a history of complaints about not being async-signal-safe, and several suggestions for making them so were met with objections.

Would appreciate your input regardless, though.

Thanks!

Thanks for the note. You're right: this is a known problem with dl_iterate_phdr. I don't have a solution for it.

I'm running into a SIGSEGV when using cgosymbolizer with Go 1.17.5 and Ubuntu 18.04, which seems similar enough that it may be caused by the same problem. This is a Go application that uses C shared libraries through cgo. With profiling turned on, it often crashes with the following stack, which shows the traceback being invoked for SIGPROF. I have been unable to reproduce this separately, so I'm not entirely sure WHY this is crashing. If anyone has suggestions for things I could try to narrow down the cause, I'd be happy to try them.

Thread that got SIGSEGV, according to the core dump:

#0  x86_64_fallback_frame_state (context=0xc004a2b040, context=0xc004a2b040, fs=0xc004a2b130) at ./md-unwind-support.h:63
#1  uw_frame_state_for (context=context@entry=0xc004a2b040, fs=fs@entry=0xc004a2b130) at ../../../src/libgcc/unwind-dw2.c:1265
#2  0x00007f95bef73098 in _Unwind_Backtrace (trace=trace@entry=0x163d950 <unwind>, trace_argument=trace_argument@entry=0xc004a2b2f0) at ../../../src/libgcc/unwind.inc:302
#3  0x000000000163da7f in cgoTraceback (parg=0xc004a2b320, parg@entry=<error reading variable: value has been optimized out>) at traceback.c:82
#4  0x00000000016320b6 in x_cgo_callers (sig=27, info=0xc004a2b4b0, context=0xc004a2b380, cgoTraceback=<optimized out>, cgoCallers=<optimized out>, sigtramp=0x483040 <runtime.sigtramp>) at gcc_traceback.c:42
#5  <signal handler called>
#6  __lll_unlock_wake_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:341
#7  0x00007f95becb26c0 in __check_pf (seen_ipv4=0x7f94fa861162, seen_ipv6=0x7f94fa861163, in6ai=0x7f94fa861170, in6ailen=0x7f94fa861178) at ../sysdeps/unix/sysv/linux/check_pf.c:341
#8  0xa36bde066291cb00 in ?? ()
#9  0x0000011000000000 in ?? ()
#10 0x000000c003b30700 in ?? ()
#11 0x000000c01cc958c0 in ?? ()
#12 0x00007f94fa8611e0 in ?? ()
#13 0x000000c00f57c500 in ?? ()
#14 0xffffffffffffffff in ?? ()
#15 0x00007f94fa861620 in ?? ()
#16 0x00007f95bec78e75 in __libc_use_alloca (size=1168231104512) at ../sysdeps/pthread/allocalim.h:27
#17 __GI_getaddrinfo (name=<optimized out>, service=<optimized out>, hints=<optimized out>, pai=<error reading variable: Cannot access memory at address 0xfffffffffffffaf8>) at ../sysdeps/posix/getaddrinfo.c:2338
Backtrace stopped: Cannot access memory at address 0x8

@evanj Thanks for the note. That looks like a different problem to me. It looks like the signal context passed to x86_64_fallback_frame_state has an invalid value in the pc field. I don't know what would cause that to happen.

Hmm, thanks! After a second look, I wonder if I might be running into some variant of an issue with getaddrinfo (e.g. golang/go#30310). Across a couple of different core dumps, this crash always happens on a thread in the same location, calling getaddrinfo. It appears this Go program must be using the C DNS resolver for some reason, and I see a number of operating system threads calling into it at the time of the crash.

The workaround for us is to not import cgosymbolizer, so this isn't "critical", but I will try to poke at it some more, since having C stacks in the profiles is really useful!
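
One way to test the getaddrinfo suspicion above, sketched here as an untested idea rather than a verified fix, is to ask Go for its built-in resolver so that lookups no longer go through the C library. The helper name newGoResolver is made up for this example; net.Resolver and its PreferGo field are part of the standard library, and setting GODEBUG=netdns=go in the environment is another documented way to request the same behaviour.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newGoResolver (hypothetical helper) returns a resolver that asks for
// Go's built-in DNS client instead of the cgo path that calls getaddrinfo.
func newGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	addrs, err := newGoResolver().LookupHost(ctx, "localhost")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("resolved:", addrs)
}
```

If the crashes stop once no thread is inside getaddrinfo when SIGPROF arrives, that would point at the resolver path; it would not make the traceback itself any safer for other C code.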