Manager can send SIGQUIT to worker while exiting, causing coredump
brsakai-csco opened this issue
- Mojolicious version: 9.22, Waffle
- Perl version: v5.34.0
- Operating system: Ubuntu 22.04.2 LTS
Steps to reproduce the behavior
- Create and start a simple Hypnotoad app
- Send a `SIGQUIT` to a worker process to end it gracefully
- The worker's manager will send a second `SIGQUIT`
- Depending on timing, the worker will reset its signal handlers to `SIG_DFL` before catching the second `SIGQUIT`, causing it to dump core
In more detail:
Unpack the attached tarfile, which has my app, and update `mojo_server_wrapper.pl` to have the correct `$ENV{REPRO_PATH}` (you could most likely use any Hypnotoad/Prefork app, but I figured a working example would be best)
Start the app from terminal (A)
brsakai@brsakai-1151:~/hypno$ ./mojo_server_wrapper.pl
[2023-03-09 19:03:32.09625] [29198] [info] Listening at "http://127.0.0.1:8080"
Web application available at http://127.0.0.1:8080
[2023-03-09 19:03:32.09659] [29198] [info] Manager 29198 started
[2023-03-09 19:03:32.09910] [29199] [info] Worker 29199 started
[2023-03-09 19:03:32.10025] [29200] [info] Worker 29200 started
[2023-03-09 19:03:32.10118] [29201] [info] Worker 29201 started
[2023-03-09 19:03:32.10202] [29202] [info] Worker 29202 started
[2023-03-09 19:03:32.10239] [29198] [info] Creating process id file "/tmp/mojo_server.pid"
[2023-03-09 19:03:32.10248] [29203] [info] Worker 29203 started
`strace` one of the workers from another terminal (B)
brsakai@brsakai-1151:~$ sudo strace -p 29199
strace: Process 29199 attached
epoll_wait(8,
Issue a `kill -QUIT` from a third terminal (C)
brsakai@brsakai-1151:~$ kill -QUIT 29199
Observe that the kill signal is received in the strace output (note that two `SIGQUIT`s are received: one from terminal (C) and one from the manager)
brsakai@brsakai-1151:~$ sudo strace -p 29199
strace: Process 29199 attached
epoll_wait(8, 0x561e175cd970, 64, 3408) = -1 EINTR (Interrupted system call)
--- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=29085, si_uid=1000} --- <<< We catch the SIGQUIT from terminal (C) and start to exit
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
epoll_wait(8, [], 64, 904) = 0
rt_sigprocmask(SIG_BLOCK, [QUIT], [], 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [QUIT], NULL, 8) = 0
getpid() = 29199
write(5, "29199:1\n", 8) = 8
rt_sigprocmask(SIG_BLOCK, [TTOU], [], 8) = 0
rt_sigaction(SIGTTOU, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [TTIN], [], 8) = 0
rt_sigaction(SIGTTIN, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [QUIT], [], 8) = 0
rt_sigaction(SIGQUIT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, {sa_handler=0x561e14f54450, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, 8) = 0 <<< SIGQUIT handler is uninstalled
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=29198, si_uid=1000} --- <<< We catch the SIGQUIT from the manager, and coredump
+++ killed by SIGQUIT (core dumped) +++
The manager will print status information about the killed worker
brsakai@brsakai-1151:~/hypno$ PERL5LIB=/home/brsakai/hypno ./mojo_server_wrapper.pl
[2023-03-09 19:03:32.09625] [29198] [info] Listening at "http://127.0.0.1:8080"
Web application available at http://127.0.0.1:8080
[2023-03-09 19:03:32.09659] [29198] [info] Manager 29198 started
[2023-03-09 19:03:32.09910] [29199] [info] Worker 29199 started
[2023-03-09 19:03:32.10025] [29200] [info] Worker 29200 started
[2023-03-09 19:03:32.10118] [29201] [info] Worker 29201 started
[2023-03-09 19:03:32.10202] [29202] [info] Worker 29202 started
[2023-03-09 19:03:32.10239] [29198] [info] Creating process id file "/tmp/mojo_server.pid"
[2023-03-09 19:03:32.10248] [29203] [info] Worker 29203 started
[2023-03-09 19:04:17.10105] [29198] [info] Stopping worker 29199 gracefully (30 seconds)
[2023-03-09 19:04:17.10365] [29210] [info] Worker 29210 started
[2023-03-09 19:04:17.21945] [29198] [info] Worker 29199 stopped
Expected behavior
The app should not generate a coredump when it is sent a graceful shutdown signal
Actual behavior
The second `SIGQUIT` from the manager process is being caught by `SIG_DFL`, creating a lot of cores and filling disk space on my server
Open comments
Issue first seen on Supervillain in our product
# mojo version
CORE
Perl (v5.32.1, linux)
Mojolicious (8.09, Supervillain)
OPTIONAL
Cpanel::JSON::XS 4.04+ (4.08)
EV 4.0+ (4.17)
IO::Socket::Socks 0.64+ (0.74)
IO::Socket::SSL 2.009+ (2.067)
Net::DNS::Native 0.15+ (0.18)
Role::Tiny 2.000001+ (2.000006)
You might want to update your Mojolicious to 9.31!
Reproduced on Ubuntu in a fresh install
brsakai@brsakai-1151:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.2 LTS
Release: 22.04
Codename: jammy
brsakai@brsakai-1151:~$ # Installed via sudo apt install libmojolicious-perl
brsakai@brsakai-1151:~$ mojo version
CORE
Perl (v5.34.0, linux)
Mojolicious (9.22, Waffle)
OPTIONAL
Cpanel::JSON::XS 4.09+ (4.27)
EV 4.32+ (4.33)
IO::Socket::Socks 0.64+ (0.74)
IO::Socket::SSL 2.009+ (2.074)
Net::DNS::Native 0.15+ (n/a)
Role::Tiny 2.000001+ (2.002004)
Future::AsyncAwait 0.52+ (0.56)
You might want to update your Mojolicious to 9.31!
I suspect that this is due to Perl deregistering all signal handlers when `exit` is called at the end of `Mojo::Server::Prefork::_spawn`, so the second `SIGQUIT` is being caught by the default handler (core dump) instead of the Perl handler (graceful stop). This is a little racy, since we can sometimes catch the `SIGQUIT` from the manager before the signal handler is uninstalled, so you might need to try the repro a couple of times.
Possible suggestions:
1. The manager could avoid sending a second `SIGQUIT` (not sure why this is triggered)
2. We could switch the "graceful shutdown" signal to `SIGTERM`, or something else that does not create coredumps when the signal handlers are deregistered at exit
3. Add some sort of delay in `_spawn` so we catch any late signals before uninstalling the signal handlers during `exit`
I was able to prevent this behavior with the following patch
--- Server/Prefork.pm.bak 2023-03-21 16:51:31.027344078 +0000
+++ Server/Prefork.pm 2023-03-21 16:54:24.053345048 +0000
@@ -194,6 +194,7 @@
next unless my $w = $self->{pool}{$1};
@$w{qw(healthy time)} = (1, $time) and $self->emit(heartbeat => $1);
$w->{graceful} ||= $time if $2;
+ $w->{quit}++ if $2;
}
}
As I understand it, the control flow here is:
1. The Mojo server creates a pipe in `Mojo::Server::Prefork::run`, which is used for heartbeats between the manager and worker
2. `Mojo::Server::Prefork::_spawn` sets the signal handler for `SIGQUIT` to `$self->ioloop->stop_gracefully`
3. Upon receipt of `SIGQUIT`, the worker emits the finish event from `Mojo::IOLoop::stop_gracefully`
4. Upon receipt of the finish event (from itself? Not clear), the worker sets `$finished` to 1 in `Mojo::Server::Prefork::_spawn`
5. The worker sends a heartbeat from `Mojo::Server::Prefork::_heartbeat` with `$finished` set to 1
6. The manager receives the heartbeat message and sets `$self->{pool}{$pid}->{graceful}` equal to the worker's `$finished` in `Mojo::Server::Prefork::_wait`
7. The manager sends a `SIGQUIT` to the worker
8. The manager sets `$self->{pool}{$pid}->{quit}` to nonzero (preventing further `SIGQUIT`s)
This patch "combines" steps 7 and 8, so that `$self->{pool}{$pid}->{quit}` is updated when the manager is notified of the graceful shutdown. This builds in the following assumptions:
1. There is no scenario in which the worker reports a graceful shutdown via its heartbeat but still needs a `SIGQUIT` to actually shut down
2. The `{quit}` field isn't used for anything else
Assumption (2) appears to be fine, but I'm not sure about (1); I would appreciate some input from you folks on that point
This corresponds to (1) in the Possible Suggestions from the original comment
This does indeed seem like the correct solution for the problem from #1883. Please make this a PR.