Manager can send SIGQUIT to worker while exiting, causing coredump
brsakai-csco opened this issue
- Mojolicious version: 9.22, Waffle
- Perl version: v5.34.0
- Operating system: Ubuntu 22.04.2 LTS
Steps to reproduce the behavior
- Create and start a simple Hypnotoad app
- Send a `SIGQUIT` to a worker process to end it gracefully
- The worker's manager will send a second `SIGQUIT`
- Depending on timing, the worker will reset its signal handlers to `SIG_DFL` before catching the second `SIGQUIT`, causing it to dump core
In more detail:
Unpack the attached tarfile, which has my app, and update `mojo_server_wrapper.pl` to have the correct `$ENV{REPRO_PATH}` (you could most likely use any Hypnotoad/Prefork app, but I figured a working example would be best)
Start the app from terminal (A)
brsakai@brsakai-1151:~/hypno$ ./mojo_server_wrapper.pl
[2023-03-09 19:03:32.09625] [29198] [info] Listening at "http://127.0.0.1:8080"
Web application available at http://127.0.0.1:8080
[2023-03-09 19:03:32.09659] [29198] [info] Manager 29198 started
[2023-03-09 19:03:32.09910] [29199] [info] Worker 29199 started
[2023-03-09 19:03:32.10025] [29200] [info] Worker 29200 started
[2023-03-09 19:03:32.10118] [29201] [info] Worker 29201 started
[2023-03-09 19:03:32.10202] [29202] [info] Worker 29202 started
[2023-03-09 19:03:32.10239] [29198] [info] Creating process id file "/tmp/mojo_server.pid"
[2023-03-09 19:03:32.10248] [29203] [info] Worker 29203 started
`strace` one of the workers from another terminal (B)
brsakai@brsakai-1151:~$ sudo strace -p 29199
strace: Process 29199 attached
epoll_wait(8,
Issue a `kill -QUIT` from a third terminal (C)
brsakai@brsakai-1151:~$ kill -QUIT 29199
Observe that the kill signal is received in the strace output (note that two `SIGQUIT`s are received: one from terminal (C) and one from the manager)
brsakai@brsakai-1151:~$ sudo strace -p 29199
strace: Process 29199 attached
epoll_wait(8, 0x561e175cd970, 64, 3408) = -1 EINTR (Interrupted system call)
--- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=29085, si_uid=1000} --- <<< We catch the SIGQUIT from terminal (C) and start to exit
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
epoll_wait(8, [], 64, 904) = 0
rt_sigprocmask(SIG_BLOCK, [QUIT], [], 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [QUIT], NULL, 8) = 0
getpid() = 29199
write(5, "29199:1\n", 8) = 8
rt_sigprocmask(SIG_BLOCK, [TTOU], [], 8) = 0
rt_sigaction(SIGTTOU, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [TTIN], [], 8) = 0
rt_sigaction(SIGTTIN, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [QUIT], [], 8) = 0
rt_sigaction(SIGQUIT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, {sa_handler=0x561e14f54450, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f87dc42e520}, 8) = 0 <<< SIGQUIT handler is uninstalled
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=29198, si_uid=1000} --- <<< We catch the SIGQUIT from the manager, and coredump
+++ killed by SIGQUIT (core dumped) +++
The manager will print status information about the killed worker
brsakai@brsakai-1151:~/hypno$ PERL5LIB=/home/brsakai/hypno ./mojo_server_wrapper.pl
[2023-03-09 19:03:32.09625] [29198] [info] Listening at "http://127.0.0.1:8080"
Web application available at http://127.0.0.1:8080
[2023-03-09 19:03:32.09659] [29198] [info] Manager 29198 started
[2023-03-09 19:03:32.09910] [29199] [info] Worker 29199 started
[2023-03-09 19:03:32.10025] [29200] [info] Worker 29200 started
[2023-03-09 19:03:32.10118] [29201] [info] Worker 29201 started
[2023-03-09 19:03:32.10202] [29202] [info] Worker 29202 started
[2023-03-09 19:03:32.10239] [29198] [info] Creating process id file "/tmp/mojo_server.pid"
[2023-03-09 19:03:32.10248] [29203] [info] Worker 29203 started
[2023-03-09 19:04:17.10105] [29198] [info] Stopping worker 29199 gracefully (30 seconds)
[2023-03-09 19:04:17.10365] [29210] [info] Worker 29210 started
[2023-03-09 19:04:17.21945] [29198] [info] Worker 29199 stopped
Expected behavior
The app should not generate a coredump when it is sent a graceful shutdown signal
Actual behavior
The second `SIGQUIT` from the manager process is being caught by `SIG_DFL`, creating a lot of cores and filling disk space on my server
Open comments
Issue first seen on Supervillain in our product
# mojo version
CORE
Perl (v5.32.1, linux)
Mojolicious (8.09, Supervillain)
OPTIONAL
Cpanel::JSON::XS 4.04+ (4.08)
EV 4.0+ (4.17)
IO::Socket::Socks 0.64+ (0.74)
IO::Socket::SSL 2.009+ (2.067)
Net::DNS::Native 0.15+ (0.18)
Role::Tiny 2.000001+ (2.000006)
You might want to update your Mojolicious to 9.31!
Reproduced on Ubuntu in a fresh install
brsakai@brsakai-1151:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.2 LTS
Release: 22.04
Codename: jammy
brsakai@brsakai-1151:~$ # Installed via sudo apt install libmojolicious-perl
brsakai@brsakai-1151:~$ mojo version
CORE
Perl (v5.34.0, linux)
Mojolicious (9.22, Waffle)
OPTIONAL
Cpanel::JSON::XS 4.09+ (4.27)
EV 4.32+ (4.33)
IO::Socket::Socks 0.64+ (0.74)
IO::Socket::SSL 2.009+ (2.074)
Net::DNS::Native 0.15+ (n/a)
Role::Tiny 2.000001+ (2.002004)
Future::AsyncAwait 0.52+ (0.56)
You might want to update your Mojolicious to 9.31!
I suspect that this is due to Perl deregistering all signal handlers when `exit` is called at the end of `Mojo::Server::Prefork::_spawn`, so the second `SIGQUIT` is being caught by the default handler (core dump) instead of the Perl handler (graceful stop). This is a little racy, since we can sometimes catch the `SIGQUIT` from the manager before the signal handler is uninstalled, so you might need to try the repro a couple of times.
Possible suggestions:
1. The manager could avoid sending a second `SIGQUIT` (not sure why this is triggered)
2. We could switch the "graceful shutdown" signal to `SIGTERM`, or something else that does not create coredumps when the signal handlers are deregistered at exit
3. Add some sort of delay in `_spawn` so we catch any late signals before uninstalling the signal handlers during `exit`
I was able to prevent this behavior with the following patch
--- Server/Prefork.pm.bak 2023-03-21 16:51:31.027344078 +0000
+++ Server/Prefork.pm 2023-03-21 16:54:24.053345048 +0000
@@ -194,6 +194,7 @@
next unless my $w = $self->{pool}{$1};
@$w{qw(healthy time)} = (1, $time) and $self->emit(heartbeat => $1);
$w->{graceful} ||= $time if $2;
+ $w->{quit}++ if $2;
}
}
As I understand it, the control flow here is:
1. The Mojo server creates a pipe in `Mojo::Server::Prefork::run`, which is used for heartbeats between the manager and worker
2. `Mojo::Server::Prefork::_spawn` sets the signal handler for `SIGQUIT` to `$self->ioloop->stop_gracefully`
3. Upon receipt of `SIGQUIT`, the worker emits the finish event from `Mojo::IOLoop::stop_gracefully`
4. Upon receipt of the finish event (from itself? Not clear), the worker sets `$finished` to 1 in `Mojo::Server::Prefork::_spawn`
5. The worker sends a heartbeat from `Mojo::Server::Prefork::_heartbeat` with `$finished` set to 1
6. The manager receives the heartbeat message and sets `$self->{pool}{$pid}->{graceful}` equal to the worker's `$finished` in `Mojo::Server::Prefork::_wait`
7. The manager sends a `SIGQUIT` to the worker
8. The manager sets `$self->{pool}{$pid}->{quit}` to nonzero (preventing further `SIGQUIT`s)
This patch "combines" steps 7 and 8, so that `$self->{pool}{$pid}->{quit}` is updated when the manager is notified of the graceful shutdown. This builds in the following assumptions:
1. There is no scenario in which the worker reports a graceful shutdown via its heartbeat but still needs a `SIGQUIT` to actually shut down
2. The `{quit}` field isn't used for anything else
Assumption (2) appears to be fine, but I'm not sure about (1); I would appreciate some input from you folks on that point
This corresponds to (1) in the Possible Suggestions from the original comment
This does indeed seem like the correct solution for the problem from #1883. Please make this a PR.