Puma cluster not reaping child processes when PID is 1 with Puma 6.4.1
stanhu opened this issue
Describe the bug
We have a separate fleet of Puma workers to handle ActionCable, and since upgrading to 6.4.1 we have seen a significant increase in unhealthy pods and `can't alloc thread` error messages. In addition, the pods' readiness checks started to fail, causing Kubernetes to periodically shut down and restart the Puma container.
We rolled back to 6.4.0 and saw a dramatic drop in errors. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17372 for the full details.
Puma config:
# frozen_string_literal: true
# Load "path" as a rackup file.
#
# The default is "config.ru".
#
rackup '/srv/gitlab/config.ru'
pidfile "#{ENV['HOME']}/puma.pid"
state_path "#{ENV['HOME']}/puma.state"
stdout_redirect '/srv/gitlab/log/puma.stdout.log',
                '/srv/gitlab/log/puma.stderr.log',
                true
# Configure "min" to be the minimum number of threads to use to answer
# requests and "max" the maximum.
#
# The default is "0, 16".
#
threads (ENV['PUMA_THREADS_MIN'] ||= '1').to_i, (ENV['PUMA_THREADS_MAX'] ||= '16').to_i
# By default, workers accept all requests and queue them to pass to handlers.
# When false, workers accept the number of simultaneous requests configured.
#
# Queueing requests generally improves performance, but can cause deadlocks if
# the app is waiting on a request to itself. See https://github.com/puma/puma/issues/612
#
# When set to false this may require a reverse proxy to handle slow clients and
# queue requests before they reach puma. This is due to disabling HTTP keepalive
queue_requests false
# Bind the server to "url". "tcp://", "unix://" and "ssl://" are the only
# accepted protocols.
# We want to provide the ability to individually control HTTP (`INTERNAL_PORT`) and
# HTTPS (`SSL_INTERNAL_PORT`):
#
# 1. HTTP on, HTTPS on: Since `INTERNAL_PORT` is configured, we listen on it.
# 2. HTTP on, HTTPS off: If we don't specify either port, we default to HTTP
# because SSL requires a certificate and key to work.
# 3. HTTP off, HTTPS on: `SSL_INTERNAL_PORT` is enabled but
# `INTERNAL_PORT` is not set.
http_port = ENV['INTERNAL_PORT'] || '8080'
http_addr =
  if ENV['INTERNAL_PORT'] || (!ENV['INTERNAL_PORT'] && !ENV['SSL_INTERNAL_PORT'])
    "0.0.0.0"
  else
    # If HTTP is disabled, we still need to listen on 127.0.0.1 for health checks.
    "127.0.0.1"
  end
bind "tcp://#{http_addr}:#{http_port}"
if ENV['SSL_INTERNAL_PORT']
  ssl_params = {
    cert: ENV['PUMA_SSL_CERT'],
    key: ENV['PUMA_SSL_KEY'],
  }

  ssl_params[:ca] = ENV['PUMA_SSL_CLIENT_CERT'] if ENV['PUMA_SSL_CLIENT_CERT']
  ssl_params[:key_password_command] = ENV['PUMA_SSL_KEY_PASSWORD_COMMAND'] if ENV['PUMA_SSL_KEY_PASSWORD_COMMAND']
  ssl_params[:ssl_cipher_filter] = ENV['PUMA_SSL_CIPHER_FILTER'] if ENV['PUMA_SSL_CIPHER_FILTER']
  ssl_params[:verify_mode] = ENV['PUMA_SSL_VERIFY_MODE'] || 'none'

  ssl_bind '0.0.0.0', ENV['SSL_INTERNAL_PORT'], ssl_params
end
workers (ENV['WORKER_PROCESSES'] ||= '3').to_i
require "/srv/gitlab/lib/gitlab/cluster/lifecycle_events"
on_restart do
  # Signal application hooks that we're about to restart
  Gitlab::Cluster::LifecycleEvents.do_before_master_restart
end

before_fork do
  # Signal application hooks that we're about to fork
  Gitlab::Cluster::LifecycleEvents.do_before_fork
end

Gitlab::Cluster::LifecycleEvents.set_puma_options @config.options

on_worker_boot do
  # Signal application hooks of worker start
  Gitlab::Cluster::LifecycleEvents.do_worker_start
end

on_worker_shutdown do
  # Signal application hooks that a worker is shutting down
  Gitlab::Cluster::LifecycleEvents.do_worker_stop
end
# Preload the application before starting the workers; this conflicts with
# phased restart feature. (off by default)
preload_app!
tag 'gitlab-puma-worker'
# Verifies that all workers have checked in to the master process within
# the given timeout. If not the worker process will be restarted. Default
# value is 60 seconds.
#
worker_timeout (ENV['WORKER_TIMEOUT'] ||= '60').to_i
# https://github.com/puma/puma/blob/master/5.0-Upgrade.md#lower-latency-better-throughput
wait_for_less_busy_worker (ENV['PUMA_WAIT_FOR_LESS_BUSY_WORKER'] ||= '0.001').to_f
# Use json formatter
require "/srv/gitlab/lib/gitlab/puma_logging/json_formatter"
json_formatter = Gitlab::PumaLogging::JSONFormatter.new
log_formatter do |str|
  json_formatter.call(str)
end
require "/srv/gitlab/lib/gitlab/puma/error_handler"
error_handler = Gitlab::Puma::ErrorHandler.new(ENV['RAILS_ENV'] == 'production')
lowlevel_error_handler do |ex, env, status_code|
  error_handler.execute(ex, env, status_code)
end
Command line options:
/srv/gitlab/bin/bundle exec puma --environment production --config /srv/gitlab/config/puma.rb /srv/gitlab/config.ru
Example process list:
$ ps -ef
UID PID PPID C STIME TTY TIME CMD
git 1 0 0 20:07 ? 00:00:34 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git 41 1 0 20:07 ? 00:00:21 /usr/local/bin/gitlab-logger /var/log/gitlab
git 60 1 0 20:07 ? 00:00:47 ruby /srv/gitlab/bin/metrics-server
git 63 1 1 20:07 ? 00:00:54 puma: cluster worker 0: 1 [gitlab-puma-worker]
git 65 1 1 20:07 ? 00:00:55 puma: cluster worker 1: 1 [gitlab-puma-worker]
git 67 1 1 20:07 ? 00:00:59 puma: cluster worker 2: 1 [gitlab-puma-worker]
git 69 1 1 20:07 ? 00:00:55 puma: cluster worker 3: 1 [gitlab-puma-worker]
git 546 0 0 21:37 pts/0 00:00:00 bash
git 556 546 0 21:37 pts/0 00:00:00 ps -ef
Note we are running Puma as PID 1. I don't believe `--fork-worker` is being used.
To Reproduce
I'm working on reproduction steps right now. I suspect #3255 might have caused this issue. I didn't see `reaped unknown child process` messages for this ActionCable fleet, though I did see them in another fleet of workers that didn't appear to have increased error rates.
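For background on why running as PID 1 matters here: orphaned grandchildren are reparented to init, which inside the container is the Puma master itself, so it can legitimately reap PIDs it never forked (hence Puma's `reaped unknown child process` message). A minimal sketch, assuming it runs as PID 1 inside a container (it will raise Errno::ECHILD otherwise):

#!/bin/env ruby
# Fork a child that forks a grandchild and exits immediately,
# orphaning the grandchild onto init (this process, if PID 1).
child = fork do
  fork { sleep 0.2 } # grandchild
  exit!              # child dies; grandchild is reparented to PID 1
end
Process.wait(child)  # reap the direct child
sleep 1              # give the orphaned grandchild time to exit
# As PID 1, this can reap a process we never forked directly.
p Process.wait2(-1, Process::WNOHANG)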
One thing I observed is that previously `wait_workers` could call `Process.kill(0, w.pid)` to verify that each worker was still running (lines 515 to 523 in 52eff8d). Now that check only happens if `fork_worker` is enabled? (lines 565 to 573 in a287025)
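For reference, a minimal sketch of that kind of signal-0 liveness probe (not Puma's actual code; `worker_alive?` is a hypothetical helper):

def worker_alive?(pid)
  # Signal 0 delivers nothing but still performs the existence check.
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false # no such process
rescue Errno::EPERM
  true # the process exists but we lack permission to signal it
end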
@casperisfine @nateberkopec What do you think about reverting #3255 or putting it behind some configuration parameter?
Expected behavior
No errors.
- Linux x86 running Puma 6.4.1 inside Debian bookworm container
- Google Kubernetes Engine
On my test instance with Puma 6.4.1, I ran `kill -9 44`, and `puma: cluster worker 0` did not come back:
git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
UID PID PPID C STIME TTY TIME CMD
git 1 0 0 Jan09 ? 00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git 23 1 0 Jan09 ? 00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
git 41 1 0 Jan09 ? 00:01:55 ruby /srv/gitlab/bin/metrics-server
git 44 1 0 Jan09 ? 00:02:41 [ruby] <defunct>
git 46 1 0 Jan09 ? 00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
git 48 1 0 Jan09 ? 00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
git 49 1 0 Jan09 ? 00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
git 5205 0 0 21:57 pts/0 00:00:00 bash
git 5331 5205 0 22:00 pts/0 00:00:00 ps -ef
With Puma 6.4.0, that worked fine:
git@gitlab-webservice-default-78664bb757-97skg:/$ ps -ef
UID PID PPID C STIME TTY TIME CMD
git 1 0 71 22:06 ? 00:00:36 puma 6.4.0 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git 22 1 0 22:06 ? 00:00:00 /usr/local/bin/gitlab-logger /var/log/gitlab
git 36 0 0 22:07 pts/0 00:00:00 bash
git 65 1 22 22:07 ? 00:00:02 ruby /srv/gitlab/bin/metrics-server
git 68 1 22 22:07 ? 00:00:02 puma: cluster worker 0: 1 [gitlab-puma-worker]
git 70 1 22 22:07 ? 00:00:02 puma: cluster worker 1: 1 [gitlab-puma-worker]
git 72 1 22 22:07 ? 00:00:02 puma: cluster worker 2: 1 [gitlab-puma-worker]
git 74 1 22 22:07 ? 00:00:02 puma: cluster worker 3: 1 [gitlab-puma-worker]
git 148 36 0 22:07 pts/0 00:00:00 ps -ef
git@gitlab-webservice-default-78664bb757-97skg:/$ kill -9 68
git@gitlab-webservice-default-78664bb757-97skg:/$ ps -ef
UID PID PPID C STIME TTY TIME CMD
git 1 0 66 22:06 ? 00:00:36 puma 6.4.0 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git 22 1 0 22:06 ? 00:00:00 /usr/local/bin/gitlab-logger /var/log/gitlab
git 36 0 0 22:07 pts/0 00:00:00 bash
git 65 1 16 22:07 ? 00:00:02 ruby /srv/gitlab/bin/metrics-server
git 70 1 16 22:07 ? 00:00:02 puma: cluster worker 1: 1 [gitlab-puma-worker]
git 72 1 17 22:07 ? 00:00:02 puma: cluster worker 2: 1 [gitlab-puma-worker]
git 74 1 16 22:07 ? 00:00:02 puma: cluster worker 3: 1 [gitlab-puma-worker]
git 149 1 19 22:07 ? 00:00:00 puma: cluster worker 0: 1 [gitlab-puma-worker]
git 165 36 0 22:07 pts/0 00:00:00 ps -ef
I added debugging messages, and it seems that `Process.wait2(-1, Process::WNOHANG)` doesn't return anything when I run `kill <PID of worker>`. The process is in the `defunct` state, so I'm a bit surprised that didn't work.
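For clarity, these are the expected semantics of that non-blocking call (a standalone sketch, not code from the issue). A zombie (`defunct`) child should normally be returned immediately, which is why getting nothing back is surprising:

begin
  result = Process.wait2(-1, Process::WNOHANG)
  # => [pid, status] if some child has exited (including zombies),
  # => nil if children exist but none have exited yet
  puts result ? "reaped #{result.first}" : "no exited children yet"
rescue Errno::ECHILD
  puts "no child processes at all"
end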
I applied this patch to get things working again:
diff --git a/lib/puma/cluster.rb b/lib/puma/cluster.rb
index 0d7c12bd..05d58445 100644
--- a/lib/puma/cluster.rb
+++ b/lib/puma/cluster.rb
@@ -562,7 +562,7 @@ module Puma
         begin
           # When `fork_worker` is enabled, some worker may not be direct children, but grand children.
           # Because of this they won't be reaped by `Process.wait2(-1)`, so we need to check them individually)
-          if reaped_children.delete(w.pid) || (@options[:fork_worker] && Process.wait(w.pid, Process::WNOHANG))
+          if reaped_children.delete(w.pid) || Process.wait(w.pid, Process::WNOHANG)
             true
           else
             w.term if w.term?
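With this change, the per-worker non-blocking `Process.wait(w.pid, Process::WNOHANG)` check runs for every configuration rather than only under `fork_worker`, so a worker whose exit status was missed by `Process.wait2(-1, ...)` is still detected and replaced.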
I should note that on Linux Docker, PID 1 seems to work fine:
#!/bin/env ruby

fork do
  loop { sleep 1 }
end

loop do
  puts Process.wait2(-1, Process::WNOHANG)
  sleep 1
end
My Dockerfile:
FROM ruby:3.1
COPY listen.rb .
RUN chmod +x listen.rb
ENTRYPOINT ["/listen.rb"]
If I run this container and forcibly kill the child:
% docker exec -it 3b53dc0dcbd5 bash
root@3b53dc0dcbd5:/# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 23:36 pts/0 00:00:00 ruby /listen.rb
root 7 1 0 23:36 pts/0 00:00:00 ruby /listen.rb
root 8 0 0 23:36 pts/1 00:00:00 bash
root 14 8 0 23:36 pts/1 00:00:00 ps -ef
root@3b53dc0dcbd5:/# kill 7
root@3b53dc0dcbd5:/# %
I see:
7
pid 7 SIGTERM (signal 15)
/listen.rb:8:in `wait2': No child processes (Errno::ECHILD)
from /listen.rb:8:in `block in <main>'
from /listen.rb:7:in `loop'
from /listen.rb:7:in `<main>'
Strangely, this worked fine with Kubernetes as well. I repeated the test above in a Google Kubernetes Engine pod:
% kubectl run listen-test --image=registry.gitlab.com/stanhu/lfs-test/listen-test:latest
pod/listen-test created
% kubectl exec -it listen-test -- bash
root@listen-test:/# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 05:44 ? 00:00:00 ruby /listen.rb
root 7 1 0 05:44 ? 00:00:00 ruby /listen.rb
root 21 0 0 05:46 pts/0 00:00:00 bash
root 27 21 0 05:46 pts/0 00:00:00 ps -ef
root@listen-test:/# kill 7
bash: kill: (7) - No such process
root@listen-test:/# command terminated with exit code 137
With `kubectl logs -f listen-test` running, I see:
7
pid 7 SIGTERM (signal 15)
/listen.rb:8:in `wait2': No child processes (Errno::ECHILD)
from /listen.rb:8:in `block in <main>'
from /listen.rb:7:in `loop'
from /listen.rb:7:in `<main>'
I wonder what's different about Puma.
Still can't replicate this issue with a simple pod running Puma:
Dockerfile.puma
FROM ruby:3.1
RUN gem install puma:6.4.1
COPY hello.ru .
ENTRYPOINT ["puma", "hello.ru", "-w", "2"]
hello.ru
hdrs = {'Content-Type'.freeze => 'text/plain'.freeze}.freeze
body = ['Hello World'.freeze].freeze
run lambda { |env| [200, hdrs, body] }
`Process.wait2(-1, Process::WNOHANG)` is working fine in a pod as PID 1. I tried with a non-root user as well.
Ok, our application has its own process supervisor that spawns a Prometheus metrics web server. If I disable that, for some reason `Process.wait2(-1, Process::WNOHANG)` works and reaps the processes properly.
- https://gitlab.com/gitlab-org/gitlab/-/blob/f9fbbadece7f62f10aac9d5fcc1401464bff1d7c/lib/gitlab/process_supervisor.rb
- https://gitlab.com/gitlab-org/gitlab/-/blob/f9fbbadece7f62f10aac9d5fcc1401464bff1d7c/lib/gitlab/process_management.rb
- https://gitlab.com/gitlab-org/gitlab/-/blob/f9fbbadece7f62f10aac9d5fcc1401464bff1d7c/metrics_server/metrics_server.rb#L46-48
Most likely we're trapping `SIGCHLD` and interfering with the `wait`.
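A hypothetical illustration of the suspected interference (not GitLab's actual supervisor code): a handler that eagerly reaps children consumes their exit statuses before any other waiter can see them.

trap(:CHLD) do
  # Eagerly reap every exited child inside the handler. Each status
  # consumed here is invisible to any other waiter in the process,
  # e.g. a Process.wait2(-1, Process::WNOHANG) loop in Puma's master.
  loop { break unless Process.wait(-1, Process::WNOHANG) }
rescue Errno::ECHILD
  # no children left to reap
end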
> Most likely we're trapping `SIGCHLD` and interfering with the `wait`.
I only skimmed the code you linked quickly (I have hundreds of emails to catch up on today), but this sounds weird. I'm not sure why trapping `SIGCHLD` would make the `wait` fail. But I suppose at this stage it's best to try to come up with a smaller repro so we can better understand what's going on, and see what we could do to make this more resilient.
Yeah, I don't see any evidence we're trapping `SIGCHLD`, and I've tried to add signal handlers to see if that changes anything. I can't reproduce the problem yet.
I did notice that the Ruby implementation for `Process.wait2` seems to mention `SIGCHLD` for some reason: https://github.com/ruby/ruby/blob/50c6cabadca44b7b034eae5dcc8017154a2858bd/process.c#L1343-L1348
Interesting, that was removed in 3.3: ruby/ruby#7527
Seems like it was converting a blocking `waitpid` into a non-blocking one by waiting for `SIGCHLD`. That sounds quite brittle to me, but I don't have the full context.
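To make that pattern concrete, here is a rough Ruby rendition of the described approach; the real implementation is C inside the interpreter, so this is only an illustration of the idea:

def wait_via_sigchld(pid)
  r, w = IO.pipe
  # Self-pipe trick: wake the waiter whenever a SIGCHLD arrives.
  # (Note: trap installs a process-global handler.)
  trap(:CHLD) { w.write_nonblock("x", exception: false) }
  loop do
    # Poll non-blockingly, relying on SIGCHLD to know when to re-poll.
    result = Process.wait2(pid, Process::WNOHANG)
    return result if result
    IO.select([r])                        # park until a signal arrives
    r.read_nonblock(64, exception: false) # drain the wakeup bytes
  end
end

If a SIGCHLD is ever lost or consumed elsewhere, the loop parks indefinitely, which is presumably the brittleness being described.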
Ok, it looks like in Ruby 3.1 and 3.2, `Process.detach(<some PID != 1>)` appears to prevent `Process.wait2(-1, Process::WNOHANG)` from finding child processes when the parent PID is 1.
The problem doesn't happen in Ruby 3.3. I wonder if ruby/ruby#7476 or ruby/ruby#7527 fixed this.
I'll update the comments in #3314 in light of this, so I think that pull request is still a good idea.
Here's a sample reproduction:
Dockerfile
FROM ruby:3.2
COPY listen.rb .
COPY test.sh .
RUN chmod +x listen.rb
RUN chmod +x test.sh
ENTRYPOINT ["/listen.rb"]
listen.rb
#!/bin/env ruby

fork do
  loop { sleep 1 }
end

Process.spawn({}, "./test.sh", err: $stderr, out: $stdout, pgroup: true).tap do |pid|
  STDERR.puts "detaching PID #{pid}"
  Process.detach(pid)
end

loop do
  STDERR.puts Process.wait2(-1, Process::WNOHANG)
  sleep 1
end
test.sh
#!/bin/sh
sleep 600
It appears that `Process.detach` simply spawns a separate thread that does a blocking `waitpid`. I think this uses the `SIGCHLD` implementation introduced in ruby/ruby@054a412. This comment in the commit message is telling:
> We also work to suppress false-positives from Process.wait(-1, Process::WNOHANG) to quiets warnings from spec/ruby/core/process/wait2_spec.rb with MJIT enabled.
This makes me think that this code is suppressing child PIDs when the parent is PID 1.
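For context, `Process.detach` behaves roughly like this simplified sketch (`my_detach` is a hypothetical stand-in; the real method also names the thread and exposes the pid):

def my_detach(pid)
  Thread.new do
    # Block until the child exits so it never lingers as a zombie;
    # the thread's value is the exit status, as with Process.detach.
    Process.waitpid2(pid).last
  rescue Errno::ECHILD
    nil # someone else reaped the child first
  end
end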
I've confirmed that this Ruby `SIGCHLD` business is responsible. I disabled `WAITPID_USE_SIGCHLD` in my Ruby 3.1.4 interpreter, and `Process.wait(-1, Process::WNOHANG)` started working again:
diff --git a/vm_core.h b/vm_core.h
index 1cc0659700..0e7d1643fe 100644
--- a/vm_core.h
+++ b/vm_core.h
@@ -126,7 +126,7 @@
 #endif

 /* define to 0 to test old code path */
-#define WAITPID_USE_SIGCHLD (RUBY_SIGCHLD || SIGCHLD_LOSSY)
+#define WAITPID_USE_SIGCHLD 0

 #if defined(SIGSEGV) && defined(HAVE_SIGALTSTACK) && defined(SA_SIGINFO) && !defined(__NetBSD__)
 # define USE_SIGALTSTACK
It appears that only `Process.detach` is needed; PID 1 is not relevant. This Ruby script will get stuck in Ruby 3.1.4 and 3.2.2, but exits immediately in Ruby 3.3.0:
#!/bin/env ruby

forked_pid = fork do
  loop { sleep 1 }
end

Process.spawn({}, "sh -c 'sleep 60'", err: $stderr, out: $stdout).tap do |pid|
  puts "detaching PID #{pid}"
  Process.detach(pid)
end

child_waiter = Thread.new do
  puts "Waiting for child process to die..."
  # This works
  # puts Process.wait2(forked_pid)
  # This fails in Ruby 3.1 and 3.2
  puts Process.wait2(-1)
end

process_killer = Thread.new do
  puts "Killing #{forked_pid}"
  system("kill #{forked_pid}")
end

child_waiter.join
process_killer.join
@stanhu I just realised I might be experiencing déjà vu with this thing. I have a bunch of notes/links at https://github.com/dentarg/gists/tree/master/gists/ruby-bug-15499#ruby--puma-bug about "the ruby 2.6.0 wait bug", some comments:
Looks like #1741 implemented a workaround. Has something changed in Ruby yet again?
@dentarg Interesting! Given that ruby/ruby@054a412 was introduced in Ruby 2.6, I wonder if this broke `Process.waitpid` in a number of situations. With Ruby 3.3, that `SIGCHLD` implementation is gone, so I wonder if all these `wait`-related issues can be fixed without workarounds.
I see https://github.com/puma/puma/pull/1741/files#r266122715 mentions `Process.waitpid(-1, Process::WNOHANG)` was not working, and #3255 introduced this in Puma v6.4.1. This seems to work okay until you use `Process.detach`, but maybe there are more situations where it doesn't work.
I created https://bugs.ruby-lang.org/issues/20181, but I noticed https://bugs.ruby-lang.org/issues/19322 mentions this summary:
- Programs doing waitpid -1 are bad and wrong, nobody should ever do that, if any code in your program does this anywhere, then Ruby should no longer make any guarantees about subprocess management working correctly in the entire process.
- Programs doing waitpid -1 are widely deployed, it would be good if, when writing gems, there were APIs we could use which offer better isolation and composability than the classic unix APIs, so that our gems work no matter what their containing processes are doing.
- Gems should never be spawning child processes anyway.
> @dentarg Interesting! Given that ruby/ruby@054a412 was introduced in Ruby 2.6, I wonder if this broke `Process.waitpid` in a number of situations. With Ruby 3.3, that `SIGCHLD` implementation is gone, so I wonder if all these `wait`-related issues can be fixed without workarounds.
Found another thing that probably relates to this? Just wanted to connect the dots.
- https://github.com/puma/puma/pull/2767/files#r872970473
  > Well, I'm stuck. First of all, I don't like this dirty hack. Besides that, it goes into an infinity loop on the SIGCHLD, I don't know exactly why.
- https://github.com/puma/puma/pull/2767/files#r873130086
  > I've just tested this hack independently of puma and it works fine on ruby 2.5 and fails on all next versions
This bug was actually reported in https://bugs.ruby-lang.org/issues/19837 and fixed in the `ruby_3_2` and `ruby_3_1` stable branches, but there has yet to be a release with the fix.