puma / puma

A Ruby/Rack web server built for parallelism

Home Page: https://puma.io

Large number of workers boot too slowly or do not boot the first time

snowboy932 opened this issue

Describe the bug
I have some machines with a 64C/128T CPU and 128GB RAM. I'm having trouble running a Rails application with Puma on them (in Docker, of course). When the number of workers exceeds 50, Puma gets stuck while booting: not all workers boot the first time, and I need to restart the Docker container once or twice to get it to boot properly.
When the issue occurs, the Puma control server shows, for example, 27 workers, and even those are not reported as booted. htop also shows only 27 workers, so it's not a problem with the control server metrics. Despite all of this, Puma is able to accept incoming connections.
No errors appear during a "bad boot", even at debug log level.

Puma config:

queue_requests false

preload_app!

before_fork do
  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.connection_pool.disconnect!
  end
end

after_worker_fork do
  require 'prometheus_exporter/instrumentation'
  PrometheusExporter::Instrumentation::Process.start(type: 'web')
end

after_worker_boot do
  require 'prometheus_exporter/instrumentation'
  PrometheusExporter::Instrumentation::Puma.start unless PrometheusExporter::Instrumentation::Puma.started?
end

require 'myapplication'

on_worker_boot do
  if defined?(ActiveRecord::Base)
    ActiveSupport.on_load(:active_record) do
      ActiveRecord::Base.establish_connection
    end

    PrometheusExporter::Instrumentation::Puma.start unless PrometheusExporter::Instrumentation::Puma.started?

    require 'prometheus_exporter/instrumentation'
    PrometheusExporter::Instrumentation::ActiveRecord.start(
      custom_labels: { type: 'puma_worker' },
      config_labels: %i[database host],
    )
  end
end

I'm launching Puma from a docker_entrypoint.rb file, via the function in it described below:

def web_server(app)
  socket_backlog = ENV.fetch('SOCKET_BACKLOG')
  port = ENV.fetch('PORT')
  default_bind = "tcp://0.0.0.0:#{port}?backlog=#{socket_backlog}"
  default_control_url = 'tcp://127.0.0.1:9293'
  default_control_token = 'DefaultPumaControlToken'

  run_command_within(
    app,
    'bundle exec puma',
    '-t :threads::threads -w :workers -e :env -b :bind --control-url :control_url --control-token :control_token',
    threads: ENV.fetch('THREADS'),
    workers: ENV.fetch('WORKERS'),
    env: ENV.fetch('RAILS_ENV'),
    bind: ENV.fetch('BIND', default_bind),
    control_url: ENV.fetch('PUMA_CONTROL_URL', default_control_url),
    control_token: ENV.fetch('PUMA_CONTROL_TOKEN', default_control_token),
  )
end

# <...>

def run_command_within(app, command, options_string = '', options = {})
  validate_app(app)

  command = Cocaine::CommandLine.new(
    command,
    options_string,
  ).command(options)

  Dir.chdir("apps/#{app}")
  exec(command)
end
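
For illustration, here's roughly what that interpolation produces. The environment values below are hypothetical, just to show the shape of the final command that exec() receives; the exact quoting depends on Cocaine's escaping:

require 'cocaine'

# Hypothetical values (THREADS=16, WORKERS=100, etc.) purely for illustration.
line = Cocaine::CommandLine.new(
  'bundle exec puma',
  '-t :threads::threads -w :workers -e :env -b :bind --control-url :control_url --control-token :control_token',
)
puts line.command(
  threads: '16',
  workers: '100',
  env: 'production',
  bind: 'tcp://0.0.0.0:3000?backlog=1024',
  control_url: 'tcp://127.0.0.1:9293',
  control_token: 'DefaultPumaControlToken',
)
# Prints something along the lines of:
#   bundle exec puma -t 16:16 -w 100 -e production -b tcp://0.0.0.0:3000?backlog=1024 \
#     --control-url tcp://127.0.0.1:9293 --control-token DefaultPumaControlToken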

I'm not sure about running multiple containers with fewer workers each (balanced by nginx), as that creates some inconveniences with logging.

Any ideas why this happens?

Some additional info
I've reproduced the issue and noticed that the last checkin time has not been updating since the application was deployed (it should update every 5 seconds, but the metrics below were copied more than 40 minutes ago). Also, the workers can be divided into groups of 1-3 workers; each group started 60-90 seconds apart.

Current number of workers: 100

{
   "started_at":"2024-04-01T10:00:40Z",
   "workers":27,
   "phase":0,
   "booted_workers":0,
   "old_workers":0,
   "worker_status":[
      {
         "started_at":"2024-04-01T10:00:46Z",
         "pid":27,
         "index":0,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:00:46Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:00:46Z",
         "pid":34,
         "index":1,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:00:46Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:01:16Z",
         "pid":43,
         "index":2,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:01:16Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:01:16Z",
         "pid":80,
         "index":3,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:01:16Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:01:46Z",
         "pid":85,
         "index":4,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:01:46Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:01:46Z",
         "pid":116,
         "index":5,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:01:46Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:01:46Z",
         "pid":122,
         "index":6,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:01:46Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:02:16Z",
         "pid":128,
         "index":7,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:02:16Z",
         "last_status":{
            
         }
      },
     <...>
      {
         "started_at":"2024-04-01T10:05:16Z",
         "pid":468,
         "index":26,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:05:16Z",
         "last_status":{
            
         }
      }
   ],
   "versions":{
      "puma":"6.2.0",
      "ruby":{
         "engine":"ruby",
         "version":"3.1.4",
         "patchlevel":223
      }
   }
}
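
For reference, the stats above come from Puma's control server. A minimal Ruby sketch for polling them (assuming the default control URL and token from the entrypoint above) could look like this:

require 'json'
require 'net/http'
require 'time'

# Assumes the defaults from docker_entrypoint.rb: tcp://127.0.0.1:9293 and DefaultPumaControlToken.
STATS_URI = URI('http://127.0.0.1:9293/stats?token=DefaultPumaControlToken')

loop do
  stats = JSON.parse(Net::HTTP.get(STATS_URI))
  booted = stats.fetch('worker_status', []).count { |w| w['booted'] }
  puts "#{Time.now.utc.iso8601} workers=#{stats['workers']} " \
       "booted_workers=#{stats['booted_workers']} booted_in_worker_status=#{booted}"
  sleep 5
end

Watching this alongside htop makes it easy to see whether booted_workers ever catches up or stays frozen, as in the output above.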

Some additional info 2
If the number of workers exceeds 100, I get the following errors during a "bad boot":
[screenshot: worker TimeOut and Out-Of-Sync errors]

These TimeOut and Out-Of-Sync errors don't appear with fewer workers, and they also don't appear every time a "bad boot" happens.
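
For context, the TimeOut messages suggest workers failing to check in within Puma's worker_timeout / worker_boot_timeout window, which defaults to 60 seconds. Below is a sketch of raising those limits in the Puma config, under the assumption that the workers are merely slow to boot rather than hung:

# Assumption: the timeouts come from slow worker boot, not a deadlock.
worker_timeout 120       # seconds a booted worker may go without checking in
worker_boot_timeout 300  # seconds a freshly forked worker gets to finish booting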

Environment

  • Server: (64C/128T) AMD EPYC 7713P
  • OS: CentOS 7, Linux kernel: 5.4.260
  • Docker version: 23.0.6
  • Puma version: 6.2.0
  • Ruby version: 3.1.4

Thanks in advance :)

Have you checked your system logs? Do they say anything interesting?

Are you giving Docker enough memory? Maybe this #3193 is happening?

Maybe you can try #3236 (comment)

@dentarg

Have you checked your system logs? Do they say anything interesting?

Decided to check it again - nothing in journalctl and dmesg :(

Are you giving Docker enough memory? Maybe this #3193 is happening?

Docker settings are at their defaults; I've only changed the file descriptor limit in sysctl (/etc/sysctl.d/99-custom.conf) and ulimit (/etc/security/limits.conf). According to docker stats, all of my containers can use the maximum amount of memory (128Gi).
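
One quick sanity check: limits raised via /etc/security/limits.conf on the host don't necessarily reach processes inside a container, so it can be worth reading them from Ruby inside the container itself (a minimal sketch):

# Run inside the container (e.g. in a Rails console or a one-off script)
# to see the file descriptor limit the Puma master process actually gets.
soft, hard = Process.getrlimit(:NOFILE)
puts "NOFILE: soft=#{soft} hard=#{hard}"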

Maybe you can try #3236 (comment)

Just tried this one.
I set 150 workers and 1 thread and made several attempts to launch the container.
The boot staggering works as expected, but Puma stops booting and spawning new workers after 38-49 workers (spawned at 5-second intervals, as expected). Last time it stopped at "workers":48 and "booted_workers":47. And, of course, I got nothing new in the system logs.

I think you need to dig deeper. Maybe https://github.com/amrabed/strace-docker can help

Can you try the same thing without Docker?