puma / puma

A Ruby/Rack web server built for parallelism

Home Page: https://puma.io

Large number of workers boot too slowly or do not boot the first time

snowboy932 opened this issue

Describe the bug
I have some machines with a 64C/128T CPU and 128GB RAM. I'm having trouble running a Rails application with Puma on them (in Docker, of course). When the number of workers exceeds 50, Puma gets stuck while booting: not all workers boot the first time, and I need to restart the Docker container once or twice to get it to boot properly.
When the issue occurs, the Puma control server shows, for example, 27 workers, and even those are not reported as booted. htop also shows only 27 workers, so it's not a problem with the control server metrics. Despite all of this, Puma is able to accept incoming connections.
No errors appear during a "bad boot", even at debug log level.

Puma config:

queue_requests false

preload_app!

before_fork do
  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.connection_pool.disconnect!
  end
end

after_worker_fork do
  require 'prometheus_exporter/instrumentation'
  PrometheusExporter::Instrumentation::Process.start(type: 'web')
end

after_worker_boot do
  require 'prometheus_exporter/instrumentation'
  PrometheusExporter::Instrumentation::Puma.start unless PrometheusExporter::Instrumentation::Puma.started?
end

require 'myapplication'

on_worker_boot do
  if defined?(ActiveRecord::Base)
    ActiveSupport.on_load(:active_record) do
      ActiveRecord::Base.establish_connection
    end

    PrometheusExporter::Instrumentation::Puma.start unless PrometheusExporter::Instrumentation::Puma.started?

    require 'prometheus_exporter/instrumentation'
    PrometheusExporter::Instrumentation::ActiveRecord.start(
      custom_labels: { type: 'puma_worker' },
      config_labels: %i[database host],
    )
  end
end

I'm launching Puma from a docker_entrypoint.rb file, via the function in it described below:

def web_server(app)
  socket_backlog = ENV.fetch('SOCKET_BACKLOG')
  port = ENV.fetch('PORT')
  default_bind = "tcp://0.0.0.0:#{port}?backlog=#{socket_backlog}"
  default_control_url = 'tcp://127.0.0.1:9293'
  default_control_token = 'DefaultPumaControlToken'

  run_command_within(
    app,
    'bundle exec puma',
    '-t :threads::threads -w :workers -e :env -b :bind --control-url :control_url --control-token :control_token',
    threads: ENV.fetch('THREADS'),
    workers: ENV.fetch('WORKERS'),
    env: ENV.fetch('RAILS_ENV'),
    bind: ENV.fetch('BIND', default_bind),
    control_url: ENV.fetch('PUMA_CONTROL_URL', default_control_url),
    control_token: ENV.fetch('PUMA_CONTROL_TOKEN', default_control_token),
  )
end

# <...>

def run_command_within(app, command, options_string = '', options = {})
  validate_app(app)

  command = Cocaine::CommandLine.new(
    command,
    options_string,
  ).command(options)

  Dir.chdir("apps/#{app}")
  exec(command)
end
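
For illustration, here's roughly what that interpolation produces. The environment values below are hypothetical, just to show the shape of the final command that exec() receives; the exact quoting depends on Cocaine's escaping:

require 'cocaine'

# Hypothetical values (THREADS=16, WORKERS=100, etc.) purely for illustration.
line = Cocaine::CommandLine.new(
  'bundle exec puma',
  '-t :threads::threads -w :workers -e :env -b :bind --control-url :control_url --control-token :control_token',
)
puts line.command(
  threads: '16',
  workers: '100',
  env: 'production',
  bind: 'tcp://0.0.0.0:3000?backlog=1024',
  control_url: 'tcp://127.0.0.1:9293',
  control_token: 'DefaultPumaControlToken',
)
# Prints something along the lines of:
#   bundle exec puma -t 16:16 -w 100 -e production -b tcp://0.0.0.0:3000?backlog=1024 \
#     --control-url tcp://127.0.0.1:9293 --control-token DefaultPumaControlToken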

I'm not sure about running multiple containers with fewer workers each (balanced by nginx), as that creates some inconveniences with logging.

Any ideas why this happens?

Some additional info
I've reproduced the issue and noticed that the last checkin time has not been updating since the application was deployed (it should update every 5 seconds, but the metrics below were copied more than 40 minutes ago). Also, the workers can be divided into groups of 1-3 workers; each group started 60-90 seconds apart.

Current number of workers: 100

{
   "started_at":"2024-04-01T10:00:40Z",
   "workers":27,
   "phase":0,
   "booted_workers":0,
   "old_workers":0,
   "worker_status":[
      {
         "started_at":"2024-04-01T10:00:46Z",
         "pid":27,
         "index":0,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:00:46Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:00:46Z",
         "pid":34,
         "index":1,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:00:46Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:01:16Z",
         "pid":43,
         "index":2,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:01:16Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:01:16Z",
         "pid":80,
         "index":3,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:01:16Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:01:46Z",
         "pid":85,
         "index":4,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:01:46Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:01:46Z",
         "pid":116,
         "index":5,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:01:46Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:01:46Z",
         "pid":122,
         "index":6,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:01:46Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:02:16Z",
         "pid":128,
         "index":7,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:02:16Z",
         "last_status":{
            
         }
      },
     <...>
      {
         "started_at":"2024-04-01T10:05:16Z",
         "pid":468,
         "index":26,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:05:16Z",
         "last_status":{
            
         }
      }
   ],
   "versions":{
      "puma":"6.2.0",
      "ruby":{
         "engine":"ruby",
         "version":"3.1.4",
         "patchlevel":223
      }
   }
}
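
For reference, the stats above come from Puma's control server. A minimal Ruby sketch for polling them (assuming the default control URL and token from the entrypoint above) could look like this:

require 'json'
require 'net/http'
require 'time'

# Assumes the defaults from docker_entrypoint.rb: tcp://127.0.0.1:9293 and DefaultPumaControlToken.
STATS_URI = URI('http://127.0.0.1:9293/stats?token=DefaultPumaControlToken')

loop do
  stats = JSON.parse(Net::HTTP.get(STATS_URI))
  booted = stats.fetch('worker_status', []).count { |w| w['booted'] }
  puts "#{Time.now.utc.iso8601} workers=#{stats['workers']} " \
       "booted_workers=#{stats['booted_workers']} booted_in_worker_status=#{booted}"
  sleep 5
end

Watching this alongside htop makes it easy to see whether booted_workers ever catches up or stays frozen, as in the output above.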

Some additional info 2
If the number of workers exceeds 100, I get the following errors during a "bad boot":
[screenshot: worker TimeOut and Out-Of-Sync errors]

These TimeOut and Out-Of-Sync errors don't appear with fewer workers, and they also don't appear every time a "bad boot" happens.
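
For context, the TimeOut messages suggest workers failing to check in within Puma's worker_timeout / worker_boot_timeout window, which defaults to 60 seconds. Below is a sketch of raising those limits in the Puma config, under the assumption that the workers are merely slow to boot rather than hung:

# Assumption: the timeouts come from slow worker boot, not a deadlock.
worker_timeout 120       # seconds a booted worker may go without checking in
worker_boot_timeout 300  # seconds a freshly forked worker gets to finish booting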

Environment

  • Server: (64C/128T) AMD EPYC 7713P
  • OS: CentOS 7, Linux kernel: 5.4.260
  • Docker version: 23.0.6
  • Puma version: 6.2.0
  • Ruby version: 3.1.4

Thanks in advance :)

Have you checked your system logs? Do they say anything interesting?

Are you giving Docker enough memory? Maybe this #3193 is happening?

Maybe you can try #3236 (comment)

@dentarg

Have you checked your system logs? Do they say anything interesting?

Decided to check it again - nothing in journalctl and dmesg :(

Are you giving Docker enough memory? Maybe this #3193 is happening?

Docker settings are at their defaults; I've only changed the file descriptor limit in sysctl (/etc/sysctl.d/99-custom.conf) and ulimit (/etc/security/limits.conf). According to docker stats, all of my containers can use the maximum amount of memory (128Gi).
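
One quick sanity check: limits raised via /etc/security/limits.conf on the host don't necessarily reach processes inside a container, so it can be worth reading them from Ruby inside the container itself (a minimal sketch):

# Run inside the container (e.g. in a Rails console or a one-off script)
# to see the file descriptor limit the Puma master process actually gets.
soft, hard = Process.getrlimit(:NOFILE)
puts "NOFILE: soft=#{soft} hard=#{hard}"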

Maybe you can try #3236 (comment)

Just tried this one.
I set 150 workers and 1 thread and made several attempts to launch the container.
The boot staggering works as expected, but Puma stops booting and spawning new workers after 38-49 workers (spawned at 5-second intervals, as expected). Last time it stopped at "workers":48 and "booted_workers":47. And, of course, I got nothing new in the system logs.

I think you need to dig deeper. Maybe https://github.com/amrabed/strace-docker can help

Can you try the same thing without Docker?