After deployment, Monit always waits until its timeout for puma to complete its restart
kzkn opened this issue
I use v1.20.3 with puma and delayed_job.
Expected Behavior
Monit should wait only until puma finishes starting, not until its timeout expires.
Actual Behavior
Monit always waits until its timeout (90 seconds) for puma to complete its restart.
This delays the subsequent restart of the delayed_job processes.
Oct 14 17:41:25 myapp-server monit[920]: 'myapp-server' Monit reloaded
Oct 14 17:42:01 myapp-server CRON[27137]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 14 17:42:01 myapp-server CRON[27138]: (root) CMD (flock --nonblock /var/lib/aws/opsworks/lockrun.lock /opt/aws/opsworks/current/bin/opsworks-agent-updater)
Oct 14 17:42:01 myapp-server CRON[27137]: pam_unix(cron:session): session closed for user root
Oct 14 17:42:15 myapp-server sshd[27049]: Connection reset by 222.187.238.58 port 38644 [preauth]
Oct 14 17:42:49 myapp-server crontab[27235]: (deploy) LIST (deploy)
Oct 14 17:42:50 myapp-server crontab[27236]: (deploy) REPLACE (deploy)
Oct 14 17:42:50 myapp-server monit[27237]: Skipping 'allow localhost' -- host resolved to [::ffff:127.0.0.1] which is present in ACL already
Oct 14 17:42:50 myapp-server monit[920]: 'puma_rails' restart on user request
Oct 14 17:42:50 myapp-server monit[920]: Monit daemon with PID 920 awakened
Oct 14 17:42:50 myapp-server monit[920]: Awakened by User defined signal 1
Oct 14 17:42:50 myapp-server monit[920]: 'puma_rails' trying to restart
Oct 14 17:42:50 myapp-server monit[920]: 'puma_rails' stop: '/bin/sh -c cat /run/lock/rails/puma.pid | xargs --no-run-if-empty kill -QUIT; sleep 5'
Oct 14 17:42:50 myapp-server monit[27244]: Skipping 'allow localhost' -- host resolved to [::ffff:127.0.0.1] which is present in ACL already
Oct 14 17:42:50 myapp-server monit[920]: 'delayed_job_rails-1' restart on user request
Oct 14 17:42:50 myapp-server monit[920]: Monit daemon with PID 920 awakened
Oct 14 17:42:50 myapp-server systemd[1]: Reloading nginx - high performance web server.
-- Subject: Unit nginx.service has begun reloading its configuration
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit nginx.service has begun reloading its configuration
Oct 14 17:42:50 myapp-server systemd[1]: Reloaded nginx - high performance web server.
-- Subject: Unit nginx.service has finished reloading its configuration
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit nginx.service has finished reloading its configuration
--
-- The result is RESULT.
Oct 14 17:42:50 myapp-server sudo[26637]: pam_unix(sudo:session): session closed for user root
Oct 14 17:42:55 myapp-server monit[920]: 'puma_rails' start: '/bin/sh -c cd /srv/www/rails/current && MYENV=...
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] Puma starting in cluster mode...
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] * Puma version: 5.5.2 (ruby 2.7.4-p191) ("Zawgyi")
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] * Min threads: 0
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] * Max threads: 16
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] * Environment: staging
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] * Master PID: 27280
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] * Workers: 2
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] * Restarts: (✔) hot (✖) phased
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] * Preloading application
Oct 14 17:43:00 myapp-server puma-rails[27281]: [27280] * Listening on unix:///srv/www/rails/shared/sockets/puma.sock
Oct 14 17:43:00 myapp-server puma-rails[27281]: [27280] ! WARNING: Detected 1 Thread(s) started in app boot:
Oct 14 17:43:00 myapp-server puma-rails[27281]: [27280] ! #<Rack::MiniProfiler::MemoryStore::CacheCleanupThread:0x000055fed3e38ca8 /srv/www/rails/shared/vendor/bundle/ruby/2.7.0/g
Oct 14 17:43:00 myapp-server puma-rails[27281]: [27280] Use Ctrl-C to stop
Oct 14 17:43:01 myapp-server cron[12893]: (deploy) RELOAD (crontabs/deploy)
Oct 14 17:43:01 myapp-server CRON[27313]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 14 17:43:01 myapp-server CRON[27314]: (root) CMD (flock --nonblock /var/lib/aws/opsworks/lockrun.lock /opt/aws/opsworks/current/bin/opsworks-agent-updater)
Oct 14 17:43:01 myapp-server CRON[27313]: pam_unix(cron:session): session closed for user root
Oct 14 17:44:01 myapp-server CRON[27345]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 14 17:44:01 myapp-server CRON[27346]: (root) CMD (flock --nonblock /var/lib/aws/opsworks/lockrun.lock /opt/aws/opsworks/current/bin/opsworks-agent-updater)
Oct 14 17:44:01 myapp-server CRON[27345]: pam_unix(cron:session): session closed for user root
Oct 14 17:44:25 myapp-server monit[920]: 'puma_rails' failed to start (exit status -1) -- Program '/bin/sh -c cd /srv/www/rails/current && MYENV=...
Oct 14 17:44:25 myapp-server monit[920]: 'puma_rails' restart action failed
Oct 14 17:44:25 myapp-server monit[920]: 'delayed_job_rails-1' trying to restart
Oct 14 17:44:25 myapp-server monit[920]: 'delayed_job_rails-1' stop: '/bin/su - deploy -c cd /srv/www/rails/current && MYENV=...
Oct 14 17:44:25 myapp-server su[27358]: Successful su for deploy by root
Oct 14 17:44:25 myapp-server su[27358]: + ??? root:deploy
Oct 14 17:44:25 myapp-server su[27358]: pam_unix(su:session): session opened for user deploy by (uid=0)
Custom JSON
{
  "appserver": {
    "dot_env": true,
    "worker_processes": 2
  },
  "worker": {
    "adapter": "delayed_job",
    "process_count": 1
  }
}
Chef log including error
@olbrich Does the fix you created recently apply here (I mean, does it resolve the issue)? Or is it something unrelated?
@ajgon I don't think I did anything that would explicitly fix this, but it's possible that it improved because the monit handling is a bit better now. I would recommend trying to reproduce with the latest version.
Also, the line
Oct 14 17:44:25 myapp-server monit[920]: 'puma_rails' restart action failed
in the output makes me think that the deploy is failing, so I'm not entirely sure what happens in that situation. I think we stop workers before a deploy, and it may be that monit eventually figures out that the process should be running and tries to restart it on a failed deploy.
I will try v1.21.1, thanks!
In the output makes me think that the deploy is failing, so I'm not entirely sure what happens in that situation.
In my case, the deploy is not failing. The deploy succeeds, and the puma and delayed_job processes are running correctly.
I tried v1.21.1, but the puma process failed to start when creating a new instance 😢
Outputs:
ubuntu@myapp-server:~$ sudo monit summary
┌─────────────────────────────────┬────────────────────────────┬───────────────┐
│ Service Name │ Status │ Type │
├─────────────────────────────────┼────────────────────────────┼───────────────┤
│ myapp-server │ OK │ System │
├─────────────────────────────────┼────────────────────────────┼───────────────┤
│ puma_rails │ Execution failed | Does... │ Process │
├─────────────────────────────────┼────────────────────────────┼───────────────┤
│ opsworks-agent │ OK │ Process │
├─────────────────────────────────┼────────────────────────────┼───────────────┤
│ opsworks-agent-master-running │ OK │ Process │
├─────────────────────────────────┼────────────────────────────┼───────────────┤
│ delayed_job_rails-1 │ OK │ Process │
├─────────────────────────────────┼────────────────────────────┼───────────────┤
│ opsworks-agent-statistic-dae... │ OK │ File │
├─────────────────────────────────┼────────────────────────────┼───────────────┤
│ opsworks-agent-keep-alive-da... │ OK │ File │
└─────────────────────────────────┴────────────────────────────┴───────────────┘
ubuntu@myapp-server:~$ sudo monit status puma_rails
Process 'puma_rails'
status Execution failed | Does not exist
monitoring status Monitored
monitoring mode active
on reboot start
data collected Mon, 18 Oct 2021 12:37:44
ubuntu@myapp-server:~$ sudo journalctl | grep puma_rails
Oct 18 21:33:58 myapp-server monit[961]: 'puma_rails' process is not running
Oct 18 21:33:58 myapp-server monit[961]: 'puma_rails' trying to restart
Oct 18 21:33:58 myapp-server monit[961]: 'puma_rails' restart: '/bin/sh -c cd /srv/www/rails/current && AWS_ACCESS_KEY_ID=...'
Oct 18 21:35:28 myapp-server monit[961]: 'puma_rails' failed to restart (exit status 2) -- '/bin/sh -c cd /srv/www/rails/current && AWS_ACCESS_KEY_ID=...
Oct 18 21:36:13 myapp-server monit[961]: 'puma_rails' start on user request
Oct 18 21:36:13 myapp-server monit[961]: 'puma_rails' process is not running
Oct 18 21:36:13 myapp-server monit[961]: 'puma_rails' trying to restart
Oct 18 21:36:13 myapp-server monit[961]: 'puma_rails' restart: '/bin/sh -c cd /srv/www/rails/current && AWS_ACCESS_KEY_ID=...'
Oct 18 21:37:44 myapp-server monit[961]: 'puma_rails' failed to restart (exit status 0) --
Oct 18 21:38:44 myapp-server monit[961]: 'puma_rails' process is not running
ubuntu@myapp-server:~$ ps auxw|grep puma
ubuntu 26696 0.0 0.0 14860 1056 pts/0 S+ 21:47 0:00 grep --color=auto puma
ubuntu@myapp-server:~$ cat /run/lock/rails/puma.pid
cat: /run/lock/rails/puma.pid: No such file or directory
Do the puma logs give you any indication of what the failure was?
@olbrich
The puma log file did not exist. Is there anything else I should look at?
# puma.rb
pidfile "/run/lock/rails/puma.pid"
state_path "/run/lock/rails/puma.state"
stdout_redirect "/srv/www/rails/shared/log/puma.stdout.log", "/srv/www/rails/shared/log/puma.stderr.log", true
deploy@myapp-server:~$ cat /run/lock/rails/puma.pid
cat: /run/lock/rails/puma.pid: No such file or directory
deploy@myapp-server:~$ cat /run/lock/rails/puma.state
cat: /run/lock/rails/puma.state: No such file or directory
deploy@myapp-server:~$ cat /srv/www/rails/shared/log/puma.stdout.log
cat: /srv/www/rails/shared/log/puma.stdout.log: No such file or directory
deploy@myapp-server:~$ cat /srv/www/rails/shared/log/puma.stderr.log
cat: /srv/www/rails/shared/log/puma.stderr.log: No such file or directory
pumactl restart fails because the puma.state file does not exist:
https://github.com/puma/puma/blob/a6cdca85a15b2c277f64a5fa7572effb786fa9a1/lib/puma/control_cli.rb#L148
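That guard can be sketched in shell (the state-file path comes from the puma.rb above; the function name is hypothetical and only mirrors the check in puma's control_cli, it is not puma's actual code):

```shell
#!/bin/sh
# `pumactl restart` needs an existing state file, while a plain start does not,
# so the command that can succeed depends on whether the file exists.
choose_puma_command() {
  state_file="$1"
  if [ -f "$state_file" ]; then
    echo "restart"  # a previous master wrote its state; a hot restart can work
  else
    echo "start"    # nothing to restart yet; only a cold start can succeed
  fi
}

# On a fresh instance the state file has never been written yet:
choose_puma_command /run/lock/rails/puma.state
```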
Oct 19 21:48:34 myapp-server monit[923]: 'puma_rails' process is not running
Oct 19 21:48:34 myapp-server monit[923]: 'puma_rails' trying to restart
Oct 19 21:48:34 myapp-server monit[923]: 'puma_rails' restart: '/bin/sh -c cd /srv/www/rails/current && AWS_ACCESS_KEY_ID=...
Oct 19 21:48:34 myapp-server puma-rails[15131]: State file not found: /run/lock/rails/puma.state
It looks like monit start is not executed on the first deploy. When I manually run sudo monit start puma_rails, I see the following logs:
Oct 19 22:00:07 myapp-server monit[923]: 'puma_rails' start: '/bin/sh -c cd /srv/www/rails/current && AWS_ACCESS_KEY_ID=...
Puma runs successfully after the above log, and the puma.pid and puma.state files are created.
sudo journalctl | grep "'puma_rails' start:" only matches the logs from my manual monit start runs.
This seems to be similar to the problem we addressed in #251.
Oct 19 23:03:54 myapp-server monit[960]: 'myapp-server' Monit reloaded
Oct 19 23:03:54 myapp-server monit[960]: 'puma_rails' process is not running
Oct 19 23:03:54 myapp-server monit[960]: 'puma_rails' trying to restart
Oct 19 23:03:54 myapp-server monit[960]: 'puma_rails' restart: '/bin/sh -c cd /srv/www/rails/current && AWS_ACCESS_KEY_ID=...
Oct 19 23:05:24 myapp-server monit[960]: 'puma_rails' failed to restart (exit status 0) -- '/bin/sh -c cd /srv/www/rails/current && AWS_ACCESS_KEY_ID=...
...
Oct 19 23:06:07 myapp-server monit[960]: 'puma_rails' start on user request
Oct 19 23:06:07 myapp-server monit[960]: Monit daemon with PID 960 awakened
Oct 19 23:06:07 myapp-server monit[960]: Awakened by User defined signal 1
Oct 19 23:06:07 myapp-server monit[960]: Reinitializing Monit -- control file '/etc/monit/monitrc'
Oct 19 23:06:07 myapp-server monit[25881]: Skipping 'allow localhost' -- host resolved to [::ffff:127.0.0.1] which is present in ACL already
Oct 19 23:06:07 myapp-server monit[960]: 'delayed_job_rails-1' restart on user request
Oct 19 23:06:07 myapp-server monit[960]: Monit daemon with PID 960 awakened
Oct 19 23:06:07 myapp-server monit[960]: Skipping 'allow localhost' -- host resolved to [::ffff:127.0.0.1] which is present in ACL already
Oct 19 23:06:07 myapp-server monit[960]: 'myapp-server' Monit reloaded
Oct 19 23:06:07 myapp-server monit[960]: 'puma_rails' process is not running
Oct 19 23:06:07 myapp-server monit[960]: 'puma_rails' trying to restart
Oct 19 23:06:07 myapp-server monit[960]: 'puma_rails' restart: '/bin/sh -c cd /srv/www/rails/current && AWS_ACCESS_KEY_ID=...
monit start puma_rails is requested, but monit restart runs instead of monit start. The start looks like it is cancelled because Monit is reinitializing ("Reinitializing Monit") immediately after the start is requested. After that, monit repeatedly runs monit restart by itself without the puma.state file.
I assume this is because monit reload is called twice, once for puma and once for delayed_job. The monit reload is now called immediately via notifies from the configure hook of appserver and worker. When I made the change to call monit reload only once, the problem went away.
Another approach that worked well was to change the command set for the restart program in puma's monitrc so that it dynamically switches between pumactl start and pumactl restart depending on the presence of the puma.state file. See this diff for details.
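The linked diff isn't reproduced here, but the idea can be sketched against the puma_rails service seen in the logs (service name and paths are taken from the logs above; the environment variables and timeouts of the real monitrc are elided, so treat this as an assumption about the actual file):

```
check process puma_rails with pidfile /run/lock/rails/puma.pid
  start program = "/bin/sh -c 'cd /srv/www/rails/current && bundle exec pumactl -F config/puma.rb start'"
  stop program = "/bin/sh -c 'cat /run/lock/rails/puma.pid | xargs --no-run-if-empty kill -QUIT; sleep 5'"
  # Fall back to a cold start when puma has never written its state file,
  # since `pumactl restart` aborts with "State file not found" otherwise.
  restart program = "/bin/sh -c 'cd /srv/www/rails/current && if [ -f /run/lock/rails/puma.state ]; then bundle exec pumactl -S /run/lock/rails/puma.state restart; else bundle exec pumactl -F config/puma.rb start; fi'"
```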
@ajgon @olbrich
I can't decide which approach (or another approach) is better to use. What do you think?
...Or should we move from monit to systemd for puma? 😂
Going from monit to systemd would probably break a lot, especially because OpsWorks is not maintained very well by Amazon (to say the least), and I'm not sure how old the systemd versions there are :)
I like the conditional restart/start approach better, because then we are not required to remove/detect restarts in workers. @olbrich, what do you think?