After deployment, Monit always waits until its timeout for puma to complete its restart
kzkn opened this issue
I use v1.20.3 with puma and delayed_job.
Expected Behavior
Monit should wait only until puma finishes starting, not until its timeout expires.
Actual Behavior
Monit always waits until its timeout (90 seconds) for puma to complete its restart.
This delays the subsequent restart of the delayed_job processes.
Oct 14 17:41:25 myapp-server monit[920]: 'myapp-server' Monit reloaded
Oct 14 17:42:01 myapp-server CRON[27137]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 14 17:42:01 myapp-server CRON[27138]: (root) CMD (flock --nonblock /var/lib/aws/opsworks/lockrun.lock /opt/aws/opsworks/current/bin/opsworks-agent-updater)
Oct 14 17:42:01 myapp-server CRON[27137]: pam_unix(cron:session): session closed for user root
Oct 14 17:42:15 myapp-server sshd[27049]: Connection reset by 222.187.238.58 port 38644 [preauth]
Oct 14 17:42:49 myapp-server crontab[27235]: (deploy) LIST (deploy)
Oct 14 17:42:50 myapp-server crontab[27236]: (deploy) REPLACE (deploy)
Oct 14 17:42:50 myapp-server monit[27237]: Skipping 'allow localhost' -- host resolved to [::ffff:127.0.0.1] which is present in ACL already
Oct 14 17:42:50 myapp-server monit[920]: 'puma_rails' restart on user request
Oct 14 17:42:50 myapp-server monit[920]: Monit daemon with PID 920 awakened
Oct 14 17:42:50 myapp-server monit[920]: Awakened by User defined signal 1
Oct 14 17:42:50 myapp-server monit[920]: 'puma_rails' trying to restart
Oct 14 17:42:50 myapp-server monit[920]: 'puma_rails' stop: '/bin/sh -c cat /run/lock/rails/puma.pid | xargs --no-run-if-empty kill -QUIT; sleep 5'
Oct 14 17:42:50 myapp-server monit[27244]: Skipping 'allow localhost' -- host resolved to [::ffff:127.0.0.1] which is present in ACL already
Oct 14 17:42:50 myapp-server monit[920]: 'delayed_job_rails-1' restart on user request
Oct 14 17:42:50 myapp-server monit[920]: Monit daemon with PID 920 awakened
Oct 14 17:42:50 myapp-server systemd[1]: Reloading nginx - high performance web server.
-- Subject: Unit nginx.service has begun reloading its configuration
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit nginx.service has begun reloading its configuration
Oct 14 17:42:50 myapp-server systemd[1]: Reloaded nginx - high performance web server.
-- Subject: Unit nginx.service has finished reloading its configuration
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit nginx.service has finished reloading its configuration
--
-- The result is RESULT.
Oct 14 17:42:50 myapp-server sudo[26637]: pam_unix(sudo:session): session closed for user root
Oct 14 17:42:55 myapp-server monit[920]: 'puma_rails' start: '/bin/sh -c cd /srv/www/rails/current && MYENV=...
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] Puma starting in cluster mode...
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] * Puma version: 5.5.2 (ruby 2.7.4-p191) ("Zawgyi")
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] * Min threads: 0
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] * Max threads: 16
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] * Environment: staging
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] * Master PID: 27280
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] * Workers: 2
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] * Restarts: (✔) hot (✖) phased
Oct 14 17:42:55 myapp-server puma-rails[27281]: [27280] * Preloading application
Oct 14 17:43:00 myapp-server puma-rails[27281]: [27280] * Listening on unix:///srv/www/rails/shared/sockets/puma.sock
Oct 14 17:43:00 myapp-server puma-rails[27281]: [27280] ! WARNING: Detected 1 Thread(s) started in app boot:
Oct 14 17:43:00 myapp-server puma-rails[27281]: [27280] ! #<Rack::MiniProfiler::MemoryStore::CacheCleanupThread:0x000055fed3e38ca8 /srv/www/rails/shared/vendor/bundle/ruby/2.7.0/g
Oct 14 17:43:00 myapp-server puma-rails[27281]: [27280] Use Ctrl-C to stop
Oct 14 17:43:01 myapp-server cron[12893]: (deploy) RELOAD (crontabs/deploy)
Oct 14 17:43:01 myapp-server CRON[27313]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 14 17:43:01 myapp-server CRON[27314]: (root) CMD (flock --nonblock /var/lib/aws/opsworks/lockrun.lock /opt/aws/opsworks/current/bin/opsworks-agent-updater)
Oct 14 17:43:01 myapp-server CRON[27313]: pam_unix(cron:session): session closed for user root
Oct 14 17:44:01 myapp-server CRON[27345]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 14 17:44:01 myapp-server CRON[27346]: (root) CMD (flock --nonblock /var/lib/aws/opsworks/lockrun.lock /opt/aws/opsworks/current/bin/opsworks-agent-updater)
Oct 14 17:44:01 myapp-server CRON[27345]: pam_unix(cron:session): session closed for user root
Oct 14 17:44:25 myapp-server monit[920]: 'puma_rails' failed to start (exit status -1) -- Program '/bin/sh -c cd /srv/www/rails/current && MYENV=...
Oct 14 17:44:25 myapp-server monit[920]: 'puma_rails' restart action failed
Oct 14 17:44:25 myapp-server monit[920]: 'delayed_job_rails-1' trying to restart
Oct 14 17:44:25 myapp-server monit[920]: 'delayed_job_rails-1' stop: '/bin/su - deploy -c cd /srv/www/rails/current && MYENV=...
Oct 14 17:44:25 myapp-server su[27358]: Successful su for deploy by root
Oct 14 17:44:25 myapp-server su[27358]: + ??? root:deploy
Oct 14 17:44:25 myapp-server su[27358]: pam_unix(su:session): session opened for user deploy by (uid=0)
Custom JSON
{
  "appserver": {
    "dot_env": true,
    "worker_processes": 2
  },
  "worker": {
    "adapter": "delayed_job",
    "process_count": 1
  }
}
Chef log including error
@olbrich Does the fix you created recently apply here (I mean, does it resolve the issue)? Or is it something unrelated?
@ajgon I don't think I did anything that would explicitly fix this, but it's possible that it improved because the monit handling is a bit better now. I would recommend trying to reproduce with the latest version.
Also, the line
Oct 14 17:44:25 myapp-server monit[920]: 'puma_rails' restart action failed
in the output makes me think that the deploy is failing, so I'm not entirely sure what happens in that situation. I think we stop workers before a deploy, and it may be that monit eventually figures out that the process should be running and tries to restart it on a failed deploy.
I will try v1.21.1, thanks!
In the output makes me think that the deploy is failing, so I'm not entirely sure what happens in that situation.
In my case, the deploy is not failing. The deploy succeeds, and the puma and delayed_job processes are running correctly.
I tried v1.21.1, but the puma process failed to start when creating a new instance 😢
Outputs:
ubuntu@myapp-server:~$ sudo monit summary
┌─────────────────────────────────┬────────────────────────────┬───────────────┐
│ Service Name │ Status │ Type │
├─────────────────────────────────┼────────────────────────────┼───────────────┤
│ myapp-server │ OK │ System │
├─────────────────────────────────┼────────────────────────────┼───────────────┤
│ puma_rails │ Execution failed | Does... │ Process │
├─────────────────────────────────┼────────────────────────────┼───────────────┤
│ opsworks-agent │ OK │ Process │
├─────────────────────────────────┼────────────────────────────┼───────────────┤
│ opsworks-agent-master-running │ OK │ Process │
├─────────────────────────────────┼────────────────────────────┼───────────────┤
│ delayed_job_rails-1 │ OK │ Process │
├─────────────────────────────────┼────────────────────────────┼───────────────┤
│ opsworks-agent-statistic-dae... │ OK │ File │
├─────────────────────────────────┼────────────────────────────┼───────────────┤
│ opsworks-agent-keep-alive-da... │ OK │ File │
└─────────────────────────────────┴────────────────────────────┴───────────────┘
ubuntu@myapp-server:~$ sudo monit status puma_rails
Process 'puma_rails'
status Execution failed | Does not exist
monitoring status Monitored
monitoring mode active
on reboot start
data collected Mon, 18 Oct 2021 12:37:44
ubuntu@myapp-server:~$ sudo journalctl | grep puma_rails
Oct 18 21:33:58 myapp-server monit[961]: 'puma_rails' process is not running
Oct 18 21:33:58 myapp-server monit[961]: 'puma_rails' trying to restart
Oct 18 21:33:58 myapp-server monit[961]: 'puma_rails' restart: '/bin/sh -c cd /srv/www/rails/current && AWS_ACCESS_KEY_ID=...'
Oct 18 21:35:28 myapp-server monit[961]: 'puma_rails' failed to restart (exit status 2) -- '/bin/sh -c cd /srv/www/rails/current && AWS_ACCESS_KEY_ID=...
Oct 18 21:36:13 myapp-server monit[961]: 'puma_rails' start on user request
Oct 18 21:36:13 myapp-server monit[961]: 'puma_rails' process is not running
Oct 18 21:36:13 myapp-server monit[961]: 'puma_rails' trying to restart
Oct 18 21:36:13 myapp-server monit[961]: 'puma_rails' restart: '/bin/sh -c cd /srv/www/rails/current && AWS_ACCESS_KEY_ID=...'
Oct 18 21:37:44 myapp-server monit[961]: 'puma_rails' failed to restart (exit status 0) --
Oct 18 21:38:44 myapp-server monit[961]: 'puma_rails' process is not running
ubuntu@myapp-server:~$ ps auxw|grep puma
ubuntu 26696 0.0 0.0 14860 1056 pts/0 S+ 21:47 0:00 grep --color=auto puma
ubuntu@myapp-server:~$ cat /run/lock/rails/puma.pid
cat: /run/lock/rails/puma.pid: No such file or directory
Do the puma logs give you any indication of what the failure was?
@olbrich
The puma log file did not exist. Is there anything else I should look at?
# puma.rb
pidfile "/run/lock/rails/puma.pid"
state_path "/run/lock/rails/puma.state"
stdout_redirect "/srv/www/rails/shared/log/puma.stdout.log", "/srv/www/rails/shared/log/puma.stderr.log", true
deploy@myapp-server:~$ cat /run/lock/rails/puma.pid
cat: /run/lock/rails/puma.pid: No such file or directory
deploy@myapp-server:~$ cat /run/lock/rails/puma.state
cat: /run/lock/rails/puma.state: No such file or directory
deploy@myapp-server:~$ cat /srv/www/rails/shared/log/puma.stdout.log
cat: /srv/www/rails/shared/log/puma.stdout.log: No such file or directory
deploy@myapp-server:~$ cat /srv/www/rails/shared/log/puma.stderr.log
cat: /srv/www/rails/shared/log/puma.stderr.log: No such file or directory
pumactl restart fails because the puma.state file does not exist:
https://github.com/puma/puma/blob/a6cdca85a15b2c277f64a5fa7572effb786fa9a1/lib/puma/control_cli.rb#L148
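That guard can be sketched in shell (the state-file path comes from the puma.rb above; the function name is hypothetical and only mirrors the check in puma's control_cli, it is not puma's actual code):

```shell
#!/bin/sh
# `pumactl restart` needs an existing state file, while a plain start does not,
# so the command that can succeed depends on whether the file exists.
choose_puma_command() {
  state_file="$1"
  if [ -f "$state_file" ]; then
    echo "restart"  # a previous master wrote its state; a hot restart can work
  else
    echo "start"    # nothing to restart yet; only a cold start can succeed
  fi
}

# On a fresh instance the state file has never been written yet:
choose_puma_command /run/lock/rails/puma.state
```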
Oct 19 21:48:34 myapp-server monit[923]: 'puma_rails' process is not running
Oct 19 21:48:34 myapp-server monit[923]: 'puma_rails' trying to restart
Oct 19 21:48:34 myapp-server monit[923]: 'puma_rails' restart: '/bin/sh -c cd /srv/www/rails/current && AWS_ACCESS_KEY_ID=...
Oct 19 21:48:34 myapp-server puma-rails[15131]: State file not found: /run/lock/rails/puma.state
It looks like monit start is not executed on the first deploy. When I manually run sudo monit start puma_rails, I see the following logs:
Oct 19 22:00:07 myapp-server monit[923]: 'puma_rails' start: '/bin/sh -c cd /srv/www/rails/current && AWS_ACCESS_KEY_ID=...
Puma runs successfully after the above log, and the puma.pid and puma.state files are created.
sudo journalctl | grep "'puma_rails' start:" only matches the logs from my manual monit start runs.
This seems to be similar to the problem we addressed in #251.
Oct 19 23:03:54 myapp-server monit[960]: 'myapp-server' Monit reloaded
Oct 19 23:03:54 myapp-server monit[960]: 'puma_rails' process is not running
Oct 19 23:03:54 myapp-server monit[960]: 'puma_rails' trying to restart
Oct 19 23:03:54 myapp-server monit[960]: 'puma_rails' restart: '/bin/sh -c cd /srv/www/rails/current && AWS_ACCESS_KEY_ID=...
Oct 19 23:05:24 myapp-server monit[960]: 'puma_rails' failed to restart (exit status 0) -- '/bin/sh -c cd /srv/www/rails/current && AWS_ACCESS_KEY_ID=...
...
Oct 19 23:06:07 myapp-server monit[960]: 'puma_rails' start on user request
Oct 19 23:06:07 myapp-server monit[960]: Monit daemon with PID 960 awakened
Oct 19 23:06:07 myapp-server monit[960]: Awakened by User defined signal 1
Oct 19 23:06:07 myapp-server monit[960]: Reinitializing Monit -- control file '/etc/monit/monitrc'
Oct 19 23:06:07 myapp-server monit[25881]: Skipping 'allow localhost' -- host resolved to [::ffff:127.0.0.1] which is present in ACL already
Oct 19 23:06:07 myapp-server monit[960]: 'delayed_job_rails-1' restart on user request
Oct 19 23:06:07 myapp-server monit[960]: Monit daemon with PID 960 awakened
Oct 19 23:06:07 myapp-server monit[960]: Skipping 'allow localhost' -- host resolved to [::ffff:127.0.0.1] which is present in ACL already
Oct 19 23:06:07 myapp-server monit[960]: 'myapp-server' Monit reloaded
Oct 19 23:06:07 myapp-server monit[960]: 'puma_rails' process is not running
Oct 19 23:06:07 myapp-server monit[960]: 'puma_rails' trying to restart
Oct 19 23:06:07 myapp-server monit[960]: 'puma_rails' restart: '/bin/sh -c cd /srv/www/rails/current && AWS_ACCESS_KEY_ID=...
monit start puma_rails is requested, but monit restart runs instead of monit start. The start looks like it is cancelled because Monit is reinitializing ("Reinitializing Monit") immediately after the start is requested. After that, monit repeatedly runs monit restart by itself without the puma.state file.
I assume this is because monit reload is called twice, once for puma and once for delayed_job. The monit reload is now called immediately via notifies from the configure hook of appserver and worker. When I made the change to call monit reload only once, the problem went away.
Another approach that worked well was to change the command set for the restart program in puma's monitrc so that it dynamically switches between pumactl start and pumactl restart depending on the presence of the puma.state file. See this diff for details.
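The linked diff isn't reproduced here, but the idea can be sketched against the puma_rails service seen in the logs (service name and paths are taken from the logs above; the environment variables and timeouts of the real monitrc are elided, so treat this as an assumption about the actual file):

```
check process puma_rails with pidfile /run/lock/rails/puma.pid
  start program = "/bin/sh -c 'cd /srv/www/rails/current && bundle exec pumactl -F config/puma.rb start'"
  stop program = "/bin/sh -c 'cat /run/lock/rails/puma.pid | xargs --no-run-if-empty kill -QUIT; sleep 5'"
  # Fall back to a cold start when puma has never written its state file,
  # since `pumactl restart` aborts with "State file not found" otherwise.
  restart program = "/bin/sh -c 'cd /srv/www/rails/current && if [ -f /run/lock/rails/puma.state ]; then bundle exec pumactl -S /run/lock/rails/puma.state restart; else bundle exec pumactl -F config/puma.rb start; fi'"
```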
@ajgon @olbrich
I can't decide which approach (or another approach) is better to use. What do you think?
...Or should we move from monit to systemd for puma? 😂
Going from monit to systemd would probably break a lot, especially because OpsWorks is not maintained very well by Amazon (to say the least), and I'm not sure how old the systemd versions there are :)
I like the conditional restart/start approach better, because then we are not required to remove/detect restarts in workers. @olbrich, what do you think?