Slow worker respawn due to race condition in 4.x
sitano opened this issue · comments
Describe the bug
When something triggers a worker restart in 4.x, the manager process can take up to Const::WORKER_CHECK_INTERVAL (5 seconds by default) to respawn the worker. That is too long for a worker to be absent under load.
To Reproduce
Kill a worker, then watch with ps aux: the worker stays absent for up to Const::WORKER_CHECK_INTERVAL.
Expected behavior
Respawn the worker as soon as the parent observes SIGCHLD.
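To illustrate what "as soon as the parent observes SIGCHLD" could look like, here is a minimal standalone sketch using the self-pipe pattern. It is not Puma's code: the method name reap_on_sigchld, the pipe, and the check_interval / child_lifetime parameters are all made up for this example. The parent sleeps in IO.select with a 5 second timeout, but the SIGCHLD handler writes a byte to the pipe, so the wait ends shortly after the child exits instead of running the full interval.

```ruby
# Standalone sketch, not Puma's implementation. All names are hypothetical.
def reap_on_sigchld(check_interval: 5, child_lifetime: 0.2)
  wake_r, wake_w = IO.pipe

  # The handler only writes one byte to wake the waiting loop (self-pipe pattern).
  Signal.trap("CHLD") do
    begin
      wake_w.write_nonblock("!")
    rescue IO::WaitWritable, IOError
      # Pipe full or already closed: the loop is awake or finished, nothing to do.
    end
  end

  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  # Stand-in for a worker that exits on its own after a short time.
  pid = fork do
    sleep child_lifetime
    exit!(0)
  end

  # Wait for either the SIGCHLD wakeup byte or the periodic check timeout.
  IO.select([wake_r], nil, nil, check_interval)

  # Non-blocking reap: returns the pid if the child has exited, nil otherwise.
  reaped = Process.wait(pid, Process::WNOHANG)
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

  { pid: pid, reaped: reaped, elapsed: elapsed }
ensure
  Signal.trap("CHLD", "DEFAULT")
  [wake_r, wake_w].each { |io| io.close if io && !io.closed? }
end

if __FILE__ == $PROGRAM_NAME
  result = reap_on_sigchld
  puts format("reaped pid=%p after %.2fs", result[:reaped], result[:elapsed])
end
```

In a real manager loop, the point right after the non-blocking reap is where a replacement worker would be spawned.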
Example
I, [2022-12-22T17:22:52.304877 #89305] INFO : Worker idle timeout of 15 reached. Exiting... pid=89305
I, [2022-12-22T17:22:52.305826 #88491] INFO : spawned <-- wasted check_workers spawn
... nothing happens here....
I, [2022-12-22T17:22:57.318201 #88491] INFO : before spawn <-- next round
I, [2022-12-22T17:22:57.325574 #88491] INFO : forked
I, [2022-12-22T17:22:57.325631 #88491] INFO : hooks
I, [2022-12-22T17:22:57.325647 #88491] INFO : done
I, [2022-12-22T17:22:57.325678 #88491] INFO : spawned
I, [2022-12-22T17:22:57.328877 #89334] INFO : Server queue_requests=true, idle_timeout=10 pid=89334
[88491] - Worker 0 (pid: 89334) booted, phase: 0
Reason
A race between receiving the "t" command, the worker process actually exiting (SIGCHLD), and the parent checking for dead workers. Specifically:
when "t"
w.term unless w.term?
force_check = true
races with check_workers force_check and with the arrival of SIGCHLD. The forced check_workers call runs before the worker has finished exiting, so it cannot detect the exit and is wasted; the dead worker is only noticed on the next round.
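The timing can be reproduced outside Puma with a small standalone script. This is only a toy model under assumptions: forced_check_after_term and shutdown_delay are hypothetical names, and the child's TERM handler sleeps briefly to stand in for a worker that needs a moment to shut down. A non-blocking reap issued right after TERM finds nothing, which is the "wasted" check; only a later look sees the exit.

```ruby
# Toy reproduction of the timing, not Puma's code. All names are hypothetical.
def forced_check_after_term(shutdown_delay: 0.3)
  ready_r, ready_w = IO.pipe

  pid = fork do
    ready_r.close
    # Stand-in worker: needs shutdown_delay seconds to exit after TERM.
    Signal.trap("TERM") do
      sleep shutdown_delay
      exit!(0)
    end
    ready_w.write("r") # tell the parent the TERM handler is installed
    ready_w.close
    sleep
  end

  ready_w.close
  ready_r.read(1) # block until the child is ready to receive TERM
  ready_r.close

  Process.kill("TERM", pid)

  # "Forced check" right after TERM: the child has not exited yet,
  # so the non-blocking reap returns nil.
  immediate = Process.wait(pid, Process::WNOHANG)

  # A blocking wait stands in for a later check that finally sees the exit.
  later = Process.wait(pid)

  { pid: pid, immediate: immediate, later: later }
end

if __FILE__ == $PROGRAM_NAME
  result = forced_check_after_term
  puts "immediate check reaped: #{result[:immediate].inspect}"
  puts "later check reaped:     #{result[:later].inspect}"
end
```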
How to fix
Backport this patch 67f9b1f.
I am happy to help with the backport. Just let me know what you think.
I am not sure whether this version (4.x) is still supported; I am posting this mainly to let you know.
Only the last two versions are supported: https://github.com/puma/puma/blob/master/SECURITY.md
@dentarg OK, thank you for the info. I am closing this one then.