Puma workers are freezing

Question

m3nd3s opened this issue 2 years ago · comments

Describe the bug

Let's start with this picture:

I have a Rails application hosted on AWS using ECS (Elastic Container Service) with Docker. Puma is configured with 3 workers, each with 8 threads.

For some reason the workers sometimes freeze: they become unresponsive for about 2 minutes 30 seconds, until Puma detects the timeout and sends kill signals to stop all the workers.

So I would like to ask for help identifying what can cause this kind of problem. Is there any way to debug it?

There is another behavior: I'm not sure whether Puma stops receiving requests or keeps receiving them, but after the Puma timeout and the kill signal, a lot (hundreds) of requests fail on Nginx (Puma is behind Nginx) with a 429 HTTP status code.

Puma config:

Puma is configured with 3 workers and 8 threads (min and max).

To Reproduce

I don't know how to reproduce it; actually, identifying how to reproduce it would be very helpful.

Expected behavior

Desktop (please complete the following information):

OS: Linux (via docker)
Puma Version: 5.5.2

Patrik Ragnarsson · Answer 1 · Tue May 31 2022 05:58:04 GMT+0800 (China Standard Time)

> after the Puma timeout and the kill signal, a lot (hundreds) of requests fail on Nginx (Puma is behind Nginx) with a 429 HTTP status code.

That sounds like the requests in the socket backlog, see https://github.com/puma/puma/blob/4ac14482f1eda4bcf2d2baa3a379afe3f5b55a9c/docs/architecture.md

Patrik Ragnarsson · Answer 2 · Tue May 31 2022 05:59:45 GMT+0800 (China Standard Time)

> For some reason the workers sometimes freeze: they become unresponsive for about 2 minutes 30 seconds, until Puma detects the timeout and sends kill signals to stop all the workers. So I would like to ask for help identifying what can cause this kind of problem. Is there any way to debug it?

Maybe using https://github.com/zombocom/rack-timeout could give you some clues about where this is happening.
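To illustrate why a request timeout can point at the stuck code, here is a minimal sketch of the general idea using only Ruby's standard `Timeout` module. It is not rack-timeout's implementation; the class name `RequestDeadline` and its `seconds:` and `log:` parameters are made up for this example. When the deadline passes, the exception is raised in the thread running the request, so its backtrace typically shows roughly where the code was waiting.

```ruby
require "timeout"

# Hypothetical Rack-style middleware, for illustration only.
# It is NOT how rack-timeout is implemented.
class RequestDeadline
  def initialize(app, seconds:, log: $stderr)
    @app = app
    @seconds = seconds
    @log = log
  end

  def call(env)
    # Interrupt the downstream app if it takes longer than @seconds.
    Timeout.timeout(@seconds) { @app.call(env) }
  rescue Timeout::Error => e
    # Log the path and the first frames of the backtrace as a debugging clue.
    @log.puts "request to #{env['PATH_INFO']} exceeded #{@seconds}s"
    @log.puts Array(e.backtrace).first(10).join("\n")
    [503, { "content-type" => "text/plain" }, ["request timed out\n"]]
  end
end
```

Note that a timeout like this can only fire if the stuck code lets other Ruby threads run; it will not help if something holds the global VM lock the whole time.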

Almir Mendes · Answer 3 · Tue May 31 2022 06:11:59 GMT+0800 (China Standard Time)

> > after the Puma timeout and the kill signal, a lot (hundreds) of requests fail on Nginx (Puma is behind Nginx) with a 429 HTTP status code.
>
> That sounds like the requests in the socket backlog, see https://github.com/puma/puma/blob/4ac14482f1eda4bcf2d2baa3a379afe3f5b55a9c/docs/architecture.md

Thanks for helping.

Actually, Nginx is configured to communicate with Puma via a TCP connection:

upstream app {
  server app:3000;
}

Patrik Ragnarsson · Answer 4 · Tue May 31 2022 07:19:54 GMT+0800 (China Standard Time)

You still have a backlog
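A TCP listen socket queues completed connections in the kernel until the server process calls `accept`, so connections can pile up while the workers are frozen. The sketch below shows that with plain Ruby sockets on localhost. The helper name `backlog_demo` and the backlog size of 16 are arbitrary choices for the demo, not anything Puma or Nginx uses.

```ruby
require "socket"

# Returns the payloads that were queued in the kernel backlog
# before the server ever called accept.
def backlog_demo(client_count = 3)
  server = TCPServer.new("127.0.0.1", 0) # port 0: let the OS pick a free port
  server.listen(16)                      # backlog size is arbitrary for this demo
  port = server.addr[1]

  # Clients connect and write while the server is "frozen" (no accept yet).
  clients = Array.new(client_count) do |i|
    sock = TCPSocket.new("127.0.0.1", port)
    sock.write("request #{i}\n")
    sock
  end

  # The server "wakes up" and drains the queued connections.
  payloads = Array.new(client_count) do
    conn = server.accept
    line = conn.gets
    conn.close
    line.chomp
  end

  payloads
ensure
  clients&.each(&:close)
  server&.close
end
```

Every client connects successfully even though nothing is accepting yet, which is consistent with requests only failing once the timeout hits.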

gingerlime · Answer 5 · Thu Jun 02 2022 03:19:50 GMT+0800 (China Standard Time)

Not sure if it helps, but I was playing around with puma for testing purposes and also bumped into this. What seemed to have helped (quite a lot) in my case was to clear the tmp/cache folder.

Almir Mendes · Answer 6 · Thu Jun 02 2022 19:53:55 GMT+0800 (China Standard Time)

@gingerlime Maybe, I'll take a look.

Is there any way to verify if a puma worker is stuck or frozen?

Nate Berkopec · Answer 7 · Fri Jun 03 2022 02:17:28 GMT+0800 (China Standard Time)

> Is there any way to verify if a puma worker is stuck or frozen?

...ask it to check in via a pipe every 60 seconds? 😆

If it's not checking in via the checkpipe, then something is pretty wrong. It's also interesting that all 3 workers lock up at the same time.
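As a rough sketch of the check-in idea: the parent remembers when each worker last reported in and treats any worker that has been silent for longer than a timeout as stuck. This is only an illustration of the concept, not Puma's actual code; the `CheckinMonitor` class and the 60 second value in the usage below are assumptions.

```ruby
# Illustrative only: not Puma's actual implementation.
class CheckinMonitor
  def initialize(timeout_seconds)
    @timeout_seconds = timeout_seconds
    @last_checkin = {} # worker id => time of last check-in, in seconds
  end

  # Record that a worker reported in at the given time.
  def checkin(worker_id, at)
    @last_checkin[worker_id] = at
  end

  # Ids of workers whose last check-in is older than the timeout.
  def timed_out(now)
    @last_checkin.select { |_id, at| now - at > @timeout_seconds }.keys
  end
end

monitor = CheckinMonitor.new(60)
monitor.checkin(0, 0)
monitor.checkin(1, 0)
monitor.checkin(2, 50)
p monitor.timed_out(61) # workers 0 and 1 have been silent for more than 60s
```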

So what resources do all three of those workers share?

CPU
Memory (though this one seems unlikely to me; if you were out of memory, the OOM killer would probably just blast them first)
Something application specific? It would need to hold the global VM lock, otherwise our checkin would succeed.
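One way to look for an application-level cause is to dump what every Ruby thread is doing at the moment things look stuck. A minimal sketch using only core Ruby follows; `thread_report` is a hypothetical helper name, and how you trigger it (a signal handler, a console, a debug endpoint) is up to you. If native code is holding the global VM lock, Ruby code like this will not get a chance to run either, so an empty result is itself a hint.

```ruby
# Returns a printable report of what every live thread is doing.
def thread_report
  Thread.list.map do |thread|
    header = "Thread #{thread.object_id} name=#{thread.name.inspect} status=#{thread.status}"
    frames = (thread.backtrace || ["(no backtrace available)"]).map { |f| "    #{f}" }
    ([header] + frames).join("\n")
  end.join("\n\n")
end

# Assumption: you would wire this to a signal that your server does not
# already use, for example:
#   Signal.trap("SOME_UNUSED_SIGNAL") { $stderr.puts thread_report }
```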

In any case, probably not a puma issue but something with your application or setup. Good luck!