elixir-plug / plug_cowboy

Plug adapter for the Cowboy web server

Plug.Cowboy.Drainer shut down abnormally

dantswain opened this issue · comments

Hi!

My app has been shutting down seemingly at random. This only happens on my local machine (not on other devs' machines, not in production environments), so I suspect it has something to do with my setup.

There are no logs when my app shuts down, and it doesn't seem to correlate with any particular behavior. This has been happening for at least a couple of weeks of regular daily development.

I finally got a clue by enabling SASL logging:

[error] Child Plug.Cowboy.Drainer of Supervisor MyApp.Endpoint shut down abnormally
** (exit) killed
Pid: #PID<0.1011.0>
Start Call: Plug.Cowboy.Drainer.start_link([refs: [MyAppWeb.Endpoint.HTTP]])
Restart: :permanent
Shutdown: 5000
Type: :worker

[notice] Application my_app exited: shutdown

Based on what Plug.Cowboy.Drainer does, it seems like this error could be a red herring. Any ideas how I could get more diagnostic information? I strongly suspect this is a memory limit, running out of file descriptors, or something else specific to my environment, but I can't seem to get any more clues.

I'm using plug_cowboy 2.6.0 with Phoenix 1.7.0-rc.2 on Elixir 1.14.3 / OTP 25.

Can you try running with config :logger, handle_sasl_reports: true? My suspicion is that another child in your supervision tree is crashing, causing the tree to restart. During the restart, the supervisor has to shut down the current children. Cowboy cannot drain everything within the 5000ms shutdown window (probably because a browser connection is still open), and so the drainer is killed after 5 seconds.
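For reference, that flag goes in the Logger config, for example in config/dev.exs (the choice of file is just a suggestion for development):

# Forward SASL reports (supervisor and crash reports) to Logger so secondary
# failures like the drainer shutdown come with more surrounding context.
config :logger,
  handle_sasl_reports: true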

This is with config :logger, handle_sasl_reports: true 😬

iex(4)> Application.get_env(:logger, :handle_sasl_reports)
true

In the most recent crash, the log message just before this was [info] Sent 200 in 36ms. Open to trying anything else!

Hrm, that's very weird then! What could be killing it? Maybe you can play with Erlang's dbg or tracer and trace the messages between processes?

I agree!

My hunch is that I'm running out of file descriptors or hitting some other system-level limit. I don't know how else it would be so quiet...
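If it is a resource limit, a couple of quick checks from iex might help narrow that down (these are generic BEAM introspection calls, nothing specific to plug_cowboy):

# Rough resource snapshot: VM memory usage, process and port counts versus
# their limits, and the max file descriptors the IO subsystem started with
# (the max_fds key may vary or be absent depending on the OTP release).
IO.inspect(:erlang.memory(), label: "memory")
IO.inspect({:erlang.system_info(:process_count), :erlang.system_info(:process_limit)}, label: "processes")
IO.inspect({:erlang.system_info(:port_count), :erlang.system_info(:port_limit)}, label: "ports")
IO.inspect(:erlang.system_info(:check_io)[:max_fds], label: "max_fds")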

I've attached a tracer to my drainer process, so maybe that will show something. I don't think I can get away with tracing everything since it can be hours between crashes and I need to be able to work in the meantime 😓
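A minimal :dbg setup that produces the kind of events shown below might look like this sketch (the child lookup is an assumption based on the child id in the crash report; it may differ in other apps):

# Start the default dbg tracer, which prints trace events to the shell.
:dbg.tracer()

# Find the drainer pid under the endpoint supervisor.
{_id, drainer_pid, _type, _mods} =
  MyAppWeb.Endpoint
  |> Supervisor.which_children()
  |> Enum.find(fn {id, _pid, _type, _mods} -> id == Plug.Cowboy.Drainer end)

# Trace messages (:m), scheduling in/out (:running), process events such as
# exits (:procs), and add timestamps to every event.
:dbg.p(drainer_pid, [:m, :running, :procs, :timestamp])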

OK, that got a little more info!

(<0.763.0>) out {gen,do_call,4} (Timestamp: {1677,83170,904349})
(<0.763.0>) in {gen,do_call,4} (Timestamp: {1677,83170,904389})
(<0.763.0>) << {#Ref<0.23370919.4183031816.181956>,
                [{specs,1},{active,4},{supervisors,0},{workers,4}]} (Timestamp: {1677,
                                                                                 83170,
                                                                                 904398})
(<0.763.0>) out {timer,sleep,1} (Timestamp: {1677,83170,904448})
(<0.763.0>) in {timer,sleep,1} (Timestamp: {1677,83171,905187})
(<0.763.0>) << timeout (Timestamp: {1677,83171,905236})
(<0.763.0>) <0.661.0> ! {'$gen_call',{<0.763.0>,
                                      #Ref<0.23370919.4183031816.181957>},
                                     count_children} (Timestamp: {1677,83171,
                                                                  905324})
(<0.763.0>) out {gen,do_call,4} (Timestamp: {1677,83171,905357})
(<0.763.0>) in {gen,do_call,4} (Timestamp: {1677,83171,905403})
(<0.763.0>) << {#Ref<0.23370919.4183031816.181957>,
                [{specs,1},{active,4},{supervisors,0},{workers,4}]} (Timestamp: {1677,
                                                                                 83171,
                                                                                 905417})
(<0.763.0>) out {timer,sleep,1} (Timestamp: {1677,83171,905467})
(<0.763.0>) in {timer,sleep,1} (Timestamp: {1677,83172,906176})
(<0.763.0>) << timeout (Timestamp: {1677,83172,906224})
(<0.763.0>) <0.661.0> ! {'$gen_call',{<0.763.0>,
                                      #Ref<0.23370919.4183031816.181958>},
                                     count_children} (Timestamp: {1677,83172,
                                                                  906282})
(<0.763.0>) out {gen,do_call,4} (Timestamp: {1677,83172,906313})
(<0.763.0>) in {gen,do_call,4} (Timestamp: {1677,83172,906356})
(<0.763.0>) << {#Ref<0.23370919.4183031816.181958>,
                [{specs,1},{active,4},{supervisors,0},{workers,4}]} (Timestamp: {1677,
                                                                                 83172,
                                                                                 906368})
(<0.763.0>) out {timer,sleep,1} (Timestamp: {1677,83172,906414})
(<0.763.0>) in {timer,sleep,1} (Timestamp: {1677,83172,911083})
(<0.763.0>) exit killed (Timestamp: {1677,83172,911114})
(<0.763.0>) out_exited 0 (Timestamp: {1677,83172,911131})
[error] Child Plug.Cowboy.Drainer of Supervisor MyAppWeb.Endpoint shut down abnormally
** (exit) killed
Pid: #PID<0.763.0>
Start Call: Plug.Cowboy.Drainer.start_link([refs: [MyAppWeb.Endpoint.HTTP]])
Restart: :permanent
Shutdown: 5000
Type: :worker
[notice] Application my_app exited: shutdown

There's more before that, but the pattern repeats. I know pids aren't necessarily stable across restarts, but if I restart the application and the drainer gets the same pid (<0.763.0>), then the process on the other end of that message, <0.661.0>, appears to be ranch_conns_sup.
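One quick way to double-check what a pid is, while the node is still up, is to inspect its process info from iex (the pid below is the one from the trace; the $initial_call entry in the dictionary usually gives away the module):

suspect = pid("0.661.0")

# :registered_name is [] for unregistered processes; the process dictionary's
# $initial_call entry typically shows the implementing module instead.
Process.info(suspect, [:registered_name, :current_function, :dictionary])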

Hrm, this is weird because ranch has no idea about this process unless we are shutting it down!

Am I reading the trace correctly that the drainer is sending {'$gen_call', {...}, count_children} to <0.661.0>, or is it the other way around? It looks like the drainer could be doing that here: https://github.com/elixir-plug/plug_cowboy/blob/master/lib/plug/cowboy/drainer.ex#L103-L105

Right, but the point is that we should not be in that piece of the code unless the drainer started to terminate!
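For context, the pattern visible in the trace (a count_children call to the connections supervisor followed by a timer:sleep) matches a polling loop roughly like the sketch below; this is only a paraphrase, not the actual drainer source, and the module name is made up:

defmodule DrainSketch do
  # Ask the connections supervisor how many workers (connections) remain,
  # sleep for the check interval, and repeat until everything has drained or
  # the parent supervisor's shutdown timeout kills the draining process.
  def wait_until_drained(conns_sup, check_interval) do
    %{workers: workers} = Supervisor.count_children(conns_sup)

    if workers > 0 do
      Process.sleep(check_interval)
      wait_until_drained(conns_sup, check_interval)
    end
  end
end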

I'm trying to catch it in the act, but I think it might be getting triggered by or during the automatic rebuild cycle. Does that ring any bells?

Is the trace saying that the call is timing out? What might cause that to time out? It might hint at the root problem.

I'll keep digging here, but it's sort of hard: the digging is very distracting compared to the mild inconvenience of having to restart manually every so often 😂 Definitely open to ideas of things to try.

OK, I think I've actually tracked my problem down to another source: joken-elixir/joken_jwks#40

I'm wondering, though, if what I've seen here might hint at a problem in plug_cowboy? The traces make it look like a timeout is happening and then the drainer exits abnormally.

Right, that was my point from the beginning. This is definitely a secondary failure: something is making your app fail first, but I am not sure why you can't see that first failure! I am not sure there is a bug here; for all intents and purposes, the shutdown/kill is happening for the correct reasons.

I encountered this issue when a project didn't have the .env file used in the original dev environment.

N.B. this happened immediately when running mix phx.server, unlike the OP, who seemed to see this after deployment.