Timeout on registering a metric
jayashe opened this issue
Seeing a GenServer timeout via `ensure_registered`. What's weird is that the GenServer message (`{:subscribe,...}`) should only be sent when the metric hasn't been registered before, but the metric should already be registered because we have plenty of data points for the metric in question within the same session. This starts to happen when the system is under heavy load (and after the system has been running for several hours).
Trace:
GenServer Elixometer.Updater terminating
** (stop) exited in: GenServer.call(Elixometer, {:subscribe, ["my_app_prefix", "timers", "namespace", "of", "my", "module", "function"]}, 5000)
** (EXIT) time out
(elixir) lib/gen_server.ex:924: GenServer.call/3
(elixometer) lib/elixometer.ex:372: Elixometer.ensure_registered/2
(elixometer) lib/updater.ex:108: Elixometer.Updater.do_update/2
(elixir) lib/enum.ex:765: Enum.-each/2-lists^foreach/1-0-/2
(elixir) lib/enum.ex:765: Enum.each/2
(elixometer) lib/updater.ex:45: Elixometer.Updater.handle_info/2
(stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:711: :gen_server.handle_msg/6
Any thoughts on what could cause this?
Are you seeing any other crashes in your logs? Without more information, it's very hard to debug. Do you have any SASL or OTP reports?
How heavy is the load you're seeing? Elixometer depends on a single GenServer for subscriptions, which can be a bottleneck.
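One way to check whether that single GenServer is backing up is to look at its mailbox length while the system is under load. A minimal sketch, assuming the process is registered under the name `Elixometer` (as the `GenServer.call` target in the trace suggests); `QueueProbe` is a hypothetical helper, not part of Elixometer:

```elixir
# Hypothetical helper (not part of Elixometer) for inspecting a named
# process's mailbox from an attached IEx shell or a periodic task.
defmodule QueueProbe do
  # Returns the number of unprocessed messages waiting in the mailbox of the
  # process registered as `name`, or :not_running if no such process exists.
  def queue_len(name) when is_atom(name) do
    case Process.whereis(name) do
      nil ->
        :not_running

      pid ->
        case Process.info(pid, :message_queue_len) do
          {:message_queue_len, len} -> len
          # The process died between whereis/1 and info/2.
          nil -> :not_running
        end
    end
  end
end

# Assumption: the subscription server is registered as Elixometer.
IO.inspect(QueueProbe.queue_len(Elixometer), label: "Elixometer mailbox")
```

A value that keeps growing under load would suggest callers are queuing faster than the server can reply. `:not_running` at any point would suggest the process is down or restarting.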
@scohen thanks for the quick response. We should have a full crash dump tomorrow, which should provide more details (assuming we see the crash tomorrow at peak load time). Will report back.
The fact that you're seeing resubscriptions happen seems to indicate that elixometer's GenServer crashed and is getting backed up, which is why the `handle_call` timeout is being reached.
Just my guess though; I've never seen it crash in production; it should be able to handle tens to a hundred thousand messages a second.
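To illustrate how a backed-up GenServer produces this kind of exit, here is a self-contained sketch. `SlowServer` is a hypothetical stand-in with an artificial delay, not elixometer's actual implementation, and the timeouts are shortened so it runs quickly:

```elixir
# Hypothetical stand-in for a subscription server whose handle_call/3 is slow.
defmodule SlowServer do
  use GenServer

  def start_link(delay_ms), do: GenServer.start_link(__MODULE__, delay_ms)

  @impl true
  def init(delay_ms), do: {:ok, delay_ms}

  @impl true
  def handle_call({:subscribe, name}, _from, delay_ms) do
    # Simulate slow work; calls are handled one at a time, in arrival order.
    Process.sleep(delay_ms)
    {:reply, {:ok, name}, delay_ms}
  end
end

{:ok, pid} = SlowServer.start_link(200)

# Fill the mailbox with requests from other processes.
for i <- 1..5 do
  spawn(fn -> GenServer.call(pid, {:subscribe, ["metric", i]}, 5_000) end)
end

# Give the spawned callers time to enqueue their requests.
Process.sleep(50)

# This caller waits behind the queued requests and gives up after 100 ms.
result =
  try do
    GenServer.call(pid, {:subscribe, ["late", "metric"]}, 100)
  catch
    # GenServer.call/3 exits the caller with {:timeout, _} when no reply
    # arrives within the timeout.
    :exit, {:timeout, _} -> :timed_out
  end

IO.inspect(result, label: "late caller")
IO.inspect(Process.alive?(pid), label: "server still alive")
```

The server itself stays up here; it is the caller that exits. That matches the trace above, where `Elixometer.Updater` terminates because its call to `Elixometer` timed out.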