pinterest / elixometer

A light Elixir wrapper around exometer.

Home Page: https://hexdocs.pm/elixometer

Timeout on registering a metric

jayashe opened this issue

I'm seeing a GenServer timeout in ensure_registered. What's odd is that the GenServer message ({:subscribe,...}) should only be sent when the metric hasn't been registered before, yet this metric should already be registered: we have plenty of data points for it within the same session. This starts to happen when the system is under heavy load (and after the system has been running for several hours).

Trace:

GenServer Elixometer.Updater terminating
** (stop) exited in: GenServer.call(Elixometer, {:subscribe, ["my_app_prefix", "timers", "namespace", "of", "my", "module", "function"]}, 5000)
** (EXIT) time out
(elixir) lib/gen_server.ex:924: GenServer.call/3
(elixometer) lib/elixometer.ex:372: Elixometer.ensure_registered/2
(elixometer) lib/updater.ex:108: Elixometer.Updater.do_update/2
(elixir) lib/enum.ex:765: Enum.-each/2-lists^foreach/1-0-/2
(elixir) lib/enum.ex:765: Enum.each/2
(elixometer) lib/updater.ex:45: Elixometer.Updater.handle_info/2
(stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:711: :gen_server.handle_msg/6

Any thoughts on what could cause this?

Are you seeing any other crashes in your logs? Without more information, it's very hard to debug. Do you have any SASL or OTP reports?

How heavy is the load you're seeing? Elixometer depends on a single GenServer for subscriptions, which can be a bottleneck.
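One way to check whether that single process is the bottleneck is to look at its mailbox length while the system is under load. The sketch below is a hypothetical helper, not part of Elixometer; it assumes the subscription server is registered under the name `Elixometer`, as the `GenServer.call(Elixometer, ...)` line in the trace suggests.

```elixir
defmodule MailboxCheck do
  @moduledoc """
  Hypothetical helper (not part of Elixometer) that reports how many
  messages are waiting in a locally registered process's mailbox.
  """

  # Returns {:ok, length} when a process is registered under `name`,
  # or :not_running when no such process exists (or it has just died).
  def queue_len(name) when is_atom(name) do
    case Process.whereis(name) do
      nil ->
        :not_running

      pid ->
        case Process.info(pid, :message_queue_len) do
          {:message_queue_len, len} -> {:ok, len}
          # Process.info/2 returns nil if the process exited in between.
          nil -> :not_running
        end
    end
  end
end

# Usage, e.g. from a remote IEx shell on the affected node:
#   MailboxCheck.queue_len(Elixometer)
```

A queue length that keeps growing during peak load would point to the server falling behind; a value near zero would suggest looking elsewhere.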

@scohen thanks for the quick response. We should have a full crash dump tomorrow with more details (assuming we see the crash again at peak load time). Will report back.

The fact that you're seeing resubscriptions happen seems to indicate that Elixometer's GenServer crashed and is getting backed up, which is why the 5000 ms GenServer.call timeout is being reached.
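To illustrate the mechanism, here is a minimal, self-contained sketch of how a busy server makes `GenServer.call/3` exit with a timeout in the caller. `SlowServer` and `TimeoutDemo` are hypothetical modules written for this example, and the sleep merely stands in for a backed-up server; this is not how Elixometer itself is implemented.

```elixir
defmodule SlowServer do
  @moduledoc "Hypothetical server that takes `delay_ms` to answer each call."
  use GenServer

  def start_link(delay_ms), do: GenServer.start_link(__MODULE__, delay_ms)

  @impl true
  def init(delay_ms), do: {:ok, delay_ms}

  @impl true
  def handle_call(:ping, _from, delay_ms) do
    # Simulates a server that is busy or working through a long mailbox.
    Process.sleep(delay_ms)
    {:reply, :pong, delay_ms}
  end
end

defmodule TimeoutDemo do
  @moduledoc "Hypothetical caller that turns a call timeout into a tagged tuple."

  # GenServer.call/3 exits the caller with {:timeout, _} when no reply
  # arrives within `timeout_ms`. An uncaught exit is what terminates the
  # calling process, as with Elixometer.Updater in the trace above.
  def call_with_timeout(server, timeout_ms) do
    try do
      {:ok, GenServer.call(server, :ping, timeout_ms)}
    catch
      :exit, {:timeout, _} -> {:error, :timeout}
    end
  end
end

# Usage:
#   {:ok, pid} = SlowServer.start_link(200)
#   TimeoutDemo.call_with_timeout(pid, 50)
```

Note that the timeout fires in the caller, not in the server: the server still processes the request eventually, so callers that retry can add to the backlog.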

That's just my guess, though. I've never seen it crash in production, and it should be able to handle tens of thousands to a hundred thousand messages a second.