Erlang Port Exhaustion

Question

kamilkowalski opened this issue 4 years ago

When sending metrics over UDP fails, TelemetryMetricsStatsd opens a fresh socket to send the metrics over. I'm assuming this is to handle transient errors that are tied to that specific socket. However, the old socket is not closed before the new one is opened, so ports leak and the VM eventually runs out of them.

Combined with the Unix domain socket support I've implemented, and given that :gen_udp.send/4 returns an error if the Unix domain socket is not available, this led to what our production charts showed last week: port counts skyrocketed from 8k to 126k in a matter of minutes throughout our infrastructure (we were rolling out the Datadog agent, which made the sockets unavailable for a moment).
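
For anyone who wants to see the failing send in isolation, here is a minimal sketch. It assumes a Unix-like system, the `:local` address family of `:gen_udp`, and a made-up socket path that does not exist. The exact error reason may vary by platform, so the sketch only matches on the error tuple.

```elixir
# Hypothetical path: nothing is listening here, and the file does not exist.
missing_path = Path.join(System.tmp_dir!(), "no-such-statsd-#{System.unique_integer([:positive])}.sock")

# Open a datagram socket in the Unix domain (local) address family.
{:ok, socket} = :gen_udp.open(0, [:local, active: false])

# For local sockets the address is {:local, path} and the port must be 0.
result = :gen_udp.send(socket, {:local, missing_path}, 0, "example.metric:1|c")

case result do
  :ok -> IO.puts("send succeeded")
  {:error, reason} -> IO.puts("send failed: #{inspect(reason)}")
end

:gen_udp.close(socket)
```

Every such error is a point where a reporter that re-opens its socket would allocate one more port.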

To reproduce the issue:

1. Configure TelemetryMetricsStatsd with socket_path pointing to something that's not a Unix domain socket.
2. Open :observer.
3. Send some metrics over.
4. Observe the port count increasing with each error logged.
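
The effect in the steps above can also be watched without :observer. The sketch below does not use TelemetryMetricsStatsd at all: `LeakDemo` is a hypothetical stand-in that mimics the "open a new socket, forget the old one" behaviour, and it reads the port count through `:erlang.system_info(:port_count)`.

```elixir
defmodule LeakDemo do
  # Hypothetical stand-in for the error path described in this issue:
  # a replacement socket is opened and the old one is simply dropped.
  def reopen_without_close(_old_socket) do
    {:ok, new_socket} = :gen_udp.open(0, [:binary, active: false])
    new_socket
  end

  # Number of ports currently open in the VM.
  def port_count, do: :erlang.system_info(:port_count)
end

{:ok, socket} = :gen_udp.open(0, [:binary, active: false])
count_before = LeakDemo.port_count()

# Simulate 100 failed sends, each one triggering a re-open.
_last_socket =
  Enum.reduce(1..100, socket, fn _attempt, current ->
    LeakDemo.reopen_without_close(current)
  end)

leaked = LeakDemo.port_count() - count_before
IO.puts("ports leaked: #{leaked}")
```

The dropped sockets stay open because they are still owned by the live process that opened them, which is why the count keeps growing.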

The issue remains even when using regular UDP - it's just easier to reproduce with Unix domain sockets.

If I understand the socket re-opening correctly, we can close the old socket as soon as the new one is opened. I'll open a PR with a fix soon.
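
A minimal sketch of that idea, not the actual patch: `ReopenSketch.reopen/2` is a hypothetical helper, and the open options are placeholders. It closes the old socket only once the replacement has been opened successfully.

```elixir
defmodule ReopenSketch do
  # Hypothetical helper, not TelemetryMetricsStatsd's real code.
  # Opens a replacement socket and closes the old one on success.
  def reopen(old_socket, open_opts \\ [:binary, active: false]) do
    case :gen_udp.open(0, open_opts) do
      {:ok, new_socket} ->
        # Release the old port so it does not leak.
        :gen_udp.close(old_socket)
        {:ok, new_socket}

      {:error, _reason} = error ->
        # Opening failed: leave the old socket untouched for the caller.
        error
    end
  end
end

{:ok, old_socket} = :gen_udp.open(0, [:binary, active: false])
count_before = :erlang.system_info(:port_count)

{:ok, new_socket} = ReopenSketch.reopen(old_socket)
count_after = :erlang.system_info(:port_count)

IO.puts("port count before: #{count_before}, after: #{count_after}")
IO.inspect(:erlang.port_info(old_socket), label: "old socket")

:gen_udp.close(new_socket)
```

Closing after a successful open means there is always one usable socket, at the cost of briefly holding two ports.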

Kamil Kowalski · Answer 1 · Tue Jul 21 2020 04:05:35 GMT+0800 (China Standard Time)

@arkgil can you tell me why we're re-opening the socket when a UDP error occurs? Now that I'm reading into it, I'm not sure we should.

Arek Gil · Answer 2 · Tue Jul 21 2020 04:31:07 GMT+0800 (China Standard Time)

@kamilkowalski thanks for catching this! We do this just to be on the safe side - but we should absolutely close the old socket.