klarna / system_monitor

BEAM VM telemetry collector

Consider syncing changes from https://github.com/ieQu1/system_monitor

ieQu1 opened this issue · comments

commented

Hello,

We've made some significant performance optimizations in this application, mostly to support systems with a much larger number of processes.

  • The delta record has been shrunk to reduce memory usage: it now stores only the pid, the previous reductions, and the previous memory; everything else is derived from runtime data (see the sketch after this list)
  • Postgres operations are batched
  • Data collection has been moved to a separate process to avoid blocking the server with requests for data
  • Simplified sampling algorithm
  • Improved Postgres schema for application and function data (a non-backwards-compatible change)
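Roughly, the per-process delta now looks like the sketch below (field names are illustrative, not the exact ones in the fork):

```erlang
%% Illustrative only -- the actual field names live in the fork. The point is
%% that the per-process delta keeps just enough state (pid + previous counters)
%% to compute deltas at the next sample; everything else is re-read from
%% erlang:process_info/2 when needed.
-record(delta, {
          pid        :: pid(),
          reductions :: non_neg_integer(),  %% reductions at the previous sample
          memory     :: non_neg_integer()   %% memory at the previous sample
         }).
```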

Improvements:

  • "Very Important Processes": always collect metrics for a configurable list of registered processes
  • Added automatic tests
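As a sketch of what the VIP configuration might look like (the `vips` key and the process names are hypothetical, used only to illustrate "a configurable list of registered processes"; check the fork for the actual application env):

```erlang
%% sys.config sketch -- hypothetical key and process names, for illustration.
[{system_monitor,
  [{vips, [my_app_sup, my_dispatcher, my_db_worker]}
  ]}].
```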

Hello,

I'm sorry for not getting back to you sooner. I've taken a quick look at your repo, and the changes look nice.
Let's start with the batching of Postgres operations (ieQu1/system_monitor#9). Was there a problem with the buffer in system_monitor_pg?

commented

Hello,

We tried running the original version in AWS and observed multiple instances of uncontrollable queue growth. We narrowed the problem down to this line: https://github.com/ieQu1/system_monitor/pull/9/files#diff-082e02f4b708d7d67c27819cfb3cfc8ebd48b3140671fe51eefbaf454ceaac5aL100
The issue boils down to network latency: in the original implementation the round-trip time is added to each individual request, so the time to insert N entries is (t_q + 2*t_n) * N, where t_q is the time Postgres spends executing the actual query (it's minuscule) and t_n is the network latency.
In the updated version, all entries in a batch share a single round trip, so the total time to insert N entries is just t_q * N + 2*t_n. This solved our issue.
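To make the round-trip arithmetic concrete, here is a minimal Erlang sketch (not code from either repo; the table name `prc`, its columns, and the use of epgsql's `equery/3` are assumptions for illustration) contrasting per-row inserts with a single multi-row INSERT:

```erlang
%% Minimal sketch, not the actual system_monitor_pg code.
%% Rows are {Pid, Reductions, Memory}, with Pid already formatted as a binary.

%% Unbatched: every row pays its own 2*t_n round trip to Postgres.
insert_rows_one_by_one(Conn, Rows) ->
    lists:foreach(
      fun({Pid, Reds, Mem}) ->
              {ok, _} = epgsql:equery(
                          Conn,
                          "INSERT INTO prc (pid, reductions, memory) "
                          "VALUES ($1, $2, $3)",
                          [Pid, Reds, Mem])
      end, Rows).

%% Batched: one multi-row INSERT, so the round trip is paid once per batch.
insert_rows_batched(_Conn, []) ->
    ok;
insert_rows_batched(Conn, Rows) ->
    {Placeholders, Params} = placeholders(Rows),
    SQL = ["INSERT INTO prc (pid, reductions, memory) VALUES ",
           lists:join(", ", Placeholders)],
    {ok, _} = epgsql:equery(Conn, iolist_to_binary(SQL), Params),
    ok.

%% Build "($1,$2,$3)", "($4,$5,$6)", ... and the flat parameter list.
placeholders(Rows) ->
    {PHs, Params, _} =
        lists:foldl(
          fun({Pid, Reds, Mem}, {PHAcc, ParamAcc, I}) ->
                  PH = io_lib:format("($~b,$~b,$~b)", [I, I + 1, I + 2]),
                  {[PH | PHAcc], [Mem, Reds, Pid | ParamAcc], I + 3}
          end, {[], [], 1}, Rows),
    {lists:reverse(PHs), lists:reverse(Params)}.
```

With t_n around 1 ms and 10,000 sampled processes, the per-row loop spends roughly 20 seconds per snapshot on round trips alone, while the batched version pays the 2*t_n cost once per batch.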

I see the benefits of batching the Postgres requests and would like to merge that into our repo. When looking at the PR where you introduced Postgres batching, I noticed that you removed a buffer. The buffer was meant to hold data while there was no connection to Postgres.
We introduced it because we got gaps in the system monitor data when the Postgres connection was down. These gaps occurred during network instabilities at our cloud provider, which is exactly when system monitor data is extra valuable.
To this day, we have not noticed any problems with the buffer. Did you? Do you think we can include the batching of Postgres requests without removing the buffer that holds data while the Postgres connection is down?

commented

Sorry, I misunderstood the question then. If I remember correctly, we decided that it's better to lose some datapoints than to be OOM-killed when the queue grows (since the queue lives in RAM, and our system is sometimes deployed on very low-spec servers with little RAM).
We didn't run a separate test with the queue and batching combined; it was all changed in a single commit.

I've started to sync the changes here: #20
Let me know what you think. I'll continue with the smaller delta record when I'm back at work.

commented

P.S. If you want to revert the change that removed the queue, WDYT about using https://github.com/emqx/replayq ? It can offload data to disk and maintain a limit on the buffer size.
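A rough sketch of how that could look (the option and function names below are from memory of replayq's README and should be double-checked; the row handling and the reuse of a batched insert helper are assumptions):

```erlang
%% Rough sketch only -- verify the replayq API and the option that caps
%% the on-disk size against its README; `seg_bytes` and `count_limit`
%% here are from memory.

open_buffer(Dir) ->
    %% Disk-backed queue: survives a Postgres outage without growing in RAM.
    replayq:open(#{dir => Dir, seg_bytes => 8 * 1024 * 1024}).

buffer_rows(Q, Rows) ->
    %% replayq stores binaries, so serialize each row first.
    replayq:append(Q, [term_to_binary(Row) || Row <- Rows]).

flush_to_postgres(Conn, Q) ->
    {Q1, AckRef, Items} = replayq:pop(Q, #{count_limit => 1000}),
    Rows = [binary_to_term(I) || I <- Items],
    ok = insert_rows_batched(Conn, Rows),  %% the batched insert sketched above
    ok = replayq:ack(Q1, AckRef),          %% ack only after a successful insert
    Q1.
```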

The intention is that the buffer should not grow indefinitely. It has a max size that can be configured using an app env. As we haven't seen any problems with the buffer, I will not spend time fixing it. If you want to make a PR to start using replayq, I'll look at it.

Thanks for highlighting the changes in your fork of this repo. I am interested in some of the other stuff you have changed, but I need more time to pick these things out of your repo.
In the future, would it be possible for you to open pull requests for the changes that you would like to get into this repo?