klarna / system_monitor

BEAM VM telemetry collector

Consider syncing changes from https://github.com/ieQu1/system_monitor

ieQu1 opened this issue · comments

commented

Hello,

We've made some significant performance optimizations in this application, mostly to support systems with a much larger number of processes.

  • The delta record has been shrunk to reduce memory usage: it now stores only the pid, the previous reductions, and the previous memory; everything else is derived from runtime data (see the sketch after this list)
  • Postgres operations are batched
  • Data collection has been moved to a separate process to avoid blocking the server with requests for data
  • Simplified sampling algorithm
  • Improved Postgres schema for application and function data (a non-backwards-compatible change)
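Roughly, the per-process delta now looks like the sketch below (field names are illustrative, not the exact ones in the fork):

```erlang
%% Illustrative only -- the actual field names live in the fork. The point is
%% that the per-process delta keeps just enough state (pid + previous counters)
%% to compute deltas at the next sample; everything else is re-read from
%% erlang:process_info/2 when needed.
-record(delta, {
          pid        :: pid(),
          reductions :: non_neg_integer(),  %% reductions at the previous sample
          memory     :: non_neg_integer()   %% memory at the previous sample
         }).
```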

Improvements:

  • "Very Important Processes": always collect metrics for a configurable list of registered processes
  • Added automatic tests
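As a sketch of what the VIP configuration might look like (the `vips` key and the process names are hypothetical, used only to illustrate "a configurable list of registered processes"; check the fork for the actual application env):

```erlang
%% sys.config sketch -- hypothetical key and process names, for illustration.
[{system_monitor,
  [{vips, [my_app_sup, my_dispatcher, my_db_worker]}
  ]}].
```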

Hello,

I'm sorry for not getting back to you sooner. I've taken a quick look at your repo, and the changes look nice.
Let's start with the batching of Postgres operations (ieQu1/system_monitor#9). Was there a problem with the buffer in system_monitor_pg?

commented

Hello,

We tried running the original version in AWS and observed multiple instances of uncontrollable queue growth. We narrowed the problem down to this line: https://github.com/ieQu1/system_monitor/pull/9/files#diff-082e02f4b708d7d67c27819cfb3cfc8ebd48b3140671fe51eefbaf454ceaac5aL100
The issue boils down to network latency: in the original implementation the round-trip time is added to each individual request, so the time to insert N entries is (t_q + 2*t_n) * N, where t_q is the time Postgres spends executing the actual query (it's minuscule) and t_n is the network latency.
In the updated version, all entries in a batch share a single round trip, so the total time to insert N entries is just t_q * N + 2*t_n. This solved our issue.
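To make the round-trip arithmetic concrete, here is a minimal Erlang sketch (not code from either repo; the table name `prc`, its columns, and the use of epgsql's `equery/3` are assumptions for illustration) contrasting per-row inserts with a single multi-row INSERT:

```erlang
%% Minimal sketch, not the actual system_monitor_pg code.
%% Rows are {Pid, Reductions, Memory}, with Pid already formatted as a binary.

%% Unbatched: every row pays its own 2*t_n round trip to Postgres.
insert_rows_one_by_one(Conn, Rows) ->
    lists:foreach(
      fun({Pid, Reds, Mem}) ->
              {ok, _} = epgsql:equery(
                          Conn,
                          "INSERT INTO prc (pid, reductions, memory) "
                          "VALUES ($1, $2, $3)",
                          [Pid, Reds, Mem])
      end, Rows).

%% Batched: one multi-row INSERT, so the round trip is paid once per batch.
insert_rows_batched(_Conn, []) ->
    ok;
insert_rows_batched(Conn, Rows) ->
    {Placeholders, Params} = placeholders(Rows),
    SQL = ["INSERT INTO prc (pid, reductions, memory) VALUES ",
           lists:join(", ", Placeholders)],
    {ok, _} = epgsql:equery(Conn, iolist_to_binary(SQL), Params),
    ok.

%% Build "($1,$2,$3)", "($4,$5,$6)", ... and the flat parameter list.
placeholders(Rows) ->
    {PHs, Params, _} =
        lists:foldl(
          fun({Pid, Reds, Mem}, {PHAcc, ParamAcc, I}) ->
                  PH = io_lib:format("($~b,$~b,$~b)", [I, I + 1, I + 2]),
                  {[PH | PHAcc], [Mem, Reds, Pid | ParamAcc], I + 3}
          end, {[], [], 1}, Rows),
    {lists:reverse(PHs), lists:reverse(Params)}.
```

With t_n around 1 ms and 10,000 sampled processes, the per-row loop spends roughly 20 seconds per snapshot on round trips alone, while the batched version pays the 2*t_n cost once per batch.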

I see the benefits of batching the Postgres requests and would like to merge that into our repo. When looking at the PR where you introduced Postgres batching, I noticed that you removed a buffer. The buffer was meant to hold data while there was no connection to Postgres.
We introduced it because we got gaps in the system monitor data when the Postgres connection was down. These gaps occurred during network instabilities at our cloud provider, which is exactly when system monitor data is extra valuable.
To this day, we have not noticed any problems with the buffer. Did you? Do you think we can include the batching of Postgres requests without removing the buffer that holds data while the Postgres connection is down?

commented

Sorry, I misunderstood the question then. If I remember correctly, we decided that it's better to lose some datapoints than to be OOM-killed when the queue grows (since the queue lives in RAM, and our system is sometimes deployed on very low-spec servers with little RAM).
We didn't run a separate test with the queue and batching combined; it was all changed in a single commit.

I've started to sync the changes here: #20
Let me know what you think. I'll continue with the smaller delta record when I'm back at work.

commented

P.S. If you want to revert the change that removed the queue, WDYT about using https://github.com/emqx/replayq ? It can offload data to disk and maintain a limit on the buffer size.
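A rough sketch of how that could look (the option and function names below are from memory of replayq's README and should be double-checked; the row handling and the reuse of a batched insert helper are assumptions):

```erlang
%% Rough sketch only -- verify the replayq API and the option that caps
%% the on-disk size against its README; `seg_bytes` and `count_limit`
%% here are from memory.

open_buffer(Dir) ->
    %% Disk-backed queue: survives a Postgres outage without growing in RAM.
    replayq:open(#{dir => Dir, seg_bytes => 8 * 1024 * 1024}).

buffer_rows(Q, Rows) ->
    %% replayq stores binaries, so serialize each row first.
    replayq:append(Q, [term_to_binary(Row) || Row <- Rows]).

flush_to_postgres(Conn, Q) ->
    {Q1, AckRef, Items} = replayq:pop(Q, #{count_limit => 1000}),
    Rows = [binary_to_term(I) || I <- Items],
    ok = insert_rows_batched(Conn, Rows),  %% the batched insert sketched above
    ok = replayq:ack(Q1, AckRef),          %% ack only after a successful insert
    Q1.
```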

The intention is that the buffer should not grow indefinitely. It has a max size that can be configured using an app env. As we haven't seen any problems with the buffer, I will not spend time fixing it. If you want to make a PR to start using replayq, I'll look at it.

Thanks for highlighting the changes in your fork of this repo. I am interested in some of the other stuff you have changed, but I need more time to pick these things out of your repo.
In the future, would it be possible for you to open pull requests for the changes that you would like to get into this repo?