Memory Consumption Issue with Elastic Agent on Kubernetes with a high number of resources
rhr323 opened this issue
Issue Summary
In our testing on the serverless platform, we aimed to assess the maximum number of projects that can be supported on a single MKI cluster. We used Elastic Agent version 8.15-4-SNAPSHOT to mitigate previously identified memory issues.
Most Elastic Agent instances functioned without issues. However, on nodes hosting vector search projects, where a larger number of Elasticsearch instances and their associated Kubernetes resources (e.g., pods, deployments, services, secrets) are allocated, we observed the Elastic Agent running out of memory. This typically occurred when these nodes were hosting around 100 Elasticsearch instances.
Observed Behavior
- Elastic Agent on high-density nodes (around 100 Elasticsearch instances) experienced memory exhaustion and got stuck in a crash loop.
- Diagnostic data was collected from an Elastic Agent on a node with ~70 allocated projects at the time of capture.
Environment
- Elastic Agent version: 8.15-4-SNAPSHOT
- Kubernetes environment: Serverless platform, MKI cluster
- Node allocation: ~100 Elasticsearch instances per node for vector search projects
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
Having looked at the diagnostics and telemetry for one of the agent Pods encountering this problem, my initial thoughts are as follows:
- The metricbeat elasticsearch module itself is, unsurprisingly, using a good chunk of memory. We can probably optimize that.
- There's a lot of configuration churn on the elastic-agent side caused by the kubernetes variable provider, and that's most likely what leads to the agent itself using more memory than expected. I'm tackling that in #5835.
- In the beats themselves, needing to reload configuration frequently also adds memory consumption and possibly other kinds of disruption (if a scraper is supposed to fetch metrics every 10 seconds but we reload the config every 5 seconds, we're effectively scraping every 5 seconds instead). If the beats config manager could avoid restarting units it doesn't need to restart, this effect could be mitigated — see the sketch after this list.
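To illustrate that last point, here's a minimal sketch in Go of the "don't restart unchanged units" idea. The `Unit` type and `applyConfig` helper are hypothetical, not the actual beats config manager API — the point is just to fingerprint each unit's configuration and leave unchanged units running so their scrape timers aren't reset on every agent-side reload:

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// Unit is a hypothetical stand-in for a beat input/module unit; it is not
// the real beats config manager type.
type Unit struct {
	ID     string
	Config map[string]interface{}
	hash   [32]byte
}

// configHash produces a stable fingerprint of a unit's configuration.
// json.Marshal sorts map keys, so identical configs hash identically.
func configHash(cfg map[string]interface{}) [32]byte {
	b, _ := json.Marshal(cfg)
	return sha256.Sum256(b)
}

// applyConfig restarts only the units whose configuration actually changed,
// instead of tearing everything down on every reload from the agent.
// (Handling of removed units is omitted for brevity.)
func applyConfig(running map[string]*Unit, incoming []Unit) {
	for _, in := range incoming {
		newHash := configHash(in.Config)
		if cur, ok := running[in.ID]; ok && cur.hash == newHash {
			// Same config: leave the unit (and its scrape timer) alone.
			continue
		}
		// New or changed unit: (re)start it.
		fmt.Printf("restarting unit %s\n", in.ID)
		running[in.ID] = &Unit{ID: in.ID, Config: in.Config, hash: newHash}
	}
}

func main() {
	running := map[string]*Unit{}
	cfg := []Unit{{ID: "elasticsearch-node-1", Config: map[string]interface{}{"period": "10s"}}}
	applyConfig(running, cfg) // first load: unit starts
	applyConfig(running, cfg) // identical reload: no restart, scrape interval preserved
}
```

Whether that comparison happens at the unit level or deeper in the reloader doesn't matter much; the key property is that an identical config becomes a no-op instead of a restart.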
@elastic/stack-monitoring do we have a synthetic benchmark for the elasticsearch metricbeat module? It would help a lot, as reproducing the environment this issue came up in is a huge pain.
@swiatekm Honestly not to my knowledge, but since we've taken over SM very recently, maybe @miltonhultgren has a more insightful answer to give you.
👋🏼
- As far as I know, we don't have any benchmarking (synthetic or otherwise) for the Stack Monitoring modules
- "The metricbeat elasticsearch module itself is, unsurprisingly, using a good chunk of memory. We can probably optimize that." this, a 100 times this. We are well aware of a few patterns in those modules that really use more memory than is needed, a lot of time is spent shuffling JSON in and out of very ineffective structures. It's honestly good that we don't have benchmarks because they would be horrible and in the past we didn't ever have the resources to try to optimize this. Much is likely low hanging fruit in terms of complexity, it's just the time effort.