
Memory Consumption Issue with Elastic Agent on Kubernetes with a high number of resources #5991

Open
rhr323 opened this issue Nov 11, 2024 · 5 comments
Labels: bug (Something isn't working), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

@rhr323 commented Nov 11, 2024

Issue Summary

In our testing on the serverless platform, we aimed to assess the maximum number of projects that can be supported on a single MKI cluster. We were using Elastic Agent version 8.15.4-SNAPSHOT to mitigate previously identified memory issues.

Most Elastic Agent instances functioned without issues. However, on nodes hosting vector search projects, where a larger number of Elasticsearch instances and their associated Kubernetes resources (e.g., pods, deployments, services, secrets) are allocated, we observed the Elastic Agent running out of memory. This typically occurred when these nodes were hosting around 100 Elasticsearch instances.

Observed Behavior

  • Elastic Agent on high-density nodes (around 100 Elasticsearch instances) experienced memory exhaustion and got stuck in a crash loop.
  • Diagnostic data was collected from an Elastic Agent on a node with ~70 allocated projects at the time of capture.

Environment

  • Elastic Agent version: 8.15.4-SNAPSHOT
  • Kubernetes environment: Serverless platform, MKI cluster
  • Node allocation: ~100 Elasticsearch instances per node for vector search projects
rhr323 added the bug (Something isn't working) label on Nov 11, 2024
cmacknz added the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) label on Nov 12, 2024
@elasticmachine (Contributor)

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@swiatekm (Contributor)

Having looked at the diagnostics and telemetry for one of the agent Pods encountering this problem, my initial thoughts are as follows:

  1. The metricbeat elasticsearch module itself is, unsurprisingly, using a good chunk of memory. We can probably optimize that.
  2. There's a lot of configuration churn on the elastic-agent side caused by the kubernetes variable provider, and that's most likely what leads to the agent itself using more memory than expected. I'm tackling that in Elastic agent uses too much memory per Pod in k8s #5835.
  3. In the beats themselves, needing to reload configuration frequently also adds to memory consumption and can cause other kinds of disruption (if we have a scraper that is supposed to fetch metrics every 10 seconds, but we reload its config every 5 seconds, then we're effectively scraping every 5 seconds instead). If the beats config manager could avoid restarting units it doesn't need to restart, this effect would be mitigated; see the sketch after this list.
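As a rough illustration of that last point, here is a minimal sketch of config diffing, using hypothetical unit and config types rather than the actual Beats config manager API: only units whose configuration fingerprint changed are restarted, so an unchanged scraper keeps its 10-second cadence across reloads.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// unit is a hypothetical stand-in for a running input/output unit; the real
// Beats config manager uses its own types.
type unit struct {
	ID     string
	Config map[string]any
}

// hash produces a stable fingerprint of a unit's configuration.
// encoding/json sorts map keys, so the fingerprint is deterministic for this sketch.
func hash(cfg map[string]any) [32]byte {
	b, _ := json.Marshal(cfg)
	return sha256.Sum256(b)
}

// applyConfig restarts only the units whose configuration actually changed,
// leaving untouched units running so their scrape schedule is not reset.
func applyConfig(running map[string]unit, incoming []unit) {
	seen := make(map[string]bool, len(incoming))
	for _, next := range incoming {
		seen[next.ID] = true
		cur, ok := running[next.ID]
		switch {
		case !ok:
			fmt.Println("start unit", next.ID)
			running[next.ID] = next
		case hash(cur.Config) != hash(next.Config):
			fmt.Println("restart unit", next.ID)
			running[next.ID] = next
		default:
			// config unchanged: keep the unit (and its scrape timer) untouched
		}
	}
	for id := range running {
		if !seen[id] {
			fmt.Println("stop unit", id)
			delete(running, id)
		}
	}
}

func main() {
	running := map[string]unit{}
	applyConfig(running, []unit{{ID: "elasticsearch/metrics", Config: map[string]any{"period": "10s"}}})
	// a second reload with identical config: nothing restarts, the scraper keeps its cadence
	applyConfig(running, []unit{{ID: "elasticsearch/metrics", Config: map[string]any{"period": "10s"}}})
}
```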

@swiatekm (Contributor)

@elastic/stack-monitoring do we have a synthetic benchmark for the elasticsearch metricbeat module? It would help a lot, as reproducing the environment this issue came up in is a huge pain.
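For context, even something minimal would do: feed a canned API response through the module's mapping code under `go test -bench`. The `eventsMapping` function and fixture path below are hypothetical placeholders, not the module's actual API.

```go
package elasticsearch_test

import (
	"encoding/json"
	"os"
	"testing"
)

// eventsMapping is a hypothetical stand-in for the module's mapping code; a real
// benchmark would call the actual function that turns an Elasticsearch API
// response into metricbeat events.
func eventsMapping(raw []byte) (map[string]any, error) {
	var body map[string]any
	if err := json.Unmarshal(raw, &body); err != nil {
		return nil, err
	}
	return body, nil
}

func BenchmarkEventsMapping(b *testing.B) {
	// testdata/node_stats.json would be a response captured once from a real cluster.
	fixture, err := os.ReadFile("testdata/node_stats.json")
	if err != nil {
		b.Skip("fixture not available:", err)
	}
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := eventsMapping(fixture); err != nil {
			b.Fatal(err)
		}
	}
}
```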

@consulthys (Contributor)

@swiatekm Honestly, not to my knowledge, but since we've only taken over Stack Monitoring (SM) very recently, maybe @miltonhultgren has a more insightful answer for you.

@miltonhultgren commented Nov 14, 2024

👋🏼

  1. As far as I know, we don't have any benchmarking (synthetic or otherwise) for the Stack Monitoring modules.
  2. "The metricbeat elasticsearch module itself is, unsurprisingly, using a good chunk of memory. We can probably optimize that." This, a hundred times this. We are well aware of a few patterns in those modules that use far more memory than necessary; a lot of time is spent shuffling JSON in and out of very inefficient structures (see the sketch at the end of this comment). Frankly, it's just as well that we don't have benchmarks, because the numbers would be horrible, and in the past we never had the resources to try to optimize this. Much of it is likely low-hanging fruit in terms of complexity; it just takes time.

elastic/beats#33862
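To make the JSON-shuffling point from item 2 concrete, here is an illustrative comparison (not the modules' actual code, and a made-up miniature payload) of decoding into generic maps versus decoding into a small typed struct:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// A made-up miniature payload standing in for an Elasticsearch stats response.
var raw = []byte(`{"indices":{"docs":{"count":42},"store":{"size_in_bytes":1024}}}`)

// The pattern described above (illustrative): unmarshal the whole response into
// generic maps, then walk and copy the pieces we need, allocating intermediate
// structures along the way.
func mapShuffle() map[string]any {
	var body map[string]any
	_ = json.Unmarshal(raw, &body)
	indices := body["indices"].(map[string]any)
	docs := indices["docs"].(map[string]any)
	store := indices["store"].(map[string]any)
	return map[string]any{
		"docs.count":          docs["count"],
		"store.size_in_bytes": store["size_in_bytes"],
	}
}

// A leaner alternative: decode only the fields the metricset reports into a
// typed struct, so unrelated parts of the response are skipped by the decoder
// instead of being materialized as maps and interface values.
type indicesStats struct {
	Indices struct {
		Docs struct {
			Count int64 `json:"count"`
		} `json:"docs"`
		Store struct {
			SizeInBytes int64 `json:"size_in_bytes"`
		} `json:"store"`
	} `json:"indices"`
}

func typedDecode() indicesStats {
	var s indicesStats
	_ = json.Unmarshal(raw, &s)
	return s
}

func main() {
	fmt.Println(mapShuffle())
	fmt.Println(typedDecode().Indices.Docs.Count)
}
```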
