
Memory Consumption Issue with Elastic Agent on Kubernetes with a high number of resources #5991

Open
rhr323 opened this issue Nov 11, 2024 · 5 comments
Labels: bug (Something isn't working), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

@rhr323 commented Nov 11, 2024

Issue Summary

In our testing on the serverless platform, we aimed to assess the maximum number of projects that can be supported on a single MKI cluster. We were using Elastic Agent version 8.15.4-SNAPSHOT to mitigate previously identified memory issues.

Most Elastic Agent instances functioned without issues. However, on nodes hosting vector search projects, where a larger number of Elasticsearch instances and their associated Kubernetes resources (e.g., pods, deployments, services, secrets) are allocated, we observed the Elastic Agent running out of memory. This typically occurred when these nodes were hosting around 100 Elasticsearch instances.

Observed Behavior

  • Elastic Agent on high-density nodes (around 100 Elasticsearch instances) experienced memory exhaustion and got stuck in a crash loop.
  • Diagnostic data was collected from an Elastic Agent on a node with ~70 allocated projects at the time of capture.

Environment

  • Elastic Agent version: 8.15.4-SNAPSHOT
  • Kubernetes environment: Serverless platform, MKI cluster
  • Node allocation: ~100 Elasticsearch instances per node for vector search projects
rhr323 added the bug (Something isn't working) label on Nov 11, 2024
cmacknz added the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) label on Nov 12, 2024
@elasticmachine (Contributor)

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@swiatekm (Contributor)

Having looked at the diagnostics and telemetry for one of the agent Pods encountering this problem, my initial thoughts are as follows:

  1. The metricbeat elasticsearch module itself is, unsurprisingly, using a good chunk of memory. We can probably optimize that.
  2. There's a lot of configuration churn on the elastic-agent side caused by the kubernetes variable provider, and that's most likely what leads to the agent itself using more memory than expected. I'm tackling that in Elastic agent uses too much memory per Pod in k8s #5835.
  3. In the beats themselves, needing to reload configuration frequently also adds to memory consumption and can cause other kinds of disruption (if we have a scraper that is supposed to fetch metrics every 10 seconds, but we reload its config every 5 seconds, then we're effectively scraping every 5 seconds instead). If the beats config manager could avoid restarting units it doesn't need to restart, this effect would be mitigated; see the sketch after this list.
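As a rough illustration of that last point, here is a minimal sketch of config diffing, using hypothetical unit and config types rather than the actual Beats config manager API: only units whose configuration fingerprint changed are restarted, so an unchanged scraper keeps its 10-second cadence across reloads.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// unit is a hypothetical stand-in for a running input/output unit; the real
// Beats config manager uses its own types.
type unit struct {
	ID     string
	Config map[string]any
}

// hash produces a stable fingerprint of a unit's configuration.
// encoding/json sorts map keys, so the fingerprint is deterministic for this sketch.
func hash(cfg map[string]any) [32]byte {
	b, _ := json.Marshal(cfg)
	return sha256.Sum256(b)
}

// applyConfig restarts only the units whose configuration actually changed,
// leaving untouched units running so their scrape schedule is not reset.
func applyConfig(running map[string]unit, incoming []unit) {
	seen := make(map[string]bool, len(incoming))
	for _, next := range incoming {
		seen[next.ID] = true
		cur, ok := running[next.ID]
		switch {
		case !ok:
			fmt.Println("start unit", next.ID)
			running[next.ID] = next
		case hash(cur.Config) != hash(next.Config):
			fmt.Println("restart unit", next.ID)
			running[next.ID] = next
		default:
			// config unchanged: keep the unit (and its scrape timer) untouched
		}
	}
	for id := range running {
		if !seen[id] {
			fmt.Println("stop unit", id)
			delete(running, id)
		}
	}
}

func main() {
	running := map[string]unit{}
	applyConfig(running, []unit{{ID: "elasticsearch/metrics", Config: map[string]any{"period": "10s"}}})
	// a second reload with identical config: nothing restarts, the scraper keeps its cadence
	applyConfig(running, []unit{{ID: "elasticsearch/metrics", Config: map[string]any{"period": "10s"}}})
}
```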

@swiatekm (Contributor)

@elastic/stack-monitoring do we have a synthetic benchmark for the elasticsearch metricbeat module? It would help a lot, as reproducing the environment this issue came up in is a huge pain.
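For context, even something minimal would do: feed a canned API response through the module's mapping code under `go test -bench`. The `eventsMapping` function and fixture path below are hypothetical placeholders, not the module's actual API.

```go
package elasticsearch_test

import (
	"encoding/json"
	"os"
	"testing"
)

// eventsMapping is a hypothetical stand-in for the module's mapping code; a real
// benchmark would call the actual function that turns an Elasticsearch API
// response into metricbeat events.
func eventsMapping(raw []byte) (map[string]any, error) {
	var body map[string]any
	if err := json.Unmarshal(raw, &body); err != nil {
		return nil, err
	}
	return body, nil
}

func BenchmarkEventsMapping(b *testing.B) {
	// testdata/node_stats.json would be a response captured once from a real cluster.
	fixture, err := os.ReadFile("testdata/node_stats.json")
	if err != nil {
		b.Skip("fixture not available:", err)
	}
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := eventsMapping(fixture); err != nil {
			b.Fatal(err)
		}
	}
}
```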

@consulthys (Contributor)

@swiatekm Honestly, not to my knowledge, but since we've only taken over Stack Monitoring (SM) very recently, maybe @miltonhultgren has a more insightful answer for you.

@miltonhultgren commented Nov 14, 2024

👋🏼

  1. As far as I know, we don't have any benchmarking (synthetic or otherwise) for the Stack Monitoring modules.
  2. "The metricbeat elasticsearch module itself is, unsurprisingly, using a good chunk of memory. We can probably optimize that." This, a hundred times this. We are well aware of a few patterns in those modules that use far more memory than necessary; a lot of time is spent shuffling JSON in and out of very inefficient structures (see the sketch at the end of this comment). Frankly, it's just as well that we don't have benchmarks, because the numbers would be horrible, and in the past we never had the resources to try to optimize this. Much of it is likely low-hanging fruit in terms of complexity; it just takes time.

elastic/beats#33862
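To make the JSON-shuffling point from item 2 concrete, here is an illustrative comparison (not the modules' actual code, and a made-up miniature payload) of decoding into generic maps versus decoding into a small typed struct:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// A made-up miniature payload standing in for an Elasticsearch stats response.
var raw = []byte(`{"indices":{"docs":{"count":42},"store":{"size_in_bytes":1024}}}`)

// The pattern described above (illustrative): unmarshal the whole response into
// generic maps, then walk and copy the pieces we need, allocating intermediate
// structures along the way.
func mapShuffle() map[string]any {
	var body map[string]any
	_ = json.Unmarshal(raw, &body)
	indices := body["indices"].(map[string]any)
	docs := indices["docs"].(map[string]any)
	store := indices["store"].(map[string]any)
	return map[string]any{
		"docs.count":          docs["count"],
		"store.size_in_bytes": store["size_in_bytes"],
	}
}

// A leaner alternative: decode only the fields the metricset reports into a
// typed struct, so unrelated parts of the response are skipped by the decoder
// instead of being materialized as maps and interface values.
type indicesStats struct {
	Indices struct {
		Docs struct {
			Count int64 `json:"count"`
		} `json:"docs"`
		Store struct {
			SizeInBytes int64 `json:"size_in_bytes"`
		} `json:"store"`
	} `json:"indices"`
}

func typedDecode() indicesStats {
	var s indicesStats
	_ = json.Unmarshal(raw, &s)
	return s
}

func main() {
	fmt.Println(mapShuffle())
	fmt.Println(typedDecode().Indices.Docs.Count)
}
```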
