Kubernetes metadata overwhelms memory limits in the Agent process #4729
Comments
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
Possibly related: an increase starting in 8.14.0 was detected by the ECK integration tests, #4730 |
FWIW the diagnostics described by this issue were from 8.13.3. |
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
After chatting with @cmacknz and @pierrehilbert, assigning this to you @faec and making it a high priority for the next sprint. |
cc @gizas |
Agent's variable provider API is very opaque, which is probably a big part of this. @bturquet / @gizas, if we add hooks to the variable provider API for the Coordinator to give a list of possible variables, what work would be needed to restrict Kubernetes queries to those variables? |
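To make the proposal concrete, here is a purely illustrative sketch of such a hook; the package, interface, and method names below are hypothetical and not part of the actual Agent API:

```go
// Hypothetical sketch only; not actual elastic-agent code.
package composable

// VariableAwareProvider could be implemented by providers (e.g. the Kubernetes
// provider) that are able to limit the data they fetch and cache to the
// variables actually referenced by the policy.
type VariableAwareProvider interface {
	// SetReferencedVariables would be called by the Coordinator whenever the
	// policy changes, with the variable paths it may substitute, e.g.
	// "kubernetes.pod.name" or "kubernetes.labels.app".
	SetReferencedVariables(paths []string)
}
```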
@faec I'm trying to understand how we can combine those pieces. So let's say the parsing changes and there is a list of variables that the provider will need to populate. The other metadata enrichment we do with enrichers is unrelated to the flow you describe here. Maybe we can sync offline so I can understand more about this? |
Hi all, I was looking at this, and I wanted to know whether we apply any filtering to the k8s metadata we receive. |
Hello all @faec @ycombinator |
Fae is currently on PTO and unfortunately hasn't had time to investigate this yet. |
We are facing this issue too. We see the elastic agents hitting the current memory limit of 1200Mi. I would greatly appreciate it if this topic could be given higher priority, as it is quite annoying to see the agents using that much memory. |
Hello @rgarcia89, |
FWIW, I think we could apply some meaningful transformers in the informers. We did something very similar in our mki-cost-exporter project: https://github.com/nimdanitro/mki-cost-exporter/blob/feat/poc/pkg/costmeter/meter.go#L124C14-L124C26 Obviously, we could ignore a large portion of the information for our specific use case. |
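To make the transformer idea concrete, here is a rough sketch (not taken from the Agent codebase, and assuming client-go v0.26+ for informers.WithTransform) of how an informer transform can drop fields before objects ever reach the informer cache, similar in spirit to the linked mki-cost-exporter code:

```go
package informerexample

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

// newFilteredFactory builds a shared informer factory whose cached objects
// have unneeded fields removed before they are stored.
func newFilteredFactory(kubeconfig string) (informers.SharedInformerFactory, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}

	// The transform runs on every object before it enters the informer cache,
	// so anything dropped here never occupies steady-state memory.
	stripFields := func(obj interface{}) (interface{}, error) {
		if accessor, err := meta.Accessor(obj); err == nil {
			accessor.SetManagedFields(nil) // usually one of the largest metadata fields
		}
		if rs, ok := obj.(*appsv1.ReplicaSet); ok {
			// Autodiscovery only needs metadata, not the full pod template.
			rs.Spec.Template = corev1.PodTemplateSpec{}
		}
		return obj, nil
	}

	return informers.NewSharedInformerFactoryWithOptions(
		client, 0, informers.WithTransform(cache.TransformFunc(stripFields)),
	), nil
}
```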
Hey team, any update on this issue? It's worrisome that it has been acknowledged as a high priority but there have been no updates for months. Can we please prioritize this? We need to get the agent footprint down as much as possible, since provisioning 4 GB of memory reduces the RAM left available for customer workloads. |
It's still prioritized. Unfortunately there were other more urgent matters that we're still wrapping up. |
@faec any updates from your part? |
@faec There is one issue that I filed a while ago that I think would help reduce memory usage in the case where a specific provider is not even being used: #3609. With that change, unless the policy references a provider, there is no reason for it to be running at all.
Using the same logic, this could build on your idea of recording exactly which variables will be referenced from the policy. The variable storage used by the composable module could then use that information to store only what is needed, without having to change the providers at all (it could simply drop the fields that are not needed). The problem is the case where the policy starts referencing a new variable whose information has already been dropped, even though the provider had supplied it.
This is where I do believe the providers will need to be given the list of variables referenced in the policy. That would allow them to do only the minimal work required, and to notice when a new variable is added so they can push an update to the variable storage and make that variable's information available again. |
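As a rough illustration of the storage-side filtering described in the comment above (a hypothetical helper, not existing Agent code), pruning a provider's variable mapping against the set of referenced paths could look something like this:

```go
package composable

import "strings"

// pruneMapping keeps only the branches of vars whose dotted path leads to
// (or sits underneath) one of the referenced variable paths.
func pruneMapping(vars map[string]interface{}, referenced []string, prefix string) map[string]interface{} {
	out := map[string]interface{}{}
	for key, val := range vars {
		path := key
		if prefix != "" {
			path = prefix + "." + key
		}
		if !pathReferenced(path, referenced) {
			continue // drop fields the policy never references
		}
		if child, ok := val.(map[string]interface{}); ok {
			out[key] = pruneMapping(child, referenced, path)
		} else {
			out[key] = val
		}
	}
	return out
}

// pathReferenced reports whether path is on the way to, or contained in, any
// referenced variable path. (Dot-boundary handling is omitted for brevity.)
func pathReferenced(path string, referenced []string) bool {
	for _, ref := range referenced {
		if strings.HasPrefix(ref, path) || strings.HasPrefix(path, ref) {
			return true
		}
	}
	return false
}
```

With referenced set to []string{"kubernetes.pod.name"}, only the kubernetes.pod.name branch of a provider's mapping would survive, which matches the trade-off described above: anything dropped this way has to be re-provided if the policy later starts referencing it.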
I'm running into some memory issues with Elastic Agent 8.15. It's running on Kubernetes, and we limit the memory to 700Mi in the manifest file in Kibana. However, when enabling the system metrics + Kubernetes integrations, the process keeps crashing and almost no data comes in. When I raise the limit to 800Mi, it runs stably. This seems related to this GH issue. Here are my test results:
- Elastic Agent 8.15.0 (only system metrics integration), limit 700Mi
- Elastic Agent 8.15.0 (system metrics + Kubernetes integrations), limit 700Mi
- Elastic Agent 8.15.0 (system metrics + Kubernetes integrations), limit 800Mi
This setup is being used for (marketing) workshops and it's not a great look to ask attendees to increase the memory limit when the Elastic Agent only uses 2 integrations. |
We ran some scaling tests in the past that propose a resource configuration (based on 8.7) as a reference point for comparison. At the moment the @elastic/obs-ds-hosted-services focus is the OTel-native Kubernetes collection of logs/metrics, and we have no plans to run scaling tests for Elastic Agent + integrations (cc @mlunadia) in the current iteration. We can wait and see the OTel Elastic Agent memory consumption with the latest config, and also check the current resourcing of Elastic Agent with the system + k8s integrations. |
This issue occurs even with very small workloads, so it's not really about scale testing. It is reproducible on a single-node k8s cluster with 26 total pods running. |
Posting the results of my initial investigation. For now, I'm inclined to agree with Michael's conclusion in https://github.com/elastic/sdh-beats/issues/5148#issuecomment-2352771442 that there isn't a regression here. Still, the increase in memory usage from adding more Pods to the Node seems excessive, but it's not clear where it's coming from.
Test setup
Findings
|
I would also like to post some results here, based on Luca's comment about the OOM in small workloads. I ran some tests on multiple versions of Elastic Agent and want to share the results. I used a single-node cluster in GKE with 38 pods running. Here are the results of Elastic Agent's memory consumption per version:
- Version 8.15.1
- Version 8.14.0
- Version 8.13.0
- Version 8.12.0
- Version 8.11.0
The easy thing to notice here is that the increase in memory that the Kubernetes integration causes in Elastic Agent is almost constant across versions: around 300-350 MB. It actually got better after some improved handling of metadata enrichment from 8.14.0 onwards. Another thing to note is that even without the Kubernetes integration installed, the Kubernetes provider is still running. I would like to understand @faec's comment better. How was this measured? With or without the Kubernetes integration? Which version? |
@MichaelKatsoulis is this with agent monitoring enabled? I got the container memory usage to ~50Mi after disabling that, with only the elastic-agent binary running in the container. But this still increased to ~90 Mi after starting more Pods. |
Yes it is enabled. I kept all the defaults. If disabled, memory consumption with just the binary running is around what you mentioned. |
The jump in 8.14.0 is because of agentbeat, see #4730 |
The elastic-agent pod is using 4GB of RAM. Pods on that host: https://gist.github.com/henrikno/27c4165cd7eec7b3a24c424d8a8dad23, ps aux: https://gist.github.com/henrikno/92634f31dd8a3795ff1ec81b34dc1bf8; elastic-agent is using 2.2GB, and the largest metricbeat (kubernetes-metrics) 1.6GB. It sounds a bit similar to topfreegames/maestro#473, where updates from k8s come in faster than they can be processed, so they get buffered somewhere in memory. |
Looking at the profile supplied by @henrikno, this anomalous memory consumption is caused by storing ReplicaSet data. @neiljbrookes confirmed on Slack that the K8s clusters in question have a lot of Deployments, and consequently ReplicaSets. For example, we have ~7000 Deployments and ~75000 ReplicaSets in a particularly troublesome cluster. The heap profile shows ~700 MB of steady-state memory usage, which comes out to around 10KB per ReplicaSet, which is a reasonable value. The Agents going OOM was mitigated by setting
I'm planning to submit a fix shortly that will cause us to store only the necessary data. Stopping the churn is going to be a bit more challenging, but I think we should be able to solve it by only subscribing to metadata changes from these ReplicaSets. This will be more challenging to integrate into our autodiscovery framework, but is also less urgent.
Worth noting that I don't believe this is the problem causing unexpected agent memory consumption on Nodes with a lot of Pods, even in small clusters. |
@swiatekm is the replicasetWatcher enabled by hand in the kubernetes provider you are using? The only way the replicasetWatcher gets enabled by default is if you are collecting state_replicaset metrics. |
@MichaelKatsoulis The SRE team have deployment metadata enabled in the kubernetes provider:
This enables the ReplicaSet watcher. Like I said earlier, I don't think this is the root cause of the increased memory utilization on Nodes with large numbers of Pods. |
I moved the ReplicaSet problem to #5623, as it's confirmed and relatively straightforward to address. Will keep troubleshooting the excess memory usage on Nodes with lots of Pods in this issue. |
…cts (#109) We only use metadata from Jobs and ReplicaSets, but require that full resources are supplied. This change relaxes that requirement, allowing PartialObjectMetadata resources to be used. This allows callers to use metadata informers and avoid having to receive and deserialize non-metadata updates from the API Server. See elastic/elastic-agent#5580 for an example of how this could be used. I'm planning to add the metadata informer from that PR to this library as well. Together, these will allow us to greatly reduce the memory used for processing and storing ReplicaSets and Jobs in beats and elastic-agent. This will help elastic/elastic-agent#5580 and elastic/elastic-agent#4729 specifically, and elastic/elastic-agent#3801 in general.
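For context, here is a minimal sketch of the metadata-informer approach referenced above, using client-go's metadata client and metadatainformer packages (a generic example, not the library's actual code). The API server sends only PartialObjectMetadata objects, so full ReplicaSets are never received or deserialized:

```go
package metadatawatch

import (
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/metadata/metadatainformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// watchReplicaSetMetadata starts a metadata-only watch on ReplicaSets.
func watchReplicaSetMetadata(cfg *rest.Config, stop <-chan struct{}) error {
	client, err := metadata.NewForConfig(cfg)
	if err != nil {
		return err
	}

	factory := metadatainformer.NewSharedInformerFactory(client, 10*time.Minute)
	gvr := appsv1.SchemeGroupVersion.WithResource("replicasets")
	informer := factory.ForResource(gvr).Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			// Only metadata (name, namespace, labels, owner references, ...)
			// is available; the spec and status are never fetched.
			m := obj.(*metav1.PartialObjectMetadata)
			fmt.Printf("replicaset %s/%s owned by %v\n", m.Namespace, m.Name, m.OwnerReferences)
		},
	})

	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	return nil
}
```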
Several different problems impacting agent memory consumption have been discussed in this issue and some of the linked issues. I'd like to summarize the current state and work towards closing this in favor of more specific sub-issues.
If there's anything I missed, please let me know. Once I open an issue for 3, I'd like to close this one. |
Sounds good to me, thanks for getting to the bottom of this. |
I've moved the per-Pod memory issue to #5835. I'm going to close this one to keep the discussion focused. Feel free to reopen if you believe you're facing an issue different than the ones listed in #4729 (comment). If you want to verify if the fixes address your specific problem, you can use the current snapshot build for any branch. |
Diagnostics from production Agents running on Kubernetes show that most of this memory usage comes from elastic-agent-autodiscover, and the other 20% is from helpers internal to elastic-agent. We need to understand why the Kubernetes helpers are using so much memory, and find a way to mitigate it.
Definition of done