
Kubernetes metadata overwhelms memory limits in the Agent process #4729

Closed · 3 tasks
faec opened this issue May 9, 2024 · 35 comments
Assignees
Labels
bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@faec
Contributor

faec commented May 9, 2024

Diagnostics from production Agents running on Kubernetes show:

  • The elastic-agent process itself uses more memory than all of its configured inputs combined.
  • Within the elastic-agent process, more than 90% of memory use is in Kubernetes helpers: roughly 70% from elastic-agent-autodiscover and the other 20% from helpers internal to elastic-agent.

We need to understand why the Kubernetes helpers are using so much memory, and find a way to mitigate it.

Definition of done

  • Provide steps for a reproducible setup that can demonstrate the aforementioned memory usage with an Agent diagnostic
  • Attach Agent diagnostic to this issue to use as a baseline, so we can compare against it when improvements are made
  • Reduce memory use by Kubernetes helpers from 90% to TBD% (exact target to be determined after further investigation)
@faec faec added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels May 9, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz
Member

cmacknz commented May 9, 2024

Possibly related: an increase starting in 8.14.0 was detected by the ECK integration tests #4730

@faec
Contributor Author

faec commented May 16, 2024

Possibly related: an increase starting in 8.14.0 was detected by the ECK integration tests

FWIW the diagnostics described by this issue were from 8.13.3.

@jlind23 jlind23 added Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team and removed Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels May 21, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@jlind23
Contributor

jlind23 commented May 21, 2024

After chatting with @cmacknz and @pierrehilbert, assigning this to you @faec and making it a high priority for the next sprint.

@bturquet

cc @gizas

@faec
Contributor Author

faec commented May 22, 2024

Agent's variable provider API is very opaque, which is probably a big part of this. Agent's Coordinator doesn't put any constraints on what variables might be requested, so the Kubernetes helpers make (and cache) very large / verbose state queries. #2887 is related -- a possible Agent-side solution is to implement better policy parsing that validates the full configuration and gives variable providers like Kubernetes a list of the variables that are actually used.

@bturquet / @gizas, if we add hooks to the variable provider API for the Coordinator to give a list of possible variables, what work would be needed to restrict Kubernetes queries to those variables?
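To make that concrete, here is a minimal Go sketch of what such a hook could look like; the names (VariableFilter, SetVariableFilter, applyFilter) are hypothetical and only illustrate the shape of the API, not anything that exists in Agent today:

    package sketch

    // VariableFilter maps a provider name ("kubernetes", "env", ...) to the
    // variable paths the current policy actually references, e.g.
    // "pod.name" or "pod.labels.app".
    type VariableFilter map[string][]string

    // FilterableProvider is what a context provider could implement in order
    // to receive the list of variables it needs to populate.
    type FilterableProvider interface {
        // SetVariableFilter is called whenever the policy changes, with the
        // variable paths referenced for this provider.
        SetVariableFilter(paths []string)
    }

    // applyFilter is roughly what the Coordinator side would do after parsing
    // the policy: hand each provider its slice of referenced variables.
    // Providers that are not referenced at all get an empty list and could
    // stop their watchers entirely.
    func applyFilter(providers map[string]interface{}, filter VariableFilter) {
        for name, p := range providers {
            if fp, ok := p.(FilterableProvider); ok {
                fp.SetVariableFilter(filter[name])
            }
        }
    }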

@gizas
Contributor

gizas commented May 23, 2024

@faec I'm trying to understand how we can combine those pieces. Let's say the parsing changes and there is a list of variables that the provider needs to populate.
In the kubernetes provider here we start the watchers, but with general arguments.

The other metadata enrichment we do with enrichers is, again, unrelated to the flow you describe here.

Maybe we can sync offline so I can understand this better?

cc @MichaelKatsoulis

@alexsapran
Contributor

Hi all,

I was looking at this, and I wanted to know whether we apply any filtering to the metadata we receive from k8s.
Does all of it need to be cached in the local k8s cache? I'd like to know if we can apply a transformation to nullify some of the fields and keep only the ones we care about; that way, the RSS memory of the Elastic Agent would hold only the data we care about and would not be influenced by the size of the k8s cluster.

@neiljbrookes

Hello all @faec @ycombinator
Is there any update on this issue? I am planning an upgrade to 8.14.1 this week; do we anticipate any improvements?

@pierrehilbert
Contributor

Fae is currently on PTO and unfortunately hasn't had time to investigate this yet.
It is planned for the current sprint (which started today).

@rgarcia89

We are facing this issue too. We see the elastic agents hitting the current memory limit of 1200Mi. I would greatly appreciate it if this topic could be given higher priority, as it is quite annoying to see the agents using that much memory.

@pierrehilbert
Contributor

Hello @rgarcia89,
This topic has a high priority but, as you can imagine, it is not the only one.
@faec will start looking at this soon, so I hope we will be able to share good news shortly.

@nimdanitro

FWIW, I think we could apply some meaningful transformers in the informers. We did something very similar in our mki-cost-exporter project: https://github.com/nimdanitro/mki-cost-exporter/blob/feat/poc/pkg/costmeter/meter.go#L124C14-L124C26
Here is an example of the cache.TransformFunc which we set on our informers: https://github.com/nimdanitro/mki-cost-exporter/blob/feat/poc/pkg/informers/transformer/transform.go#L34

Obviously, we could ignore a large portion of the information for our specific use case.
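For reference, client-go's cache.TransformFunc runs on every object before it is stored in the informer's cache, which is where heavy fields could be dropped. A minimal sketch, assuming Pod objects; the specific fields cleared here are examples only, not what Agent or the linked project actually does:

    package sketch

    import (
        v1 "k8s.io/api/core/v1"
        "k8s.io/client-go/tools/cache"
    )

    // stripPodFields drops fields that autodiscovery never reads, before the
    // object is stored in the informer's local cache. Which fields are safe
    // to clear depends on what is consumed downstream; managedFields is the
    // usual first candidate since it is large and rarely needed.
    func stripPodFields(obj interface{}) (interface{}, error) {
        pod, ok := obj.(*v1.Pod)
        if !ok {
            return obj, nil
        }
        pod.SetManagedFields(nil) // server-side-apply bookkeeping, often the largest chunk
        pod.Spec.Volumes = nil    // example only: safe if nothing reads volume info
        return pod, nil
    }

    // registerTransform attaches the transform to an informer before it is started.
    func registerTransform(inf cache.SharedIndexInformer) error {
        return inf.SetTransform(cache.TransformFunc(stripPodFields))
    }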

@yuvielastic

Hey team, any update on this issue? It's worrisome that it has been acknowledged as a high priority but there have been no updates for months.

Can we please prioritize this? We need to get the agent footprint down as much as possible, since provisioning 4 GB of memory reduces the usable RAM available for customer workloads.

@amitkanfer
Contributor

It's still prioritized. Unfortunately there were other more urgent matters that we're still wrapping up.

@zez3

zez3 commented Aug 22, 2024

@faec any updates from your part?

@blakerouse
Contributor

@faec There is one issue that I filed a while ago that I think would help reduce memory usage in the case where a specific provider is not even being used - #3609. With that change, unless the policy references a provider, there is no reason for it to be running at all.

Using the same logic, this could build on your idea of recording exactly which variables are referenced from the policy. The variable storage system used by the composable module could then use that information to store only what is needed, without even changing the providers (it could just drop the fields that aren't needed).

The tricky case is when a policy starts referencing a new variable whose data has already been dropped, even though the provider had originally supplied it. This is where I believe the providers will need to be given the list of variables referenced in the policy. That will let them do only the minimal work required, and also notice when a new variable is added so they can push an update to the variable storage and make that information available.
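As a rough illustration of the storage-side filtering described above (the names are invented for illustration, not the composable module's actual types):

    package sketch

    // pruneUnreferenced keeps only the variable fields the policy references
    // before they are placed in the store. "vars" is a flattened mapping a
    // provider emitted (e.g. "kubernetes.pod.name" -> value); "referenced" is
    // the set of paths extracted from the policy.
    func pruneUnreferenced(vars map[string]interface{}, referenced map[string]struct{}) map[string]interface{} {
        pruned := make(map[string]interface{}, len(referenced))
        for path, value := range vars {
            if _, ok := referenced[path]; ok {
                pruned[path] = value
            }
        }
        return pruned
    }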

@EvelienSchellekens

I’m running into some memory issues with Elastic Agent 8.15. It’s running on Kubernetes, and we limit the memory to 700Mi in the manifest file in Kibana. However, when enabling the system metrics + Kubernetes integration, the process keeps crashing and I get almost no data in. When I raise the limit to 800Mi, it runs stable. This seems related to this GH issue.

Here are my test results:

Elastic Agent 8.15.0 (only system metrics integration), limit 700Mi:

NAME                  CPU(cores)   MEMORY(bytes)   
elastic-agent-hkfsw   21m          442Mi 

Elastic Agent 8.15.0 (system metrics integration + K8s integration), limit 700Mi:
-> keeps crashing, no data

NAME                  CPU(cores)   MEMORY(bytes)   
elastic-agent-hkfsw   236m         699Mi

Elastic Agent 8.15.0 (system metrics integration + K8s integration), limit 800Mi:
-> runs stable

NAME                  CPU(cores)   MEMORY(bytes)   
elastic-agent-dbzzm   52m          703Mi 

This setup is being used for (marketing) workshops and it's not a great look to ask attendees to increase the memory limit when the Elastic Agent only uses 2 integrations.

@gizas
Contributor

gizas commented Sep 12, 2024

We ran some scaling tests in the past that propose resource configurations (based on 8.7) as a reference point for comparison.

At the moment, @elastic/obs-ds-hosted-services is focused on OTel-native Kubernetes collection of logs/metrics, and we have no plans to run scaling tests for elastic agent + integrations (cc @mlunadia) in the current iteration.

We can wait and see the OTel elastic agent memory consumption with the latest config, and also check the current resourcing of elastic agent with the system + k8s integrations.

@ycombinator ycombinator assigned swiatekm and unassigned faec Sep 12, 2024
@ycombinator ycombinator added Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team and removed Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Sep 12, 2024
@LucaWintergerst

LucaWintergerst commented Sep 16, 2024

This issue occurs even with very small workloads, so it's not really about scale testing.

It is reproducible on a single-node k8s cluster with 26 total pods running.

@swiatekm
Contributor

swiatekm commented Sep 17, 2024

Posting the results of my initial investigation. For now, I'm inclined to agree with Michael's conclusion in https://github.com/elastic/sdh-beats/issues/5148#issuecomment-2352771442 that there isn't a regression here. Still, the increase in memory usage from adding more Pods to the Node seems excessive, and it's not clear where it's coming from.

Test setup

  • Single node KiND cluster, default settings.
  • Fleet-managed Agent installed as per the official instructions.
  • System and Kubernetes integrations with default settings (at least initially).
  • 98 Nginx Pods running in the cluster, producing no logs.

Findings

  • The memory increase does seem primarily related to the kubernetes variable provider. It can be reproduced even with all the data collection disabled in the Kubernetes integration.
  • Memory usage does appear to scale with the number of Pods running on the Node, even if those Pods aren't actually logging anything.
  • Since the amount of metadata from a single Node shouldn't be enough to cause this effect, I thought that maybe we were getting unnecessary var updates from the provider. But tweaking the debounce delay value didn't provide a measurable improvement.

@MichaelKatsoulis
Contributor

MichaelKatsoulis commented Sep 17, 2024

I would also like to post some results, based on Luca's comment about the OOM with small workloads. I ran tests on multiple versions of elastic agent and want to share the results.

I used a single node cluster in GKE with 38 pods running. Here are the results of Elastic Agent's memory consumption per version:

Version 8.15.1

  Integration           Memory consumption
  no integration        280-330 MB
  system                450-500 MB
  Kubernetes            550-600 MB
  Kubernetes & system   740-790 MB (restarts)

Version 8.14.0

  Integration           Memory consumption
  no integration        260-290 MB
  system                410-430 MB
  Kubernetes            550-570 MB
  Kubernetes & system   700-730 MB

Version 8.13.0

  Integration           Memory consumption
  no integration        200-210 MB
  system                320-330 MB
  Kubernetes            500-510 MB
  Kubernetes & system   630-650 MB

Version 8.12.0

  Integration           Memory consumption
  no integration        180-185 MB
  system                300-330 MB
  Kubernetes            480-520 MB
  Kubernetes & system   630-680 MB

Version 8.11.0

  Integration           Memory consumption
  no integration        169-190 MB
  system                300-310 MB
  Kubernetes            520-550 MB
  Kubernetes & system   660-720 MB (restart)

The easy thing to notice here is that the memory increase the Kubernetes Integration causes in Elastic Agent is almost constant across versions: around 300-350 MB. It actually got better after improved handling of metadata enrichment from 8.14.0 onwards.
Elastic Agent's memory consumption with no integration at all has increased over the version bumps, so with the Kubernetes and System integrations installed (System comes by default) it reached the set limit of 700 MB.
I don't know whether the 300 MB the kubernetes integration adds is a lot or not, but considering that the system integration, which does far less (no constant API calls to k8s), adds around 150 MB, I could argue it is reasonable.

Another thing to note is that even without the Kubernetes Integration installed, the Kubernetes provider and the add_kubernetes_metadata processor are still enabled by default. I took a look at the heap.pprof of such an agent, and Kubernetes-related functions seem to account for around 10% of memory use.

I would like to understand @faec's comment better:
Within the elastic-agent process, more than 90% of memory use is in Kubernetes helpers

How was this measured? With or without Kubernetes Integration? Which version?

@swiatekm
Contributor

@MichaelKatsoulis is this with agent monitoring enabled? I got the container memory usage to ~50Mi after disabling that, with only the elastic-agent binary running in the container. But this still increased to ~90 Mi after starting more Pods.

@MichaelKatsoulis
Contributor

@MichaelKatsoulis is this with agent monitoring enabled? I got the container memory usage to ~50Mi after disabling that, with only the elastic-agent binary running in the container. But this still increased to ~90 Mi after starting more Pods.

Yes it is enabled. I kept all the defaults. If disabled, memory consumption with just the binary running is around what you mentioned.

@cmacknz
Member

cmacknz commented Sep 17, 2024

Elastic Agent's memory consumption with no integration at all has increased over the version bumps

The jump in 8.14.0 is because of agentbeat, see #4730

@henrikno

henrikno commented Sep 20, 2024

The elastic-agent pod is using 4 GB of RAM. Pods on that host: https://gist.github.com/henrikno/27c4165cd7eec7b3a24c424d8a8dad23, ps aux: https://gist.github.com/henrikno/92634f31dd8a3795ff1ec81b34dc1bf8. elastic-agent is using 2.2 GB, and the largest metricbeat (kubernetes-metrics) 1.6 GB.

It sounds a bit similar to topfreegames/maestro#473, where the updates from k8s come in faster than they can be processed, so they get buffered somewhere in memory.

@swiatekm
Contributor

swiatekm commented Sep 20, 2024

Looking at the profile supplied by @henrikno, this anomalous memory consumption is caused by storing ReplicaSet data. @neiljbrookes confirmed on Slack that the K8s clusters in question have a lot of Deployments, and consequently ReplicaSets. For example, we have ~7000 Deployments and ~75000 ReplicaSets in a particularly troublesome cluster. The heap profile shows ~700 MB of steady-state memory usage, which comes out to around 10 KB per ReplicaSet, which is a reasonable value.

Image

The Agent OOMs were mitigated by setting GOGC to 25, which suggests that churn from excessive updates from the API Server is also part of the problem.
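For context, GOGC=25 makes the Go runtime trigger a collection once the heap grows 25% beyond the live set from the previous cycle, instead of the default 100%, trading extra GC CPU for a lower peak RSS. A minimal sketch of the programmatic equivalent, for anyone experimenting locally:

    package main

    import "runtime/debug"

    func init() {
        // Equivalent to running with GOGC=25: collect when the heap grows 25%
        // over the live set, instead of the default 100%. Lowers peak memory
        // at the cost of more frequent collections.
        debug.SetGCPercent(25)
    }

    func main() {}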

I'm planning to shortly submit a fix that makes us store only the necessary data. Stopping the churn is going to be a bit more challenging, but I think we should be able to solve it by subscribing only to metadata changes for these ReplicaSets. That will be harder to integrate into our autodiscovery framework, but is also less urgent.
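A minimal sketch of what that metadata-only subscription could look like with client-go's metadata informers; the resync interval and handler below are placeholder assumptions, not the actual implementation:

    package sketch

    import (
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/metadata"
        "k8s.io/client-go/metadata/metadatainformer"
        "k8s.io/client-go/rest"
        "k8s.io/client-go/tools/cache"
    )

    // watchReplicaSetMetadata watches ReplicaSets through the metadata client,
    // so the API server only sends PartialObjectMetadata objects (name,
    // namespace, labels, owner references) instead of full ReplicaSet specs.
    func watchReplicaSetMetadata(cfg *rest.Config, stop <-chan struct{}) (cache.SharedIndexInformer, error) {
        client, err := metadata.NewForConfig(cfg)
        if err != nil {
            return nil, err
        }

        factory := metadatainformer.NewSharedInformerFactory(client, 10*time.Minute)
        gvr := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "replicasets"}
        informer := factory.ForResource(gvr).Informer()

        informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            UpdateFunc: func(_, obj interface{}) {
                if m, ok := obj.(*metav1.PartialObjectMetadata); ok {
                    _ = m // only the metadata is delivered here, by design
                }
            },
        })

        factory.Start(stop)
        factory.WaitForCacheSync(stop)
        return informer, nil
    }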

Worth noting that I don't believe this is the problem causing unexpected agent memory consumption on Nodes with a lot of Pods, even in small clusters.

@MichaelKatsoulis
Contributor

@swiatekm is the replicasetWatcher enabled explicitly in the kubernetes provider you are using?
By default it is disabled by this setting, as part of the add_resource_metadata configuration.

The only way the replicasetWatcher is enabled by default is if you are using the state_replicaset metrics from the integration.

@swiatekm
Contributor

@MichaelKatsoulis The SRE team have deployment metadata enabled in the kubernetes provider:

    providers:
      kubernetes:
        node: ${NODE_NAME}
        scope: node
        hints.enabled: false
        kubernetes_secrets:
          enabled: true
        add_resource_metadata:
          deployment: true

This enables the ReplicaSet watcher.

Like I said earlier, I don't think this is the root cause of the increased memory utilization on Nodes with large numbers of Pods.

@swiatekm
Contributor

I moved the ReplicaSet problem to #5623, as it's confirmed and relatively straightforward to address. Will keep troubleshooting the excess memory usage on Nodes with lots of Pods in this issue.

@lepouletsuisse

I have the same issue in my small K8s cluster with 2 of my agents (I have ~10 agents in total). One agent runs alone (as a Deployment) for the integrations that don't need to run on all nodes, and I also have a DaemonSet of agents for other purposes (metrics, logs, etc.). This is Elastic Agent version 8.15.2.
Note that the single pod from the Deployment has the memory issue, but so does only 1 of the pods created by the DaemonSet (not all of them; this is probably related to node workload).
I bumped the memory request to 1Gb and the memory limit to 4Gb to debug, and found it interesting that memory increased a lot at the beginning but came back to a normal ~500Mb after ~10 minutes.
Image

I restarted the pod to check whether I would observe the same memory behaviour, and it behaved the same.
Image

I hope it can help to debug the issue!

swiatekm added a commit to elastic/elastic-agent-autodiscover that referenced this issue Oct 3, 2024
…cts (#109)

We only use metadata from Jobs and ReplicaSets, but require that full
resources are supplied. This change relaxes this requirement, allowing
PartialObjectMetadata resources to be used. This allows callers to use
metadata informers and avoid having to receive and deserialize
non-metadata updates from the API Server.

See elastic/elastic-agent#5580 for an example of
how this could be used. I'm planning to add the metadata informer from
that PR to this library as well. Together, these will allow us to
greatly reduce memory used for processing and storing ReplicaSets and
Jobs in beats and elastic-agent.

This will help elastic/elastic-agent#5580 and
elastic/elastic-agent#4729 specifically, and
elastic/elastic-agent#3801 in general.
@swiatekm
Contributor

swiatekm commented Oct 22, 2024

Several different problems impacting agent memory consumption have been discussed in this issue and some of the linked issues. I'd like to summarize the current state and work towards closing this in favor of more specific sub-issues.

  1. Agent and beats store too much ReplicaSet data, leading to high memory consumption in large clusters. Addressed by Agent and beats store too much ReplicaSet data in K8s #5623.
  2. General memory consumption increase between 8.14 and 8.15. I'm confident this is Queue keeps stale event data in memory in 8.15 beats#41355.
  3. Agent itself using too much memory when there are a lot of Pods running on the Node. Will be split into its own issue. EDIT: Elastic agent uses too much memory per Pod in k8s #5835

If there's anything I missed, please let me know. Once I open an issue for 3, I'd like to close this one.

@cmacknz
Member

cmacknz commented Oct 22, 2024

Sounds good to me, thanks for getting to the bottom of this.

@swiatekm
Contributor

I've moved the per-Pod memory issue to #5835. I'm going to close this one to keep the discussion focused. Feel free to reopen if you believe you're facing an issue different than the ones listed in #4729 (comment). If you want to verify if the fixes address your specific problem, you can use the current snapshot build for any branch.
