[Meta]Investigate resource consumption of Elastic Agent with K8s Integration #3801

Open
4 of 10 tasks
gizas opened this issue Nov 22, 2023 · 15 comments
Labels
Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team

Comments

@gizas
Contributor

gizas commented Nov 22, 2023

Background

Recent issues such as 3863, 3991 and 4081 showed that installing Elastic Agent with the default configuration of our Kubernetes Integration can leave customers in a bad state, in some cases even with a broken k8s cluster. Many details and variables affect the final setup and installation of our observability solution, and we try to summarise and list them here.

Goals

This issue summarises the next actions we need in order to investigate:

  • The current resource consumption of the default Elastic Agent with the K8s Integration
  • Alternatives we can offer to minimise the impact on k8s cluster resource consumption across different k8s environments and customer setups.

Actions

Current Actions

So far we have observed that:
a) Memory consumption of Elastic Agent has increased from version 8.8 to 8.9 and later (relevant: https://github.com/elastic/sdh-beats/issues/3863#issuecomment-1733750863)
b) The number of API calls towards the Kubernetes control plane API has increased since version 8.9 (see Salesforce 01507229 regarding Elastic Agent overloading the Kubernetes API server: https://github.com/elastic/sdh-beats/issues/3991#issuecomment-1787648161)
c) CPU consumption has been raised here as a concern, although it is not a big issue at the moment and not a first priority.

Until now:

  • Since 8.11 we have updated elastic-agent-autodiscover to v0.6.4 (beats PR), which disables metadata enrichment for deployments and cronjobs. Pods created from deployments or cronjobs no longer have the extra kubernetes.deployment or kubernetes.cronjob metadata field.
  • We have merged leader election configuration variables
  • We have proposed a way to disable Leader Election in Managed Elastic Agents (see here)

Next Planned Actions

Future Plans/Actions

@gizas gizas self-assigned this Nov 22, 2023
@gizas gizas added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label Nov 22, 2023
@axw
Member

axw commented Nov 23, 2023

Run tests in real k8s clusters and retrieve diagnostics from Agent trying to investigate memory consumption

Once we've resolved the issues (or earlier, if resolving them is not straightforward and we need to iterate): I think we should also figure out how to reliably reproduce the issues in an ephemeral cluster, ideally with some automation in place to create the cluster and whatever workload is necessary to trigger the issues (e.g. create a bunch of deployments/pods/whatever).

Then we can:

  • consider performing those tests regularly to ensure we don't regress
  • more rapidly iterate on improvements and bug fixes

@gizas
Contributor Author

gizas commented Nov 23, 2023

Thanks @axw, I have updated the Next actions section a bit and added some previous ideas/issues that we can investigate here

@bturquet bturquet changed the title [Meta]Investigate resource consumption of Elastic Agent with K8s Inegration [Meta]Investigate resource consumption of Elastic Agent with K8s Integration Dec 1, 2023
@lucabelluccini
Contributor

As a short-term measure, can we somehow document the known issues / limitations we have faced so far?

@dimm0

dimm0 commented Jun 6, 2024

Is there progress in the latest version, or is it still destroying the k8s master? I disabled elastic in our cluster a while ago and am checking whether there has been any progress so far. I can't really tell whether things should improve if I upgrade.

@cmacknz
Member

cmacknz commented Jun 7, 2024

We have tracked down the source of the high memory usage on k8s and are working to fix it. #4729 is the tracking issue.

@dimm0

dimm0 commented Jun 7, 2024

And what about rate-limiting the k8s apiserver requests? Is there any work going on for that?

@gizas
Contributor Author

gizas commented Jun 11, 2024

what about rate-limiting the k8s apiserver requests

Regarding rate limiting, the main issue is this one, which is not yet prioritised for the next iterations, but it is definitely in our backlog.

Somewhat related, we have already merged 3625 to minimise any possible effect of leader election API calls. Additionally, in 8.14.0 we did a major refactoring in 37243, which we showed helps the overall resource consumption.

@constanca-m
Contributor

constanca-m commented Sep 13, 2024

Test setup

I have run a script to evaluate the performance of our K8s integration. I evaluated all 8.x.0 versions between 8.5.0 and 8.15.0.

The test increases the number of pods in a one-node cluster in these steps: 12, 61, 111, 161, 211, 311, 411, and 511.
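The pod counts above line up with the replica counts used in the test script (1, 50, 100, ... 500) plus a fixed number of pods that are already running. A minimal sketch of that arithmetic, assuming a baseline of 11 pods (inferred from the reported total of 12 pods at 1 replica; `baseline_pods` and `expected_total_pods` are hypothetical names, not part of the original script):

```shell
#!/usr/bin/env bash
# Assumption: 11 pods exist before any NGINX replica is scheduled
# (inferred from the reported totals: 12 pods at 1 replica).
baseline_pods=11

# expected_total_pods: hypothetical helper, prints baseline + replicas.
expected_total_pods () {
  # $1 = number of NGINX replicas
  echo $(( baseline_pods + $1 ))
}

for replicas in 1 50 100 150 200 300 400 500; do
  echo "replicas=$replicas total=$(expected_total_pods "$replicas")"
done
```

If the baseline differs in another cluster (for example with more system pods), only `baseline_pods` would need to change.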

I recorded the following values 5 minutes into each cycle:

  • Pods: number of pods in the cluster.
  • CPU: CPU usage of EA.
  • Memory: Memory usage of EA.
  • EA pod restarts: Restarts of EA so far.

Once the EA restarts, I stop recording results for the following pod increases, since the performance is no longer stable.
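The stop condition described above is applied by hand in these tests. A rough sketch of how it could be automated, assuming the RESTARTS value is the 5th column of a `kubectl get pods -o wide --all-namespaces` line (as in the script's own parsing); `restarts_from_line` and `should_stop` are hypothetical helper names:

```shell
#!/usr/bin/env bash
# restarts_from_line: prints the RESTARTS column (5th field) of one line of
# `kubectl get pods -o wide --all-namespaces` output.
restarts_from_line () {
  echo "$1" | awk '{ print $5 }'
}

# should_stop: succeeds (exit status 0) once the agent has restarted at least once.
# An empty or non-numeric value makes the comparison fail, so the test keeps going.
should_stop () {
  [ "$(restarts_from_line "$1")" -gt 0 ] 2>/dev/null
}
```

Inside the replica loop this could be used as `if should_stop "$line"; then break; fi`, so no further rows are recorded after the first restart.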

This is the script I am running for the tests.

#!/usr/bin/env bash

setup_cluster () {
   kind delete cluster
   kind create cluster
   # Install metrics-server so we can execute kubectl top
   kubectl apply -f https://raw.githubusercontent.com/pythianarora/total-practice/master/sample-kubernetes-code/metrics-server.yaml
}

test_n_pods () {
  # $1 - EA filename to use in kubectl apply
  # $2 - filename for the results
  # Prepare cluster with EA using kubernetes + system policy
  setup_cluster
  kubectl apply -f "$1"

  echo "| Pods | CPU | Memory | EA pod restarts |" > "$2"
  echo "|------|-----|--------|-----------------|" >> "$2"

  for replicas in 1 50 100 150 200 300 400 500 ;
    do
      # Nothing exists to delete on the first iteration, so ignore a missing deployment
      kubectl delete --ignore-not-found -f nginx-pod.yaml
      sed -i -e "s/  replicas: .*/  replicas: $replicas/g" nginx-pod.yaml
      kubectl apply -f nginx-pod.yaml
      sleep 5m

      # Match on the pod name column; an unquoted 'grep elastic*' can be glob-expanded by the shell
      top=$(kubectl top pods -n kube-system | awk '$1 ~ /^elastic/')
      pods=$(kubectl get pods --no-headers --all-namespaces | wc -l)
      line=$(kubectl get pods -o wide --all-namespaces | awk '$2 ~ /^elastic/')
      restarts=$(echo "$line" | awk '{print  $5}')

      print_results_to_file "$pods" "$top" "$restarts" "$2"
    done
}

print_results_to_file () {
    # Gets arguments:
    # $1 = number of pods
    # $2 = kubectl top result
    # $3 = number of EA restarts
    # $4 = results filename

    # Parse result of kubectl top (example 'elastic-agent-985zk                          16m          583Mi')
    cpu=$(echo "$2" | awk '{print  $2}')
    memory=$(echo "$2" | awk '{print  $3}')
    echo "| $1 | $cpu | $memory | $3 |" >> "$4"
}

# Test the performance by running test_n_pods. Replace the quoted placeholders with your own arguments.
test_n_pods "<DEPLOYMENT EA FILE GOES HERE>" "<RESULTS FILENAME GOES HERE>"

This is the NGINX pod deployment I use in the script.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 500
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.14.2
          ports:
            - containerPort: 80

8.5

Using the default configuration from the agent:

resources:
  limits:
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi

Results:

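| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 35m | 281Mi | 0 |
| 61 | 115m | 410Mi | 0 |
| 111 | 272m | 399Mi | 0 |
| 161 | 852m | 491Mi | 0 |
| 211 | 923m | 441Mi | 0 |
| 311 | 770m | 445Mi | 0 |
| 411 | 625m | 450Mi | 0 |
| 511 | 342m | 414Mi | 0 |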

8.6

Using the default configuration from the agent:

resources:
  limits:
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi

Results:

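| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 33m | 407Mi | 0 |
| 61 |  |  | 4 |

The agent no longer works from the 61-pod step onward.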

8.7

Using the default configuration from the agent:

resources:
  limits:
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi

Results:

| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 32m | 431Mi | 0 |
| 61 |  |  | 4 |

The agent no longer works from the 61-pod step onward.

8.8 - default agent configuration changes

Using the default configuration from the agent:

resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi

Results:

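| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 24m | 378Mi | 0 |
| 61 | 94m | 489Mi | 0 |
| 111 | 298m | 596Mi | 0 |
| 161 |  |  | 1 |

The agent no longer works from the 161-pod step onward.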

8.9

Using the default configuration from the agent:

resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi

Results:

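| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 32m | 421Mi | 0 |
| 61 | 92m | 533Mi | 0 |
| 111 | 250m | 639Mi | 0 |
| 161 |  |  | 1 |

The agent no longer works from the 161-pod step onward.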

8.10

Using the default configuration from the agent:

resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi

Results:

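| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 25m | 424Mi | 0 |
| 61 | 90m | 543Mi | 0 |
| 111 |  |  | 2 |

The agent no longer works from the 111-pod step onward.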

8.11

Using the default configuration from the agent:

resources:
    limits:
        memory: 700Mi
    requests:
        cpu: 100m
        memory: 400Mi

Results:

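| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 14m | 435Mi | 0 |
| 61 | 54m | 577Mi | 0 |
| 111 |  |  | 2 |

The agent no longer works from the 111-pod step onward.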

8.12

Using the default configuration from the agent:

resources:
    limits:
        memory: 700Mi
    requests:
        cpu: 100m
        memory: 400Mi

Results:

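| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 15m | 445Mi | 0 |
| 61 | 54m | 604Mi | 0 |
| 111 |  |  | 2 |

The agent no longer works from the 111-pod step onward.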

8.13

Using the default configuration from the agent:

resources:
    limits:
        memory: 700Mi
    requests:
        cpu: 100m
        memory: 400Mi

Results:

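| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 14m | 441Mi | 0 |
| 61 | 51m | 538Mi | 0 |
| 111 |  |  | 2 |

The agent no longer works from the 111-pod step onward.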

8.14

Using the default configuration from the agent:

resources:
    limits:
        memory: 700Mi
    requests:
        cpu: 100m
        memory: 400Mi

Results:

| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 13m | 580Mi | 0 |
| 61 |  |  | 1 |

The agent no longer works from the 61-pod step onward.

8.15

Using the default configuration from the agent:

resources:
    limits:
        memory: 700Mi
    requests:
        cpu: 100m
        memory: 400Mi

Results:

| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 28m | 595Mi | 0 |
| 61 |  |  | 1 |

The agent no longer works from the 61-pod step onward.


Notes

Between versions 8.5 and 8.6, something changed that caused a large memory increase in the Kubernetes integration, to the point that increasing the number of pods made the agent stop and restart over and over again.

From version 8.8, the number of pods needed to make the agent stop increased. This is a good sign, but note that the default memory limit and request also increased (limit from 500Mi to 700Mi, request from 200Mi to 400Mi). This very likely explains the seemingly better performance.

Between versions 8.9 and 8.10, the number of pods that caused the EA to stop and restart decreased again (from 161 to 111). Something changed again in the Kubernetes integration that affected the agent performance.

Between versions 8.13 and 8.14, the number of pods that caused the EA to stop and restart decreased again (from 111 to 61). Something changed again in the Kubernetes integration that affected the agent performance. Also, as @gizas noted, 8.14 uses about 140Mi more memory than 8.13 even with only 12 pods (580Mi vs 441Mi).

It seems the memory usage of the Kubernetes integration has been growing since 8.5, with notable increases in 8.6, 8.10 and 8.14 (setting aside the increase of the default memory resources in EA 8.8, which helped hide possible issues in the Kubernetes integration).
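The comparisons in these notes can be reproduced directly from the result tables. A minimal sketch, assuming all values are reported in Mi as `kubectl top` prints them; `strip_mi` and `mem_diff` are hypothetical helpers that handle only the `Mi` suffix:

```shell
#!/usr/bin/env bash
# strip_mi: hypothetical helper, turns "583Mi" into 583. Only the Mi suffix is handled.
strip_mi () {
  echo "${1%Mi}"
}

# mem_diff A B: prints A - B in Mi.
mem_diff () {
  echo $(( $(strip_mi "$1") - $(strip_mi "$2") ))
}

# 8.14 vs 8.13 memory at 12 pods (values taken from the tables above)
echo "8.14 - 8.13 at 12 pods: $(mem_diff 580Mi 441Mi)Mi"
# Headroom below the 700Mi limit for 8.12 at 61 pods
echo "8.12 headroom at 61 pods: $(mem_diff 700Mi 604Mi)Mi"
```

The first line gives the roughly 140Mi gap mentioned above; the second shows how close 8.12 already is to its limit at 61 pods, which fits the restarts seen at the next step.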

@EvelienSchellekens

@constanca-m Have you also tested whether the data is actually sent to Elastic? My setup had ~15 pods with 8.15 and memory ran high; even though the pod itself didn't restart, the K8s data didn't come in, or was very spotty (I think one of the processes itself was crashing).

@constanca-m
Contributor

constanca-m commented Sep 13, 2024

Have you also tested whether the data is actually sent to Elastic?

In my case, I can see data in Discover (I am filtering by kubernetes.container.name):

Image

I did not analyze the logs to know whether everything is being sent there or we are losing data. These are the logs from running all the tests in 8.15, including the pod restarts. @EvelienSchellekens

@gizas
Contributor Author

gizas commented Sep 13, 2024

Really useful @constanca-m !

Adding some notes here:

  • For example, in 8.12 I see 61 pods and 604Mi of memory but 0 restarts. How is it possible to use more memory than the 500Mi limit?
  • 8.13 vs 8.14 shows a 140Mi difference even with only 12 pods?
  • We don't have the agent's starting memory without the k8s integration, so we cannot calculate the overhead of the k8s integration.
    • Do the above tests include the system integration?
  • In all the above tests we don't produce logs, right?

A general comment is that all the above tests just measure memory consumption under a given k8s load. Identifying a memory leak requires watching the memory trend over time. An increase on its own is not necessarily good or bad if we observe more k8s resources.
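The point about watching the memory trend over time could be sketched as follows. This is an illustration only, not something the tests above do: `sample_memory` and `summarize_trend` are hypothetical helpers, and the sketch assumes the agent pod runs in `kube-system`, has a name starting with `elastic`, and reports memory in Mi.

```shell
#!/usr/bin/env bash
# sample_memory: appends "<epoch seconds> <memory in Mi>" lines to a file.
# Assumes the agent pod is in kube-system and its name starts with "elastic".
sample_memory () {
  # $1 = output file, $2 = number of samples, $3 = seconds between samples
  local i
  for (( i = 0; i < $2; i++ )); do
    kubectl top pods -n kube-system \
      | awk -v now="$(date +%s)" '$1 ~ /^elastic/ { sub(/Mi$/, "", $3); print now, $3; exit }' >> "$1"
    sleep "$3"
  done
}

# summarize_trend: reads "<epoch> <Mi>" lines on stdin and prints the first and
# last sample, the difference, and the average growth per minute.
summarize_trend () {
  awk 'NR == 1 { t0 = $1; m0 = $2 }
       { t1 = $1; m1 = $2 }
       END {
         if (NR < 2 || t1 == t0) { print "not enough samples"; exit }
         printf "first=%dMi last=%dMi delta=%dMi rate=%.1fMi/min\n", m0, m1, m1 - m0, (m1 - m0) * 60 / (t1 - t0)
       }'
}
```

For example, `sample_memory ea-mem.txt 30 60` followed by `summarize_trend < ea-mem.txt` would cover about 30 minutes at a fixed pod count. A rate that stays clearly positive while the workload is constant would be a stronger hint of a leak than a single reading.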

Additionally:

  • Since 8.11 we disable the deployment and cronjob metadata enrichment. Maybe the cluster did not have any cronjobs and had too few deployments to show any improvement.
  • Since 8.9 we have introduced the ReplicaSet and Job metadata generation. As above, this could explain some of the increase.
  • I would expect 8.15 to use less memory, as we introduced https://github.com/elastic/beats/pull/38762/files#

@constanca-m
Contributor

constanca-m commented Sep 13, 2024

Thank you @gizas.

I think this issue and the scripts to run these tests should be placed somewhere more accessible to the team. Maybe in the future repository you mentioned in Thursday's meeting to help with identifying issues.

For example, in 8.12 I see 61 pods and 604Mi of memory but 0 restarts. How is it possible to use more memory than the 500Mi limit?

The memory limit for 8.12 is 700Mi.

8.13 vs 8.14 shows a 140Mi difference even with only 12 pods?

It looks like it... This was just one test run, and values always vary a bit between runs. We could run a test with smaller pod increments to better capture the differences between these latest versions.

Edit: but since 8.15 has more or less the same values as 8.14, I believe that we do have a significant difference between 8.13 and 8.14 like you pointed out. Thanks, I will include it in the notes of the original comment as well!

Do the above tests include the system integration?

Yes. You are correct, we unfortunately don't include tests running with just the System integration. I agree, it would be good to also have an idea of that, but I don't believe the System integration is causing any issues here.

In all the above tests we don't produce logs, right?

This is the hard part! With the agent starting and restarting over and over again, it is very hard: downloading the diagnostics gets stuck in a loop, and the zip is never ready. Not sure what is going on there, but I have not paid much attention to it.

Since #3593 we disable the deployment and cronjob metadata enrichment. Maybe the cluster did not have any cronjobs and had too few deployments to show any improvement.
Since elastic/beats#35483 we have introduced the ReplicaSet and Job metadata generation. As above, this could explain some of the increase.

Correct. Only the default pods, EA, metrics server and the NGINX pod.


I believe the best approach would be to look at the changelog and see what big changes we had. I can remember the watchers issue, but since that PR includes memory tests, I don't believe it could have contributed to the degraded performance, though I could of course be wrong (and biased 😄 ).

@gizas
Contributor Author

gizas commented Sep 13, 2024

@constanca-m
Contributor

I used a different one @gizas; it is local and simpler (it is in the comment with the test results). I think it should be enough for these tests, with that script kept for more complex tests.

@MichaelKatsoulis
Contributor

I also performed some scale tests. I created a one-node cluster in GKE with ~95 pods running.
I tested versions 8.13.0 and 8.14.0 with and without kube-state-metrics to simulate leader and non-leader node scenarios.

TBH the 700Mi memory limit suffices in both versions. Only with kube-state-metrics enabled did I get one restart, which means that in big clusters (note that 110 pods per node is the limit in Kubernetes) the memory limit needs some adjustment.
Versions 8.13.0 and 8.14.0 do not seem to differ much.
For 8.13 the agent pod's memory was around 600MB, while for 8.14.0 it was around 640MB.
In all cases I used nginx pods but there were really many logs generated.

I don't know why @constanca-m got different results.

swiatekm added a commit to elastic/elastic-agent-autodiscover that referenced this issue Oct 3, 2024
…cts (#109)

We only use metadata from Jobs and ReplicaSets, but require that full
resources are supplied. This change relaxes this requirement, allowing
PartialObjectMetadata resources to be used. This allows callers to use
metadata informers and avoid having to receive and deserialize
non-metadata updates from the API Server.

See elastic/elastic-agent#5580 for an example of
how this could be used. I'm planning to add the metadata informer from
that PR to this library as well. Together, these will allow us to
greatly reduce memory used for processing and storing ReplicaSets and
Jobs in beats and elastic-agent.

This will help elastic/elastic-agent#5580 and
elastic/elastic-agent#4729 specifically, and
elastic/elastic-agent#3801 in general.