
Emit Pod data only for running Pods in the Kubernetes provider #6011

Open · wants to merge 1 commit into base: main

Conversation

@swiatekm (Contributor) commented Nov 13, 2024

What does this PR do?

The kubernetes provider emits data each time a Pod gets updated, even if the Pod is not yet running. As a result, every time a new Pod is spawned, we get multiple updates as it is created, scheduled, containers are created, and so on. Instead, emit the data only if the Pod is actually running.
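
Conceptually, the change boils down to a guard on the Pod phase before emitting. The sketch below uses client-go's corev1 types to illustrate the idea; the function name and surrounding wiring are hypothetical, not the provider's actual API:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// shouldEmitPodData reports whether the provider should publish data for a Pod.
// Previously every update (created, scheduled, containers starting, ...) caused
// an emit; with this change, only Pods in the Running phase are published.
func shouldEmitPodData(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodRunning
}

func main() {
	pod := &corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodPending}}
	fmt.Println(shouldEmitPodData(pod)) // false: Pod is still Pending, skip this update

	pod.Status.Phase = corev1.PodRunning
	fmt.Println(shouldEmitPodData(pod)) // true: Pod is Running, emit its data
}
```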

Why is it important?

Configuration reloading can be quite expensive when there are a lot of Pods on the Node. We should avoid doing so unnecessarily. This change should help #5835 and #5991.

Note that in principle this exposes us to a new failure mode. It's possible for the Pod to successfully finish running and be removed before we push the new configuration to beats. This was much less likely when we included the Pod when it was scheduled, but still possible with our 100ms debounce timer in the coordinator. I think the tradeoff is worth it, considering the issues with config reloading on large Nodes.
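
For context, the 100ms debounce mentioned above behaves roughly like the sketch below: bursts of provider notifications collapse into a single configuration update that fires once the notifications stop. This is a generic illustration of the pattern, not the coordinator's actual implementation:

```go
package main

import (
	"fmt"
	"time"
)

// debounce invokes apply once no new notification has arrived for interval.
// Rapid bursts of Pod updates therefore trigger a single reload rather than one per event.
func debounce(interval time.Duration, notifications <-chan struct{}, apply func()) {
	var timer *time.Timer
	for range notifications {
		if timer != nil {
			timer.Stop()
		}
		timer = time.AfterFunc(interval, apply)
	}
}

func main() {
	changes := make(chan struct{})
	go debounce(100*time.Millisecond, changes, func() {
		fmt.Println("pushing new configuration to beats")
	})

	// Five quick updates result in one configuration push ~100ms after the last one.
	for i := 0; i < 5; i++ {
		changes <- struct{}{}
	}
	time.Sleep(300 * time.Millisecond)
}
```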

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

How to test this PR locally

Deploy the agent in a local kind cluster using either the default manifests or the Helm Chart.

Related issues

@swiatekm added the enhancement (New feature or request) label Nov 13, 2024

mergify bot commented Nov 13, 2024

This pull request does not have a backport label. Could you fix it @swiatekm? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-8.\d is the label to automatically backport to the 8.\d branch, where \d is the digit

mergify bot commented Nov 13, 2024

backport-v8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

mergify bot added the backport-8.x (Automated backport to the 8.x branch with mergify) label Nov 13, 2024
@swiatekm force-pushed the fix/k8sprovider/pod-update-running branch from adeed61 to e1d996f on November 13, 2024 14:46
@swiatekm swiatekm marked this pull request as ready for review November 13, 2024 14:47
@swiatekm swiatekm requested review from a team as code owners November 13, 2024 14:47

@pkoutsovasilis (Contributor) left a comment

This change makes sense to me; LGTM

@gizas (Contributor) commented Nov 13, 2024

@swiatekm thanks for this. This makes sense.

I want to make sure about some corner cases, like:

  • Pods created from CronJobs
  • Pods that are restarting (e.g. due to OOM or image pull errors): how will this behave in such cases?

I think this is a big change, and before merging I would like some manual e2e tests to be included here.

@swiatekm (Contributor, Author) commented Nov 13, 2024

> @swiatekm thanks for this. This makes sense.
>
> I want to make sure about some corner cases, like:
>
> * Pods created from CronJobs

There isn't really any difference. It's possible for the Pod to finish running before we update the configuration, but that is possible right now too. I'm not sure whether this is a problem either way: scrapers probably wouldn't get any metrics, and container logs should remain for filebeat to read until the Pod is deleted.

> * Pods that are restarting (e.g. due to OOM or image pull errors): how will this behave in such cases?

To clarify: this PR changes when the metadata stored by the kubernetes provider is updated, and when notifications are delivered to the coordinator. Once we store a Pod's metadata (that is, once the Pod becomes Running at any point), we'll continue to include it until the Pod is deleted.

If the container gets OOMKilled in the meantime, we'll simply continue using the same metadata we already have.

If there's an image pulling problem, the Pod will never start any of its containers, and we'll never have metadata about it.
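
Put differently, the behaviour can be summarised by a store like the one below. This is a rough sketch with hypothetical names (the real provider tracks richer metadata and emits mappings to the coordinator), but it captures the rules above: metadata is added only once a Pod reaches Running, and removed only on deletion:

```go
package provider

import (
	corev1 "k8s.io/api/core/v1"
)

// podStore keeps metadata for Pods that have reached the Running phase.
type podStore struct {
	metadata map[string]*corev1.Pod // keyed by Pod UID
}

// onUpdate stores (or refreshes) metadata only once the Pod is Running.
// A Pod that never starts, e.g. because of an image pull error, never gets an entry.
// A Pod whose container later gets OOMKilled keeps the entry it already has.
func (s *podStore) onUpdate(pod *corev1.Pod) {
	if pod.Status.Phase == corev1.PodRunning {
		s.metadata[string(pod.UID)] = pod
	}
}

// onDelete is the only place an entry is removed.
func (s *podStore) onDelete(pod *corev1.Pod) {
	delete(s.metadata, string(pod.UID))
}
```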

> I think this is a big change, and before merging I would like some manual e2e tests to be included here.

What kind of scenarios would you like me to test? I've already done the happy paths with the current standalone manifest.

@blakerouse (Contributor) commented:

> > @swiatekm thanks for this. This makes sense.
> >
> > I want to make sure about some corner cases, like:
> >
> > * Pods created from CronJobs
>
> There isn't really any difference. It's possible for the Pod to finish running before we update the configuration, but that is possible right now too. I'm not sure whether this is a problem either way: scrapers probably wouldn't get any metrics, and container logs should remain for filelog to read until the Pod is deleted.

We actually delay the removal of a Pod after it is stopped, to ensure that its entry exists for a period of time. This is needed so its log files can still be read (think nginx access logs from inside the container). I think that should handle the cases where the status is past Running: crashed, stopping, stopped, etc.
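
The delayed removal mentioned here could be pictured along these lines (purely illustrative; the actual grace period and cleanup mechanism in the provider may differ):

```go
package provider

import (
	"sync"
	"time"
)

// delayedStore keeps Pod metadata keyed by UID and removes entries only after
// a grace period, so log files from a stopped container can still be collected
// before the configuration that references them goes away.
type delayedStore struct {
	mu       sync.Mutex
	metadata map[string]any
}

// remove schedules the deletion instead of performing it immediately.
func (s *delayedStore) remove(uid string, gracePeriod time.Duration) {
	time.AfterFunc(gracePeriod, func() {
		s.mu.Lock()
		defer s.mu.Unlock()
		delete(s.metadata, uid)
	})
}
```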

@swiatekm added the backport-8.16 (Automated backport with mergify) label Nov 13, 2024

@swiatekm (Contributor, Author) commented:

For the record, I considered other approaches to this problem, but trying to debounce the updates in a more clever way made things more difficult to reason about and ultimately not much better than the current solution.

@swiatekm force-pushed the fix/k8sprovider/pod-update-running branch from e1d996f to 4392d09 on November 14, 2024 10:46

@blakerouse (Contributor) left a comment

Looks good, thanks for the change.

Labels
backport-8.x (Automated backport to the 8.x branch with mergify) · backport-8.16 (Automated backport with mergify) · enhancement (New feature or request)

4 participants