
Emit Pod data only for running Pods in the Kubernetes provider #6011

Open · wants to merge 1 commit into base: main

Conversation

@swiatekm (Contributor) commented Nov 13, 2024

What does this PR do?

The kubernetes provider emits data each time a Pod gets updated, even if the Pod is not yet running. As a result, every time a new Pod is spawned, we get multiple updates as it is created, scheduled, containers are created, and so on. Instead, emit the data only if the Pod is actually running.
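
Conceptually, the change boils down to a guard on the Pod phase before emitting. The sketch below uses client-go's corev1 types to illustrate the idea; the function name and surrounding wiring are hypothetical, not the provider's actual API:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// shouldEmitPodData reports whether the provider should publish data for a Pod.
// Previously every update (created, scheduled, containers starting, ...) caused
// an emit; with this change, only Pods in the Running phase are published.
func shouldEmitPodData(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodRunning
}

func main() {
	pod := &corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodPending}}
	fmt.Println(shouldEmitPodData(pod)) // false: Pod is still Pending, skip this update

	pod.Status.Phase = corev1.PodRunning
	fmt.Println(shouldEmitPodData(pod)) // true: Pod is Running, emit its data
}
```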

Why is it important?

Configuration reloading can be quite expensive when there are a lot of Pods on the Node. We should avoid doing so unnecessarily. This change should help #5835 and #5991.

Note that in principle this exposes us to a new failure mode. It's possible for the Pod to successfully finish running and be removed before we push the new configuration to beats. This was much less likely when we included the Pod when it was scheduled, but still possible with our 100ms debounce timer in the coordinator. I think the tradeoff is worth it, considering the issues with config reloading on large Nodes.
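
For context, the 100ms debounce mentioned above behaves roughly like the sketch below: bursts of provider notifications collapse into a single configuration update that fires once the notifications stop. This is a generic illustration of the pattern, not the coordinator's actual implementation:

```go
package main

import (
	"fmt"
	"time"
)

// debounce invokes apply once no new notification has arrived for interval.
// Rapid bursts of Pod updates therefore trigger a single reload rather than one per event.
func debounce(interval time.Duration, notifications <-chan struct{}, apply func()) {
	var timer *time.Timer
	for range notifications {
		if timer != nil {
			timer.Stop()
		}
		timer = time.AfterFunc(interval, apply)
	}
}

func main() {
	changes := make(chan struct{})
	go debounce(100*time.Millisecond, changes, func() {
		fmt.Println("pushing new configuration to beats")
	})

	// Five quick updates result in one configuration push ~100ms after the last one.
	for i := 0; i < 5; i++ {
		changes <- struct{}{}
	}
	time.Sleep(300 * time.Millisecond)
}
```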

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

How to test this PR locally

Deploy the agent in a local kind cluster using either the default manifests or the Helm Chart.

Related issues

@swiatekm added the enhancement (New feature or request) label Nov 13, 2024

mergify bot commented Nov 13, 2024

This pull request does not have a backport label. Could you fix it @swiatekm? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-8.\d is the label to automatically backport to the 8.\d branch, where \d is the digit

mergify bot commented Nov 13, 2024

backport-v8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

mergify bot added the backport-8.x (Automated backport to the 8.x branch with mergify) label Nov 13, 2024
@swiatekm force-pushed the fix/k8sprovider/pod-update-running branch from adeed61 to e1d996f on November 13, 2024 14:46
@swiatekm swiatekm marked this pull request as ready for review November 13, 2024 14:47
@swiatekm swiatekm requested review from a team as code owners November 13, 2024 14:47

@pkoutsovasilis (Contributor) left a comment

This change makes sense to me; LGTM

@gizas (Contributor) commented Nov 13, 2024

@swiatekm thanks for this. This makes sense.

I want to make sure about some corner cases, like:

  • Pods created from CronJobs
  • Pods that are restarting (e.g. due to OOM or image pull errors): how will this behave in such cases?

I think this is a big change, and before merging I would like some manual e2e tests to be included here.

@swiatekm (Contributor, Author) commented Nov 13, 2024

> @swiatekm thanks for this. This makes sense.
>
> I want to make sure about some corner cases, like:
>
> * Pods created from CronJobs

There isn't really any difference. It's possible for the Pod to finish running before we update the configuration, but that is possible right now too. I'm not sure whether this is a problem either way: scrapers probably wouldn't get any metrics, and container logs should remain for filebeat to read until the Pod is deleted.

> * Pods that are restarting (e.g. due to OOM or image pull errors): how will this behave in such cases?

To clarify: this PR changes when the metadata stored by the kubernetes provider is updated, and when notifications are delivered to the coordinator. Once we store a Pod's metadata (that is, once the Pod becomes Running at any point), we'll continue to include it until the Pod is deleted.

If the container gets OOMKilled in the meantime, we'll simply continue using the same metadata we already have.

If there's an image pulling problem, the Pod will never start any of its containers, and we'll never have metadata about it.
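
Put differently, the behaviour can be summarised by a store like the one below. This is a rough sketch with hypothetical names (the real provider tracks richer metadata and emits mappings to the coordinator), but it captures the rules above: metadata is added only once a Pod reaches Running, and removed only on deletion:

```go
package provider

import (
	corev1 "k8s.io/api/core/v1"
)

// podStore keeps metadata for Pods that have reached the Running phase.
type podStore struct {
	metadata map[string]*corev1.Pod // keyed by Pod UID
}

// onUpdate stores (or refreshes) metadata only once the Pod is Running.
// A Pod that never starts, e.g. because of an image pull error, never gets an entry.
// A Pod whose container later gets OOMKilled keeps the entry it already has.
func (s *podStore) onUpdate(pod *corev1.Pod) {
	if pod.Status.Phase == corev1.PodRunning {
		s.metadata[string(pod.UID)] = pod
	}
}

// onDelete is the only place an entry is removed.
func (s *podStore) onDelete(pod *corev1.Pod) {
	delete(s.metadata, string(pod.UID))
}
```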

> I think this is a big change, and before merging I would like some manual e2e tests to be included here.

What kind of scenarios would you like me to test? I've already done the happy paths with the current standalone manifest.

@blakerouse (Contributor) commented:

> > @swiatekm thanks for this. This makes sense.
> >
> > I want to make sure about some corner cases, like:
> >
> > * Pods created from CronJobs
>
> There isn't really any difference. It's possible for the Pod to finish running before we update the configuration, but that is possible right now too. I'm not sure whether this is a problem either way: scrapers probably wouldn't get any metrics, and container logs should remain for filelog to read until the Pod is deleted.

We actually delay the removal of a Pod after it is stopped, to ensure that its entry exists for a period of time. This is needed so its log files can still be read (think nginx access logs from inside the container). I think that should handle the cases where the status is past Running: crashed, stopping, stopped, etc.
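
The delayed removal mentioned here could be pictured along these lines (purely illustrative; the actual grace period and cleanup mechanism in the provider may differ):

```go
package provider

import (
	"sync"
	"time"
)

// delayedStore keeps Pod metadata keyed by UID and removes entries only after
// a grace period, so log files from a stopped container can still be collected
// before the configuration that references them goes away.
type delayedStore struct {
	mu       sync.Mutex
	metadata map[string]any
}

// remove schedules the deletion instead of performing it immediately.
func (s *delayedStore) remove(uid string, gracePeriod time.Duration) {
	time.AfterFunc(gracePeriod, func() {
		s.mu.Lock()
		defer s.mu.Unlock()
		delete(s.metadata, uid)
	})
}
```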

@swiatekm added the backport-8.16 (Automated backport with mergify) label Nov 13, 2024

@swiatekm (Contributor, Author) commented:

For the record, I considered other approaches to this problem, but trying to debounce the updates in a more clever way made things more difficult to reason about and ultimately not much better than the current solution.

@swiatekm force-pushed the fix/k8sprovider/pod-update-running branch from e1d996f to 4392d09 on November 14, 2024 10:46

@blakerouse (Contributor) left a comment

Looks good, thanks for the change.

Labels
backport-8.x (Automated backport to the 8.x branch with mergify) · backport-8.16 (Automated backport with mergify) · enhancement (New feature or request)

4 participants