Emit Pod data only for running Pods in the Kubernetes provider #6011
base: main
Conversation
This pull request does not have a backport label. Could you fix it @swiatekm? 🙏
Force-pushed from adeed61 to e1d996f
This change makes sense to me; LGTM
@swiatekm thanks for this. This makes sense. I want to make sure corner cases like a container being OOMKilled, or an image failing to pull, are still covered.
I guess this is a big change, and before merging I'd like some manual e2e tests to be included.
There isn't really any difference. It's possible for the Pod to be finished running before we update the configuration, but this is possible right now too. I'm not sure if this is a problem either way - scrapers probably wouldn't get any metrics, and container logs should remain for filebeat to read until the Pod is deleted.
It should be noted that this PR changes when the metadata stored by the kubernetes provider is updated, and when notifications are delivered to the coordinator. Once we've stored a Pod's metadata (that is, once it becomes Running at any point), we'll continue to include it until the Pod is deleted. If a container gets OOMKilled in the meantime, we'll simply continue using the metadata we already have. If there's an image pulling problem, the Pod will never start any of its containers, and we'll never have metadata about it.
What kind of scenarios would you like me to test? I've already tested the happy paths with the current standalone manifest.
We actually delay the removal of a Pod after it is stopped, to ensure that it exists for a period of time. This is needed so that log files can still be read after the container stops (think nginx access logs from inside of the container). I think this should handle the cases where the status is past Running: crashed, stopping, stopped, etc.
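A minimal Go sketch of that delayed-removal idea, assuming a hypothetical tracker type; none of these names are the provider's real API:

```go
// Illustrative only: keep a stopped Pod's metadata around for a grace
// period so filebeat can finish reading its container log files.
package sketch

import (
	"sync"
	"time"
)

type podData struct{} // stands in for the cached Pod metadata

type tracker struct {
	mu      sync.Mutex
	pods    map[string]podData // keyed by Pod UID
	cleanup time.Duration      // grace period before removal
}

// onPodStopped schedules removal instead of deleting immediately.
func (t *tracker) onPodStopped(uid string) {
	time.AfterFunc(t.cleanup, func() {
		t.mu.Lock()
		defer t.mu.Unlock()
		delete(t.pods, uid)
	})
}
```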
For the record, I considered other approaches to this problem, but trying to debounce the updates in a more clever way made things more difficult to reason about and ultimately not much better than the current solution.
Force-pushed from e1d996f to 4392d09
Quality Gate passed
Looks good, thanks for the change.
What does this PR do?
The kubernetes provider emits data each time a Pod gets updated, even if the Pod is not yet running. As a result, every time a new Pod is spawned, we get multiple updates as it is created, scheduled, has its containers created, and so on. Instead, emit the data only once the Pod is actually running.
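A minimal Go sketch of the gating logic, using the upstream Kubernetes API types; the helper name is hypothetical, and the actual change lives in the provider's Pod event handlers:

```go
// Illustrative only: gate provider output on the Pod phase reported by
// the Kubernetes API, so data is emitted only for Running Pods.
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// shouldEmit reports whether a Pod update should produce provider data.
func shouldEmit(pod *v1.Pod) bool {
	return pod.Status.Phase == v1.PodRunning
}
```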
Why is it important?
Configuration reloading can be quite expensive when there are a lot of Pods on the Node, so we should avoid triggering it unnecessarily. This change should help with #5835 and #5991.
Note that in principle this exposes us to a new failure mode. It's possible for the Pod to successfully finish running and be removed before we push the new configuration to beats. This was much less likely when we included the Pod when it was scheduled, but still possible with our 100ms debounce timer in the coordinator. I think the tradeoff is worth it, considering the issues with config reloading on large Nodes.
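For illustration, here is a simple trailing-debounce pattern in Go. This is not the coordinator's actual implementation, just a sketch of the general technique used to collapse bursts of updates into a single reload:

```go
// Illustrative only: forward at most one signal per quiet period. Each
// incoming update resets the timer; the output fires once updates stop.
package sketch

import "time"

func debounce(in <-chan struct{}, quiet time.Duration) <-chan struct{} {
	out := make(chan struct{})
	go func() {
		var timer *time.Timer
		var fire <-chan time.Time // nil until the first update arrives
		for {
			select {
			case _, ok := <-in:
				if !ok {
					return
				}
				if timer != nil {
					timer.Stop()
				}
				timer = time.NewTimer(quiet)
				fire = timer.C
			case <-fire:
				out <- struct{}{}
				fire = nil
			}
		}
	}()
	return out
}
```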
Checklist
I have added an entry in ./changelog/fragments using the changelog tool

How to test this PR locally
Deploy the agent in a local kind cluster using either the default manifests or the Helm Chart.
Related issues