Add failure modes telemetry #646
Conversation
Issue #509
@rdimitrov @domdomegg @tadasant Is there a possibility where we can run a deployment on staging for some time and then push to production? I wanted to check the cardinality numbers for this release before it goes to production, something like this (screenshot); here it is high because of the many deployments I have done, which won't be the case for production. I have done thorough testing around this, but I still wanted to see if this is an option.
I believe what we have now is what you describe: all commits to main get deployed to staging, and only the latest release gets deployed to prod. Would that work for you, or perhaps you'll need something else?
@rdimitrov That works, assuming a release to the production env happens only once we verify the above details; basically, we might have to wait for another release to be deployed in the meantime. Just being cautious here: looking at the workflow files, it looks like everything is getting deployed to production. Is there anywhere I can check the
@rdimitrov Thank you for clearing this up 🙂 I should have clarified earlier, my bad. I am not looking at the release process for a specific tag; I wanted to get the stats for this branch before it gets deployed to production. In
Ah yeah, this is how it used to work. But we changed it a couple of weeks ago so that the deploy workflow only deploys main to staging by default. And to get to production you need to create a GitHub release.
Happy to try this in staging!
Super, @domdomegg. I will do final testing on my end once again, and we can then take it to staging.
Should get deployed to staging automatically :)
Thank you @domdomegg
Telemetry to cover failure modes that are not covered by container logs, plus metrics for finding resource constraints.
Motivation and Context
We should be notified when there is any issue with the registry container.
How Has This Been Tested?
Breaking Changes
Types of changes
Checklist
Additional context
No additional exporter is used; this takes advantage of the OpenTelemetry Collector.
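As a rough sketch (the receiver, processor, and exporter names here are assumptions for illustration, not this PR's actual configuration), the new signals can be wired into the Collector's existing pipelines so that no new exporter component is introduced:

```yaml
# Hypothetical Collector pipeline wiring; component names are assumed.
service:
  pipelines:
    metrics:
      receivers: [kubeletstats]   # new: resource-constraint metrics
      processors: [batch]
      exporters: [otlp]           # pre-existing exporter, reused as-is
    logs:
      receivers: [k8s_events]     # new: Kubernetes events as log records
      processors: [batch]
      exporters: [otlp]
```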
It covers metrics related to resource constraints, currently limited to the default namespace.
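One way to enforce that namespace scoping is the contrib filter processor with an OTTL condition; a minimal sketch under that assumption (the PR may limit the namespace differently, e.g. at the receiver):

```yaml
# Hypothetical filter; drops metric datapoints from all other namespaces.
processors:
  filter/default-namespace:
    metrics:
      datapoint:
        - 'resource.attributes["k8s.namespace.name"] != "default"'
```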
Captures Kubernetes events as logs, which are the key source for diagnosing service problems; this covers scenarios where a pod cannot start and the failure would otherwise be missed because no container logs exist in such cases. Limited to the default namespace.
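For reference, the contrib k8s_events receiver emits Kubernetes events as log records and can be scoped to a namespace; a minimal sketch, assuming that receiver is the one in use here:

```yaml
receivers:
  k8s_events:
    auth_type: serviceAccount
    namespaces: [default]   # watch events in the default namespace only
```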
Handles DaemonSet deployment, i.e. deploying the OTel Collector as an agent, by using the correct filtering (see the sketch below).
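In agent (DaemonSet) mode, the common pattern is for each Collector to scrape only its own node's kubelet, using a node name injected via the Downward API; a sketch of that pattern, not this PR's actual manifest:

```yaml
receivers:
  kubeletstats:
    auth_type: serviceAccount
    # K8S_NODE_NAME is assumed to be injected in the DaemonSet pod spec via
    # fieldRef: spec.nodeName, so each agent reads only its local kubelet.
    endpoint: "${env:K8S_NODE_NAME}:10250"
```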
The only factors contributing to cardinality are pod IDs (though this needs more observation); node IDs will not increase cardinality, since scaling up leads to a limited number of nodes.
Resource metrics are shipped every 60s; the list of metrics that will be emitted is documented at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/metadata.yaml
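The 60s cadence corresponds to the receiver's collection interval; a hedged sketch (the metric_groups values are illustrative, not confirmed from this PR):

```yaml
receivers:
  kubeletstats:
    collection_interval: 60s              # ship resource metrics every 60s
    metric_groups: [node, pod, container] # assumed grouping
```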
Container errors