-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove status metrics for deleted resources #10597
base: main
Are you sure you want to change the base?
Conversation
Issues linked to changelog: |
/kick-ci |
/kick-ci |
Visit the preview URL for this PR (updated for commit 79164e8): https://gloo-edge--pr10597-rolds-resource-metri-eq3r3xf3.web.app (expires Fri, 14 Feb 2025 17:25:49 GMT) 🔥 via Firebase Hosting GitHub Action 🌎 Sign: 77c2b86e287749579b7ff9cadb81e099042ef677 |
/kick-ci |
/kick |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont quite understand the rationale for clearing all statuses rather than doing something a bit more targeted and am concerned it could cause some flicker in metrics.
If an extra example can be added to prove otherwise then I am onboard
// Iterate through the resource metrics and unregister them | ||
for _, metric := range m.metrics { | ||
v := view.Find(metric.gauge.Name()) | ||
if v != nil { | ||
view.Unregister(v) | ||
someViewsUnregistered = true | ||
} | ||
} | ||
|
||
// Only sleep when some metrics were unregistered | ||
if someViewsUnregistered { | ||
// Wait for the view to be unregistered (a channel is used) | ||
// This is necessary because the view is unregistered asynchronously. | ||
// We may not need this after we upgrade to an newer metrics package | ||
time.Sleep(1 * time.Second) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Iterate through the resource metrics and unregister them | |
for _, metric := range m.metrics { | |
v := view.Find(metric.gauge.Name()) | |
if v != nil { | |
view.Unregister(v) | |
someViewsUnregistered = true | |
} | |
} | |
// Only sleep when some metrics were unregistered | |
if someViewsUnregistered { | |
// Wait for the view to be unregistered (a channel is used) | |
// This is necessary because the view is unregistered asynchronously. | |
// We may not need this after we upgrade to an newer metrics package | |
time.Sleep(1 * time.Second) | |
} | |
// Iterate through the resource metrics and determine which need to be deregistered | |
toUnregister := []*view.View{} | |
for _, metric := range m.metrics { | |
v := view.Find(metric.gauge.Name()) | |
if v != nil { | |
toUnregister = append(toUnregister, v) | |
} | |
} | |
// Unregister any found views | |
// This batches all the requests and then blocks until they are complete! | |
view.Unregister(toUnregister...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
view.Unregister
is a metric level operation. It will remove all the metrics we just added.
@@ -402,6 +408,10 @@ func (s *statusSyncer) extractCurrentReports() (reporter.ResourceReports, map[re | |||
func (s *statusSyncer) syncStatus(ctx context.Context) error { | |||
allReports, inputResourceBySubresourceStatuses, localInputResourceLastStatus := s.extractCurrentReports() | |||
|
|||
// reset the resource metrics before we write the new statuses, this ensures that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cant this cause flickering to an operator?
Why cant we set new statuses and then clear all out of compliance ones after?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not able to clear out metrics at the label/attribute level. Only at the metric-level (view), which includes all labels).
Co-authored-by: Nathan Fudenberg <[email protected]>
Co-authored-by: Nathan Fudenberg <[email protected]>
The API provided by the metrics packages just don't provide the level of access needed to filter/remove specific label combinations from the metrics/views. I wish it wasn't that way. The same issue exists in the package we would like to move to, but it won't require a sleep reducing the data race we are both concerned about. |
Description
The status metrics for resources are left around after the resource is deleted. If the resource was invalid, the metric would continue to report invalid causing false alarms that required restarting the controller to clear.
Sleeping for 1 second was required to avoid a race condition that resulted in the new metrics getting caught up in the clearing/unregistering. I don't like adding the sleep, but that only other option is upgrading the metrics package we are using to a newer supported one (which doesn't have the race condition, the other package use a mutex).
When we upgrade to a new package we can remove this. The E2E test should have long-term value.
Checklist: