Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove status metrics for deleted resources #10597

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

ryanrolds
Copy link

@ryanrolds ryanrolds commented Feb 4, 2025

Description

The status metrics for resources are left around after the resource is deleted. If the resource was invalid, the metric would continue to report invalid causing false alarms that required restarting the controller to clear.

Sleeping for 1 second was required to avoid a race condition that resulted in the new metrics getting caught up in the clearing/unregistering. I don't like adding the sleep, but that only other option is upgrading the metrics package we are using to a newer supported one (which doesn't have the race condition, the other package use a mutex).

When we upgrade to a new package we can remove this. The E2E test should have long-term value.

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

@solo-changelog-bot
Copy link

Issues linked to changelog:
kgateway-dev#6938

@ryanrolds ryanrolds changed the title Clear resource metrics to remove metrics for deleted resources Remove status metrics for deleted resources Feb 5, 2025
@ryanrolds ryanrolds marked this pull request as ready for review February 5, 2025 17:22
@ryanrolds
Copy link
Author

/kick-ci

@ryanrolds
Copy link
Author

/kick-ci

Copy link

github-actions bot commented Feb 5, 2025

Visit the preview URL for this PR (updated for commit 79164e8):

https://gloo-edge--pr10597-rolds-resource-metri-eq3r3xf3.web.app

(expires Fri, 14 Feb 2025 17:25:49 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

Sign: 77c2b86e287749579b7ff9cadb81e099042ef677

@ryanrolds
Copy link
Author

/kick-ci

@ryanrolds
Copy link
Author

/kick

Copy link
Collaborator

@nfuden nfuden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont quite understand the rationale for clearing all statuses rather than doing something a bit more targeted and am concerned it could cause some flicker in metrics.

If an extra example can be added to prove otherwise then I am onboard

pkg/utils/statsutils/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/utils/statsutils/metrics/metrics.go Outdated Show resolved Hide resolved
Comment on lines +178 to +193
// Iterate through the resource metrics and unregister them
for _, metric := range m.metrics {
v := view.Find(metric.gauge.Name())
if v != nil {
view.Unregister(v)
someViewsUnregistered = true
}
}

// Only sleep when some metrics were unregistered
if someViewsUnregistered {
// Wait for the view to be unregistered (a channel is used)
// This is necessary because the view is unregistered asynchronously.
// We may not need this after we upgrade to an newer metrics package
time.Sleep(1 * time.Second)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Iterate through the resource metrics and unregister them
for _, metric := range m.metrics {
v := view.Find(metric.gauge.Name())
if v != nil {
view.Unregister(v)
someViewsUnregistered = true
}
}
// Only sleep when some metrics were unregistered
if someViewsUnregistered {
// Wait for the view to be unregistered (a channel is used)
// This is necessary because the view is unregistered asynchronously.
// We may not need this after we upgrade to an newer metrics package
time.Sleep(1 * time.Second)
}
// Iterate through the resource metrics and determine which need to be deregistered
toUnregister := []*view.View{}
for _, metric := range m.metrics {
v := view.Find(metric.gauge.Name())
if v != nil {
toUnregister = append(toUnregister, v)
}
}
// Unregister any found views
// This batches all the requests and then blocks until they are complete!
view.Unregister(toUnregister...)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

view.Unregister is a metric level operation. It will remove all the metrics we just added.

@@ -402,6 +408,10 @@ func (s *statusSyncer) extractCurrentReports() (reporter.ResourceReports, map[re
func (s *statusSyncer) syncStatus(ctx context.Context) error {
allReports, inputResourceBySubresourceStatuses, localInputResourceLastStatus := s.extractCurrentReports()

// reset the resource metrics before we write the new statuses, this ensures that
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cant this cause flickering to an operator?
Why cant we set new statuses and then clear all out of compliance ones after?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not able to clear out metrics at the label/attribute level. Only at the metric-level (view), which includes all labels).

@ryanrolds
Copy link
Author

I dont quite understand the rationale for clearing all statuses rather than doing something a bit more targeted and am concerned it could cause some flicker in metrics.

The API provided by the metrics packages just don't provide the level of access needed to filter/remove specific label combinations from the metrics/views. I wish it wasn't that way. The same issue exists in the package we would like to move to, but it won't require a sleep reducing the data race we are both concerned about.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants