Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Informer controller should have an exponential backoff in case of misconfigured finalizer #559

Open
charandas opened this issue Dec 19, 2024 · 1 comment

Comments

@charandas
Copy link
Contributor

I ran into a situation with Playlists, when switching from the OpinionatedWatcher implementation to OpinionatedReconciler, wherein I had a resource already saved with a finalizer named like below:

apiVersion: v1
items:
- apiVersion: playlist.grafana.app/v0alpha1
  kind: Playlist
  metadata:
    finalizers:
    - playlist-playlists-finalizer

When I ran the OpinionatedReconciler version with the finalizer not matching the existing, the InformerController code repeatedly attempted to perform these lines and propagating the NotFound error in a way that I saw the below logs, which are misleading:

DEBUG[12-19|12:17:25] grafana-aggregator: "/apis/playlist.grafana.app/v0alpha1/namespaces/stacks-1/playlists/adkj6k6" satisfied by NotFoundHandler logger=grafana-apiserver

Perhaps we can improve the logic in 2 ways:

  1. Warn the user about the misconfiguration, one can only hope that they don't have mixed objects at this opinion with 2 sets of finalizer values
  2. Alternatively, offer a way of fixing the issue by resorting to the newer value for existing objects - not sure how to intelligently do so without user hint

Additionally, we need to fix the infinite try loop that is sure to peg the processor:

  1. Do this by performing the retry logic within limits with exponential backoff
@IfSentient
Copy link
Contributor

Errors returned from a reconciler should be retried based on the RetryPolicy for the InformerController, which, by default, is an Exponential Backoff policy. The only way I see a tight loop happening here is a bad RetryPolicy, or (more likely), continuous reconcile events triggered by patches (in which case, the patch would have been successful, and the reconciler should have been removed?)

The linked code is for a deleted resources (deletionTimestamp is non-nil), did this happen on a delete? I think I see the path here on what might be happening: the resource is in a "deleted" state (deletionTimestamp is non-nil), but we attempt to add the finalizer because it looks like we don't check if the resource is pending deletion in that condition. This should in theory fail, as finalizers cannot be added once deletionTimestamp is set, but if it succeeded, or if that triggered another reconciler event for any reason, then we'd either then remove the finalizer next, or try to add the finalizer again each time we get a reconcile event. This could then keep re-triggering in a loop of reconcile events.

How frequently did this error appear in the logs? It is also possible it could be coming from cache re-syncs, as those were set to 30s prior to v0.22.0, though it looks like the playlists app is using v0.23.1, so that seems less likely.

I think either way more logging and tracing in the OPinionatedReconciler at the very least matching the debug logging in the OpinionatedWatcher would be helpful here, I'll get a PR in to add that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants