cert-manager infinite retries without backoff #112

@Kumm-Kai

What happened?

When stackit-cert-manager-webhook is given expired credentials or is otherwise misconfigured, cert-manager retries the Challenge object indefinitely, without any exponential backoff.
This happens in every error case where the returned error contains details that change between attempts (e.g., a timestamp or request ID). The error returned by the webhook is persisted in the Challenge object (status.reason), and cert-manager reconciles the Challenge again whenever it detects any change to the object, including changes to .status. Each failed attempt therefore writes a new status.reason, which immediately triggers the next attempt.

This behavior is also noted in a comment inside the acmechallenges Sync function in cert-manager. Sadly, it isn't properly documented anywhere else 😔
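The loop described above can be illustrated with a small Go sketch. It is a simplified model and not cert-manager's actual code: `volatileReason`, `stableReason`, and `countStatusWrites` are hypothetical names, and the request ID stands in for any detail that changes between attempts. The model only counts how often the stored reason would change, which is what triggers a fresh reconcile.

```go
package main

import "fmt"

// volatileReason mimics an upstream error whose text embeds a detail that
// differs on every attempt (here a made-up request ID).
func volatileReason(attempt int) string {
	return fmt.Sprintf("failed fetching zone: 401 Unauthorized (request ID req-%04d)", attempt)
}

// stableReason returns the same text on every attempt.
func stableReason(_ int) string {
	return "failed fetching zone. See the stackit-cert-manager-webhook logs for more details."
}

// countStatusWrites simulates n failed attempts and counts how often the
// stored reason changes. In this simplified model every change stands for
// one status update, and therefore one immediate re-reconcile.
func countStatusWrites(n int, reason func(attempt int) string) int {
	stored := ""
	writes := 0
	for i := 0; i < n; i++ {
		next := reason(i)
		if next != stored {
			stored = next
			writes++
		}
	}
	return writes
}
```

In this model, five attempts with `volatileReason` produce five status changes, while five attempts with `stableReason` produce a single one. After that first write nothing changes any more, so the object no longer re-triggers itself.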

How can we reproduce this?

We noticed this when someone tried to use a deleted service account key. To reproduce:

  1. Create a new project
  2. Create a new service account (no need to actually add it to a project)
  3. Create a new service account key, save it for later use, then delete the key again
  4. Deploy cert-manager
  5. Deploy stackit-cert-manager-webhook (helm install stackit-cert-manager-webhook -n cert-manager stackit-cert-manager-webhook/stackit-cert-manager-webhook --set stackitSaAuthentication.enabled=true) and create the cert-manager/stackit-sa-authentication secret using the key saved in step 3
  6. Create an issuer and certificate

Observe the issue:

  1. Check the events of the Challenge resource
  2. Run kubectl get challenges.acme.cert-manager.io -w (expect roughly 4 changes per second)
  3. Check the cert-manager logs
  4. Check the stackit-cert-manager-webhook logs

Additional context

To fix this properly, we must sanitize every error that we don't control before returning it to cert-manager. The original error can still be logged, so the returned error only needs to state in general terms what failed and, optionally, point the user to the stackit-cert-manager-webhook logs (e.g., "failed fetching zone. See the stackit-cert-manager-webhook logs for more details.").
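A minimal Go sketch of that idea is shown here. It is not the webhook's actual code: `fetchZone`, `sanitizedError`, and `presentChallenge` are hypothetical names, and `fetchZone` simply simulates an upstream call that fails with a different request ID each time.

```go
package main

import (
	"fmt"
	"log"
)

// fetchZone is a hypothetical stand-in for an upstream API call. It always
// fails, and its error text contains a per-request detail.
func fetchZone(requestID string) error {
	return fmt.Errorf("401 Unauthorized (request ID %s)", requestID)
}

// sanitizedError logs the original error for operators and returns a fixed
// message that contains no per-request details.
func sanitizedError(operation string, original error) error {
	log.Printf("%s: %v", operation, original)
	return fmt.Errorf("%s. See the stackit-cert-manager-webhook logs for more details.", operation)
}

// presentChallenge sketches a solver step that only ever returns
// sanitized errors to its caller.
func presentChallenge(requestID string) error {
	if err := fetchZone(requestID); err != nil {
		return sanitizedError("failed fetching zone", err)
	}
	return nil
}
```

With this shape, two failing calls with different request IDs return identical error text, so the message persisted in status.reason would stay the same across attempts, while the full upstream error remains available in the logs.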

Search

  • I did search for other open and closed issues before opening this.

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

bug (Something isn't working)
