This repository has been archived by the owner on Oct 22, 2024. It is now read-only.

Implement backoff strategy for getting tokens from Operate #565

Open
markfarkas-camunda opened this issue Dec 18, 2023 · 1 comment

Comments

@markfarkas-camunda
Collaborator

Is your feature request related to a problem? Please describe.
In the SaaS environment we use a rate-limiting mechanism, which can cause serious problems for us. Connectors try to get a token (to be able to poll Operate), but this can result in 429 Too Many Requests responses because of the rate limiter. Once this happens, we can get into an infinite loop in which every connector runtime keeps trying to fetch a token and we keep getting 429 responses. This can occur because rate limiting is applied globally per region, not per cluster (see https://github.com/camunda-cloud/team-sre/issues/545). We have observed this on DEV, but it can occur in any environment.

Describe the solution you'd like
Add a backoff strategy for failed requests: increase the interval between token requests after each failed request, to avoid bombarding the /oauth/token endpoint.
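
To make the proposal concrete, here is a minimal sketch of one possible approach: capped exponential backoff with full jitter, which also honors a `Retry-After` value when the server sends one. It is an illustration, not the connector runtime's actual code. The names `backoff_delay`, `fetch_token_with_backoff`, and `request_token`, the `(status, token, retry_after)` response shape, and the default values (1 s base, 60 s cap, 6 attempts) are all assumptions made for this example. Jitter is included because many runtimes share the same per-region limit, so retrying in lockstep would likely keep tripping it.

```python
import random
import time
from typing import Callable, Optional, Tuple

# Hypothetical response shape for this sketch:
# (HTTP status code, token or None, Retry-After in seconds or None)
TokenResponse = Tuple[int, Optional[str], Optional[float]]


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Delay before the next try after failed attempt `attempt` (0-based).

    The ceiling doubles with each attempt and is capped at `cap`;
    the actual delay is a random fraction of that ceiling (full jitter).
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def fetch_token_with_backoff(request_token: Callable[[], TokenResponse],
                             max_attempts: int = 6,
                             base: float = 1.0,
                             cap: float = 60.0,
                             sleep: Callable[[float], None] = time.sleep,
                             rng: Callable[[], float] = random.random) -> str:
    """Call `request_token` until it succeeds, waiting longer after each failure."""
    for attempt in range(max_attempts):
        status, token, retry_after = request_token()
        if status == 200 and token is not None:
            return token
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        delay = backoff_delay(attempt, base, cap, rng)
        # If the server says how long to wait on a 429, never retry sooner.
        if status == 429 and retry_after is not None:
            delay = max(delay, retry_after)
        sleep(delay)
    raise RuntimeError(f"could not obtain token after {max_attempts} attempts")
```

In this sketch `request_token` would wrap the real call to the /oauth/token endpoint. Passing `sleep` and `rng` as parameters keeps the retry logic testable without real waiting or randomness.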

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Without this solution we can easily get into an infinite loop trying to get new tokens and always hitting the rate limit in SaaS.

@1nb0und 1nb0und added this to the 8.4.0 milestone Dec 18, 2023
@1nb0und 1nb0und self-assigned this Dec 18, 2023
@spalberg

We have also observed this multiple times, even without using connectors. To mitigate it, we had to scale down all job worker deployments in all our clusters, which resulted in production downtime.

@1nb0und 1nb0und modified the milestones: 8.4.0, 8.4.1 Jan 16, 2024

3 participants