Ratelimiter polish / fix: improve zero -> nonzero filling behavior for new ratelimiters #6280

Groxx · 2024-09-12T02:10:36Z

Motivation:

The global ratelimiter system was exhibiting some weird request-rejection at very low RPS usage.
On our dashboards it looks like this:

Previously I thought this was just due to undesirably-low weights, and #6238 addressed that (and is still a useful addition).

After that was rolled out, behavior improved, but small numbers still occurred... which should not have happened because the "boosting" logic should have meant that the global limits were at least identical, and likely larger.

Which drove me to re-read the details and think harder. And then I found this PR's issue.

Issue and fix

What was happening is that the initial rate.NewLimiter(0,0) detail was "leaking" into limits after the first update, so a request that occurred immediately after would likely be rejected, regardless of the configured limit.

This happens because (0, 0) creates a zero-burst limit on the "primary" limiter, and the shadowed .Allow() calls were advancing the limiter's internal "now" value...
... and then when the limit and burst were increased, the limiter would have to fill from zero.

This put it in a worse position than local / fallback limiters, which start from (local, local) with a zero "now" value, and then the next .Allow() is basically guaranteed to fill the token bucket due to many years "elapsing".
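
A minimal sketch of that sequence, assuming a deliberately simplified token bucket rather than the real `rate.Limiter` or the wrapper in `common/clock/ratelimiter.go`. The `bucket` type, its methods, and the 10/10 values are hypothetical and exist only to make the "fill from zero" effect visible:

```go
package main

import (
	"fmt"
	"time"
)

// bucket is a simplified token bucket used only for illustration.
// It is NOT the code of rate.Limiter or of the wrapper changed in this PR.
type bucket struct {
	limit, burst, tokens float64
	last                 time.Time // internal "now"; zero means never used
}

// advance credits tokens for the time elapsed since last, capped at burst.
func (b *bucket) advance(now time.Time) {
	if now.After(b.last) {
		b.tokens += b.limit * now.Sub(b.last).Seconds()
		b.last = now
	}
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
}

// Allow advances time, then takes one token if one is available.
func (b *bucket) Allow(now time.Time) bool {
	b.advance(now)
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// Set advances time under the OLD values, then stores the new ones.
func (b *bucket) Set(now time.Time, limit, burst float64) {
	b.advance(now)
	b.limit, b.burst = limit, burst
}

// start is an arbitrary wall-clock time for the demo.
var start = time.Date(2024, 9, 12, 0, 0, 0, 0, time.UTC)

// primaryAllowsAfterUpdate replays the startup sequence described above.
func primaryAllowsAfterUpdate() bool {
	primary := &bucket{}                        // like NewLimiter(0, 0)
	primary.Allow(start)                        // shadowed call moves "now" off zero
	primary.Set(start.Add(time.Second), 10, 10) // first Update()
	// One millisecond later only 0.01 tokens have accumulated.
	return primary.Allow(start.Add(time.Second + time.Millisecond))
}

// fallbackAllowsFirstCall models a limiter that starts at (local, local).
func fallbackAllowsFirstCall() bool {
	fallback := &bucket{limit: 10, burst: 10} // zero "now", never advanced
	// Centuries "elapse" since the zero time, so the bucket fills to burst.
	return fallback.Allow(start.Add(time.Second + time.Millisecond))
}

func main() {
	fmt.Println("primary allowed:", primaryAllowsAfterUpdate())  // false
	fmt.Println("fallback allowed:", fallbackAllowsFirstCall()) // true
}
```

In this model the primary rejects its first real request even though the configured limit is 10, while the fallback with identical values accepts it.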

So the fix has two parts:

1: Avoid advancing the zero-valued limiter's internal time until a reasonable limit/burst has been set.
This is done by simply not calling it while in startup mode.

2: Avoid advancing limiters' time when setting limit and burst.
This means that after an idle period -> Update() -> Allow(), tokens will fill as if the new values were set all along, and the setters can be called in any order.

The underlying rate.Limiter does not do the second: it advances time when setting these... but that seems undesirable.
It means old values are preferred (which is reasonable, they were set when that time passed), and it means that the order in which you set burst and limit has a significant impact on the outcome, even with the same values and the same timing: time passes only on the first call, so the second sees basically zero elapsed time and has no immediate effect at all (unless it lowers burst). I can only see that latter part as surprising, and definitely worth avoiding.
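
Part 2 of the fix can be sketched with a simplified token bucket (again a hypothetical `bucket` type, not the real `rate.Limiter`). It only compares crediting idle time under the old values against leaving time untouched in the setters; it does not try to reproduce the real limiter's order-dependence:

```go
package main

import (
	"fmt"
	"time"
)

// bucket is a simplified token bucket for illustration only.
type bucket struct {
	limit, burst, tokens float64
	last                 time.Time
}

// tokensAt credits elapsed time at the current limit, caps at the current
// burst, and reports the tokens available at now.
func (b *bucket) tokensAt(now time.Time) float64 {
	if now.After(b.last) {
		b.tokens += b.limit * now.Sub(b.last).Seconds()
		b.last = now
	}
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	return b.tokens
}

// tokensAfterIdleUpdate models: an empty (1, 1) bucket idles for one minute,
// is raised to (10, 10), and is read immediately afterwards.
func tokensAfterIdleUpdate(advanceOnSet bool) float64 {
	t0 := time.Date(2024, 9, 12, 0, 0, 0, 0, time.UTC)
	t1 := t0.Add(time.Minute)
	b := &bucket{limit: 1, burst: 1, last: t0}
	if advanceOnSet {
		// Setter advances time first: the idle minute is credited
		// under the old (1, 1) values and capped at the old burst.
		b.tokensAt(t1)
	}
	// Without advancing, these two assignments can be swapped freely.
	b.burst = 10
	b.limit = 10
	return b.tokensAt(t1)
}

func main() {
	fmt.Println(tokensAfterIdleUpdate(true))  // 1
	fmt.Println(tokensAfterIdleUpdate(false)) // 10
}
```

With non-advancing setters the idle minute is credited under the new values, so the bucket is full on the next call, which matches "as if the new values were set all along".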

Alternative approach

2 seems worth keeping. But 1 has a relatively clear alternative:
Don't create the "primary" limiter until the first Update().

Because it's currently atomic-oriented, this can't be done safely without adding atomics or locks everywhere... so I didn't do that.
If I were to do this, I would just switch to a mutex; rate.Limiter already uses one internally, so it should be near zero cost.
I'm happy to build that if someone prefers, I just didn't bother this time.
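
For reference, a rough sketch of what that mutex-based alternative might look like. Everything here is hypothetical (`limiter`, `counter`, `lazyPrimary` are illustrative names, not types from this PR), and it omits the shadowing of calls to both limiters:

```go
package main

import (
	"fmt"
	"sync"
)

// limiter is a hypothetical stand-in for whatever limiter type is wrapped.
type limiter interface {
	Allow() bool
	Set(limit float64, burst int)
}

// counter is a toy limiter: it allows `remaining` calls and never refills.
type counter struct{ remaining int }

func (c *counter) Allow() bool {
	if c.remaining <= 0 {
		return false
	}
	c.remaining--
	return true
}

func (c *counter) Set(_ float64, burst int) { c.remaining = burst }

// lazyPrimary creates its primary limiter only on the first Update.
type lazyPrimary struct {
	mu       sync.Mutex
	fallback limiter
	primary  limiter // nil while in startup mode
}

// Update creates the primary on first use, then applies the new values.
func (l *lazyPrimary) Update(limit float64, burst int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.primary == nil {
		l.primary = &counter{} // no (0, 0) limiter ever receives a call
	}
	l.primary.Set(limit, burst)
}

// Allow uses only the fallback until a primary exists.
func (l *lazyPrimary) Allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.primary == nil {
		return l.fallback.Allow()
	}
	return l.primary.Allow()
}

// demo returns the results of two startup calls and one call after Update.
func demo() []bool {
	l := &lazyPrimary{fallback: &counter{remaining: 1}}
	first := l.Allow()  // fallback has one token
	second := l.Allow() // fallback is exhausted
	l.Update(5, 2)      // primary is created here
	third := l.Allow()  // served by the freshly created primary
	return []bool{first, second, third}
}

func main() {
	fmt.Println(demo()) // [true false true]
}
```

The single mutex guards both the nil check and the limiter call, which is the part that is hard to do safely with atomics alone.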

davidporter-id-au

Ty for the comments, they're certainly helpful

codecov · 2024-09-12T02:39:58Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.10%. Comparing base (e5bd91e) to head (b054418).
Report is 1 commit behind head on master.

Additional details and impacted files

| Files with missing lines | Coverage Δ | |
|---|---|---|
| common/clock/ratelimiter.go | `100.00% <100.00%> (ø)` | |
| ...mmon/quotas/global/collection/internal/fallback.go | `96.66% <100.00%> (+0.17%)` | ⬆️ |

... and 5 files with indirect coverage changes

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Groxx requested review from Shaddoll, neil-xie, davidporter-id-au, shijiesheng, agautam478, jakobht, 3vilhamster, sankari165, dkrotx, taylanisikdemir and demirkayaender as code owners September 12, 2024 02:10

davidporter-id-au approved these changes Sep 12, 2024

minor comment

b054418

Groxx enabled auto-merge (squash) September 12, 2024 02:47

Groxx merged commit 04add2d into uber:master Sep 12, 2024
20 checks passed

Groxx deleted the limiter-polish branch September 12, 2024 22:54
