Ratelimiter polish / fix: improve zero -> nonzero filling behavior for new ratelimiters #6280
Motivation:
The global ratelimiter system was exhibiting some weird request rejections at very low RPS usage.
On our dashboards it looks like this:
Previously I thought this was just due to undesirably-low weights, and #6238 addressed that (and is still a useful addition).
After that was rolled out, behavior improved, but a small number of rejections still occurred... which should not have happened, because the "boosting" logic should have meant that the global limits were at least identical to the local ones, and likely larger.
Which drove me to re-read the details and think harder. And then I found this PR's issue.
Issue and fix
What was happening is that the initial `rate.NewLimiter(0, 0)` detail was "leaking" into limits after the first update, so a request that occurred immediately after would likely be rejected, regardless of the configured limit.

This happens because `(0, 0)` creates a zero-burst limit on the "primary" limiter, and the shadowed `.Allow()` calls were advancing the limiter's internal "now" value... and then when the limit and burst were increased, the limiter would have to fill from zero.

This put it in a worse position than local / fallback limiters, which start from `(local, local)` with a zero "now" value, and then the next `.Allow()` is basically guaranteed to fill the token bucket due to many years "elapsing".
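To make the failure mode concrete, here is a standalone sketch against `golang.org/x/time/rate` (not the actual Cadence wrapper, so the shadowed-call details are simplified): a limiter that starts at `(0, 0)` and is later raised still has to fill its bucket from zero, while one created directly at the target values allows a request immediately.

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	// Mimic the "primary" limiter: created at (0, 0), then raised to (1, 1)
	// on the first update.
	primary := rate.NewLimiter(0, 0)
	primary.SetLimit(1) // advances internal time using the old (zero) rate...
	primary.SetBurst(1) // ...so zero tokens have accumulated so far
	fmt.Println("primary allows immediately:", primary.Allow()) // false: fills from zero

	// Mimic a local / fallback limiter: created directly at (1, 1),
	// so the first call effectively sees a full bucket.
	fallback := rate.NewLimiter(1, 1)
	fmt.Println("fallback allows immediately:", fallback.Allow()) // true
}
```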
So the fix has two parts:
1: Avoid advancing the zero-valued limiter's internal time until a reasonable limit/burst has been set. This is done by simply not calling it while in startup mode.
2: Avoid advancing limiters' time when setting limit and burst. This means that after an idle period -> `Update()` -> `Allow()`, tokens will fill as if the new values were set all along, and the setters can be called in any order.

The underlying `rate.Limiter` does not do the second: it advances time when setting these... but that seems undesirable. Advancing means the old values are preferred for the time that already passed (which is reasonable, they were set when that time passed), but it also means that the order in which you set burst and limit has a significant impact on the outcome, even with the same values and the same timing: time passes only on the first call, so the second call sees essentially zero elapsed time and has no immediate effect at all (unless lowering burst). I can only see that latter part as surprising, and definitely worth avoiding.
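As a rough illustration of part 2 (not the code in this PR), a wrapper could remember the last timestamp it handed to the wrapped `rate.Limiter` and reuse it in the setters, so changing limit or burst never advances time. The names here (`Ratelimiter`, `latestNow`) are made up for the sketch.

```go
package ratelimiter

import (
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// Ratelimiter is a hypothetical wrapper sketching fix 2: the setters reuse the
// last timestamp the wrapped limiter has already seen, so changing limit or
// burst does not advance the limiter's internal time.
type Ratelimiter struct {
	mu        sync.Mutex
	limiter   *rate.Limiter
	latestNow time.Time // last timestamp passed to the wrapped limiter
}

func New(limit rate.Limit, burst int) *Ratelimiter {
	return &Ratelimiter{limiter: rate.NewLimiter(limit, burst)}
}

func (r *Ratelimiter) Allow() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if now := time.Now(); now.After(r.latestNow) {
		r.latestNow = now
	}
	return r.limiter.AllowN(r.latestNow, 1)
}

// SetLimit changes the refill rate without advancing time, so tokens that
// accrue before the next Allow() fill as if the new rate had been set all along.
func (r *Ratelimiter) SetLimit(limit rate.Limit) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.limiter.SetLimitAt(r.latestNow, limit)
}

// SetBurst changes the burst size the same way, so calling the two setters in
// either order produces the same result.
func (r *Ratelimiter) SetBurst(burst int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.limiter.SetBurstAt(r.latestNow, burst)
}
```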
Alternative approach
Fix 2 seems worth keeping, but fix 1 has a relatively clear alternative: don't create the "primary" limiter until the first `Update()`.

Because the current code is atomic-oriented, this can't be done safely without adding atomics or locks everywhere... so I didn't do that. If I were to do it, I would just switch to a mutex; `rate.Limiter` already uses one internally, so it should be near zero cost. I'm happy to build that if someone prefers, I just didn't bother this time.
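For comparison, this is roughly what that alternative could look like; it is a sketch only, with hypothetical names: the primary limiter stays nil until the first `Update()`, and everything is guarded by one mutex.

```go
package ratelimiter

import (
	"sync"

	"golang.org/x/time/rate"
)

// globalLimiter sketches the lazy-creation alternative: no "primary" limiter
// exists until the first Update(), so there is no zero-valued limiter whose
// state can leak into later limits. All names here are hypothetical.
type globalLimiter struct {
	mu       sync.Mutex
	primary  *rate.Limiter // nil until the first Update()
	fallback *rate.Limiter // local fallback, created at (local, local)
}

func (g *globalLimiter) Update(limit rate.Limit, burst int) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.primary == nil {
		// First update: create directly at the target values; the next Allow()
		// then effectively sees a full bucket instead of filling from zero.
		g.primary = rate.NewLimiter(limit, burst)
		return
	}
	g.primary.SetLimit(limit)
	g.primary.SetBurst(burst)
}

func (g *globalLimiter) Allow() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.primary == nil {
		// Still in "startup mode": only the fallback limiter is consulted.
		return g.fallback.Allow()
	}
	return g.primary.Allow()
}
```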