fix(router): avoid blocking token validation across JWKS providers#2974
fix(router): avoid blocking token validation across JWKS providers#2974ebachle wants to merge 1 commit into
Conversation
When more than one JWKS provider is configured and refresh_unknown_kid is enabled, a token signed by any provider other than the first-listed one incurred an avoidable blocking delay on the request path. The decoder tried each provider sequentially, and a provider that did not own the token's kid would block on its rate limiter and perform a live HTTP refresh (bounded by max_wait, default 2m) before reporting "key not found" and letting the next provider try. Because providers have disjoint kid sets by design, a token owned by a later provider always missed earlier providers' caches and paid that penalty, scaling with the number of providers listed before the owner. Add a non-blocking, refresh-free cache pre-pass to keyFuncWrapper: when the token carries a kid, consult each provider's in-memory storage directly. If a provider already holds the kid (the steady state for known/rotated keys), validate against it and return without ever touching another provider's rate limiter or slow endpoint. Tokens without a kid, and kids no provider has cached, fall through to the existing sequential loop, so genuinely-unknown kids keep their current behavior and HTTP refresh volume is unchanged. The per-entry audience/algorithm validation is extracted into tryEntry so the pre-pass and the fallback loop share identical semantics. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
WalkthroughAdds a ChangesNon-blocking KID cache pre-pass for multi-provider JWKS
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.12.2)level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
router-tests/security/authentication_test.go (1)
1672-1673: 💤 Low valueConsider tightening the timing assertion for stronger test confidence.
With both servers set to 1-second respond times and sequential fallback execution, the expected elapsed time is ~2 seconds. The current assertion
>= 700msis loose enough that it would pass even if only one provider was queried. A tighter bound (e.g.,>= 1500*time.Millisecond) would provide stronger evidence that both providers were actually invoked.That said, the
http.StatusUnauthorizedcheck does confirm the overall behavior, so this is optional.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@router-tests/security/authentication_test.go` around lines 1672 - 1673, The timing assertion in the require.GreaterOrEqual call for the elapsed duration is too loose with a 700 millisecond threshold. Since both servers are configured with 1-second response times and execute sequentially, the expected total elapsed time should be around 2 seconds, meaning the assertion should be tightened to approximately 1500 milliseconds or higher to properly verify that both providers are actually being invoked and not just one. Update the 700*time.Millisecond value in the GreaterOrEqual assertion to a tighter bound that better reflects the expected sequential execution of multiple providers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@router-tests/security/authentication_test.go`:
- Around line 1672-1673: The timing assertion in the require.GreaterOrEqual call
for the elapsed duration is too loose with a 700 millisecond threshold. Since
both servers are configured with 1-second response times and execute
sequentially, the expected total elapsed time should be around 2 seconds,
meaning the assertion should be tightened to approximately 1500 milliseconds or
higher to properly verify that both providers are actually being invoked and not
just one. Update the 700*time.Millisecond value in the GreaterOrEqual assertion
to a tighter bound that better reflects the expected sequential execution of
multiple providers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3d10faf4-4085-4c03-9633-35ccd7028101
📒 Files selected for processing (2)
router-tests/security/authentication_test.gorouter/pkg/authentication/jwks_token_decoder.go
Summary
When the router is configured with more than one JWKS provider under
authentication.jwt.jwksandrefresh_unknown_kidis enabled, a token signed by any provider other than the first-listed one incurred an avoidable, rate-limiter-bounded blocking delay on the request path — up torefresh_unknown_kid.max_wait(default 2m) per request under sustained load.Why it happened
keyFuncWrapperinrouter/pkg/authentication/jwks_token_decoder.govalidated a token by trying each configured provider sequentially, in config order. On a kid cache miss withrefresh_unknown_kidenabled, the underlyingjwksetclient blocks on its rate limiter and performs a live HTTP refresh before reporting "key not found" and letting the next provider try.Because two providers with separate identity sources have disjoint
kidsets by design, a token signed by provider #2 always missed provider #1's cache and tripped provider #1's blocking unknown-kid path. The cost scaled with the number of providers listed before the one that owns the key.The fix
Add a non-blocking, refresh-free cache pre-pass to
keyFuncWrapper. When a token carries akid, each provider's in-memory storage is consulted directly (Storage.KeyRead, an in-memory lookup that never triggers the rate-limited refresh). If a provider already holds thekid— the steady state for known / already-rotated keys — the token is validated against that provider and returned without ever touching another provider's rate limiter or slow endpoint.Tokens without a
kid, and kids that no provider has cached, fall through to the existing sequential loop, so:The per-entry audience/algorithm validation is extracted into a
tryEntryhelper so the pre-pass and the fallback loop share identical semantics.How to test
New integration tests live in
router-tests/security/authentication_test.gounderTestHttpJwksAuthorization.The tests drive the latency deterministically: a non-owning provider is made slow via
Server.SetRespondTime, withmax_wait > intervalso the limiter proceeds into the (slow) refresh rather than short-circuiting. The measured request asserts validation of a later-provider key completes in< 500ms.Notes
iss-based provider routing (also fixes the genuinely-unknown-kid and garbage-token cases) and an opt-in async "trigger-and-return" refresh. Both would change config surface and/or semantics and are better tracked separately.🤖 Generated with Claude Code
Summary by CodeRabbit
Tests
Bug Fixes