Skip to content

fix(router): avoid blocking token validation across JWKS providers#2974

Open
ebachle wants to merge 1 commit into
wundergraph:mainfrom
ebachle:fix/jwks-multi-provider-unknown-kid-blocking
Open

fix(router): avoid blocking token validation across JWKS providers#2974
ebachle wants to merge 1 commit into
wundergraph:mainfrom
ebachle:fix/jwks-multi-provider-unknown-kid-blocking

Conversation

@ebachle

@ebachle ebachle commented Jun 15, 2026

Copy link
Copy Markdown

Summary

When the router is configured with more than one JWKS provider under authentication.jwt.jwks and refresh_unknown_kid is enabled, a token signed by any provider other than the first-listed one incurred an avoidable, rate-limiter-bounded blocking delay on the request path — up to refresh_unknown_kid.max_wait (default 2m) per request under sustained load.

Why it happened

keyFuncWrapper in router/pkg/authentication/jwks_token_decoder.go validated a token by trying each configured provider sequentially, in config order. On a kid cache miss with refresh_unknown_kid enabled, the underlying jwkset client blocks on its rate limiter and performs a live HTTP refresh before reporting "key not found" and letting the next provider try.

Because two providers with separate identity sources have disjoint kid sets by design, a token signed by provider #2 always missed provider #1's cache and tripped provider #1's blocking unknown-kid path. The cost scaled with the number of providers listed before the one that owns the key.

The fix

Add a non-blocking, refresh-free cache pre-pass to keyFuncWrapper. When a token carries a kid, each provider's in-memory storage is consulted directly (Storage.KeyRead, an in-memory lookup that never triggers the rate-limited refresh). If a provider already holds the kid — the steady state for known / already-rotated keys — the token is validated against that provider and returned without ever touching another provider's rate limiter or slow endpoint.

Tokens without a kid, and kids that no provider has cached, fall through to the existing sequential loop, so:

  • genuinely-unknown kids keep their current behavior (still rate-limited + refreshed), and
  • JWKS HTTP refresh volume is unchanged (the rate limiter still governs real refreshes).

The per-entry audience/algorithm validation is extracted into a tryEntry helper so the pre-pass and the fallback loop share identical semantics.

How to test

New integration tests live in router-tests/security/authentication_test.go under TestHttpJwksAuthorization.

cd router-tests

# The two regression cases. On the previous behavior these FAIL on latency
# (~2s for a 2-provider setup, ~4s for 3 providers); with this change they pass in <500ms.
go test ./security/ -run 'TestHttpJwksAuthorization/(known_kid_owned_by_a_later_provider|known_kid_owned_by_the_third_provider)' -v -count=1

# Guards that behavior is otherwise unchanged:
#  - a key owned by the FIRST provider still validates fast (no regression)
#  - a genuinely-unknown kid still blocks (>=700ms), i.e. semantics preserved
go test ./security/ -run 'TestHttpJwksAuthorization/(known_kid_owned_by_the_first_provider|genuinely_unknown_kid_still_blocks)' -v -count=1

# Full auth suite — no regressions:
go test ./security/ -run 'TestAuthentication|TestHttpJwksAuthorization|TestNonHttpAuthorization|TestAudienceValidation|TestAlgorithmMismatch|TestSupportedAlgorithms|TestUseCustomization|TestMultipleKeys|TestOidcDiscovery|TestIntrospectionAuthentication' -count=1

The tests drive the latency deterministically: a non-owning provider is made slow via Server.SetRespondTime, with max_wait > interval so the limiter proceeds into the (slow) refresh rather than short-circuiting. The measured request asserts validation of a later-provider key completes in < 500ms.

Notes

  • No config surface change. This is an internal latency fix; existing YAML/behavior for users is unchanged apart from the removed blocking.
  • Follow-ups discussed but intentionally out of scope here: issuer/iss-based provider routing (also fixes the genuinely-unknown-kid and garbage-token cases) and an opt-in async "trigger-and-return" refresh. Both would change config surface and/or semantics and are better tracked separately.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests

    • Added comprehensive test cases for authentication behavior with multiple token providers and variable endpoint performance.
  • Bug Fixes

    • Improved token validation efficiency with optimized provider lookup and caching, reducing unnecessary blocking operations when handling multiple authentication providers.

When more than one JWKS provider is configured and refresh_unknown_kid is
enabled, a token signed by any provider other than the first-listed one
incurred an avoidable blocking delay on the request path. The decoder tried
each provider sequentially, and a provider that did not own the token's kid
would block on its rate limiter and perform a live HTTP refresh (bounded by
max_wait, default 2m) before reporting "key not found" and letting the next
provider try. Because providers have disjoint kid sets by design, a token
owned by a later provider always missed earlier providers' caches and paid
that penalty, scaling with the number of providers listed before the owner.

Add a non-blocking, refresh-free cache pre-pass to keyFuncWrapper: when the
token carries a kid, consult each provider's in-memory storage directly. If a
provider already holds the kid (the steady state for known/rotated keys),
validate against it and return without ever touching another provider's rate
limiter or slow endpoint. Tokens without a kid, and kids no provider has
cached, fall through to the existing sequential loop, so genuinely-unknown
kids keep their current behavior and HTTP refresh volume is unchanged.

The per-entry audience/algorithm validation is extracted into tryEntry so the
pre-pass and the fallback loop share identical semantics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ebachle ebachle requested a review from a team as a code owner June 15, 2026 23:04
@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

Adds a storage field to keyFuncEntry and wires it for all provider types. The JWT Keyfunc gains a non-blocking cache pre-pass: for tokens with a kid, it calls KeyRead on each provider's raw storage and skips providers that don't cache that key, falling back to the existing sequential loop only when needed. Per-provider audience and algorithm validation is extracted into a new tryEntry helper. Four timing-sensitive integration subtests validate the new behavior across multi-provider configurations.

Changes

Non-blocking KID cache pre-pass for multi-provider JWKS

Layer / File(s) Summary
keyFuncEntry storage field, Keyfunc pre-pass, and tryEntry helper
router/pkg/authentication/jwks_token_decoder.go
Adds storage jwkset.Storage to keyFuncEntry and populates it for JWKS URL and secret-derived providers. Rewrites the JWT Keyfunc to do a non-blocking KeyRead pre-pass per provider, skipping those that don't cache the token's kid before any blocking refresh. Extracts tryEntry to centralize audience, algorithm, and signing-key validation used by both the pre-pass and the sequential fallback.
Multi-provider RefreshUnknownKID timing tests
router-tests/security/authentication_test.go
Adds four subtests: owning provider listed second (no blocking), owning provider listed first with slow second provider (fast path), genuinely unknown kid (blocking refresh asserted), and owning provider listed third with two slow prior providers (delay must not scale with position).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title directly summarizes the main change: avoiding blocking token validation across JWKS providers, which is the core performance fix described in the PR objectives.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
router-tests/security/authentication_test.go (1)

1672-1673: 💤 Low value

Consider tightening the timing assertion for stronger test confidence.

With both servers set to 1-second respond times and sequential fallback execution, the expected elapsed time is ~2 seconds. The current assertion >= 700ms is loose enough that it would pass even if only one provider was queried. A tighter bound (e.g., >= 1500*time.Millisecond) would provide stronger evidence that both providers were actually invoked.

That said, the http.StatusUnauthorized check does confirm the overall behavior, so this is optional.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@router-tests/security/authentication_test.go` around lines 1672 - 1673, The
timing assertion in the require.GreaterOrEqual call for the elapsed duration is
too loose with a 700 millisecond threshold. Since both servers are configured
with 1-second response times and execute sequentially, the expected total
elapsed time should be around 2 seconds, meaning the assertion should be
tightened to approximately 1500 milliseconds or higher to properly verify that
both providers are actually being invoked and not just one. Update the
700*time.Millisecond value in the GreaterOrEqual assertion to a tighter bound
that better reflects the expected sequential execution of multiple providers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@router-tests/security/authentication_test.go`:
- Around line 1672-1673: The timing assertion in the require.GreaterOrEqual call
for the elapsed duration is too loose with a 700 millisecond threshold. Since
both servers are configured with 1-second response times and execute
sequentially, the expected total elapsed time should be around 2 seconds,
meaning the assertion should be tightened to approximately 1500 milliseconds or
higher to properly verify that both providers are actually being invoked and not
just one. Update the 700*time.Millisecond value in the GreaterOrEqual assertion
to a tighter bound that better reflects the expected sequential execution of
multiple providers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3d10faf4-4085-4c03-9633-35ccd7028101

📥 Commits

Reviewing files that changed from the base of the PR and between 1658315 and 8231222.

📒 Files selected for processing (2)
  • router-tests/security/authentication_test.go
  • router/pkg/authentication/jwks_token_decoder.go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant