fix(router): avoid blocking token validation across JWKS providers by ebachle · Pull Request #2974 · wundergraph/cosmo

ebachle · 2026-06-15T23:04:46Z

Summary

When the router is configured with more than one JWKS provider under authentication.jwt.jwks and refresh_unknown_kid is enabled, a token signed by any provider other than the first-listed one incurred an avoidable, rate-limiter-bounded blocking delay on the request path — up to refresh_unknown_kid.max_wait (default 2m) per request under sustained load.

Why it happened

keyFuncWrapper in router/pkg/authentication/jwks_token_decoder.go validated a token by trying each configured provider sequentially, in config order. On a kid cache miss with refresh_unknown_kid enabled, the underlying jwkset client blocks on its rate limiter and performs a live HTTP refresh before reporting "key not found" and letting the next provider try.

Because two providers with separate identity sources have disjoint kid sets by design, a token signed by provider #2 always missed provider #1's cache and tripped provider #1's blocking unknown-kid path. The cost scaled with the number of providers listed before the one that owns the key.

The fix

Add a non-blocking, refresh-free cache pre-pass to keyFuncWrapper. When a token carries a kid, each provider's in-memory storage is consulted directly (Storage.KeyRead, an in-memory lookup that never triggers the rate-limited refresh). If a provider already holds the kid — the steady state for known / already-rotated keys — the token is validated against that provider and returned without ever touching another provider's rate limiter or slow endpoint.

Tokens without a kid, and kids that no provider has cached, fall through to the existing sequential loop, so:

genuinely-unknown kids keep their current behavior (still rate-limited + refreshed), and
JWKS HTTP refresh volume is unchanged (the rate limiter still governs real refreshes).

The per-entry audience/algorithm validation is extracted into a tryEntry helper so the pre-pass and the fallback loop share identical semantics.

How to test

New integration tests live in router-tests/security/authentication_test.go under TestHttpJwksAuthorization.

cd router-tests

# The two regression cases. On the previous behavior these FAIL on latency
# (~2s for a 2-provider setup, ~4s for 3 providers); with this change they pass in <500ms.
go test ./security/ -run 'TestHttpJwksAuthorization/(known_kid_owned_by_a_later_provider|known_kid_owned_by_the_third_provider)' -v -count=1

# Guards that behavior is otherwise unchanged:
#  - a key owned by the FIRST provider still validates fast (no regression)
#  - a genuinely-unknown kid still blocks (>=700ms), i.e. semantics preserved
go test ./security/ -run 'TestHttpJwksAuthorization/(known_kid_owned_by_the_first_provider|genuinely_unknown_kid_still_blocks)' -v -count=1

# Full auth suite — no regressions:
go test ./security/ -run 'TestAuthentication|TestHttpJwksAuthorization|TestNonHttpAuthorization|TestAudienceValidation|TestAlgorithmMismatch|TestSupportedAlgorithms|TestUseCustomization|TestMultipleKeys|TestOidcDiscovery|TestIntrospectionAuthentication' -count=1

The tests drive the latency deterministically: a non-owning provider is made slow via Server.SetRespondTime, with max_wait > interval so the limiter proceeds into the (slow) refresh rather than short-circuiting. The measured request asserts validation of a later-provider key completes in < 500ms.

Notes

No config surface change. This is an internal latency fix; existing YAML/behavior for users is unchanged apart from the removed blocking.
Follow-ups discussed but intentionally out of scope here: issuer/iss-based provider routing (also fixes the genuinely-unknown-kid and garbage-token cases) and an opt-in async "trigger-and-return" refresh. Both would change config surface and/or semantics and are better tracked separately.

🤖 Generated with Claude Code

Summary by CodeRabbit

Tests
- Added comprehensive test cases for authentication behavior with multiple token providers and variable endpoint performance.
Bug Fixes
- Improved token validation efficiency with optimized provider lookup and caching, reducing unnecessary blocking operations when handling multiple authentication providers.

When more than one JWKS provider is configured and refresh_unknown_kid is enabled, a token signed by any provider other than the first-listed one incurred an avoidable blocking delay on the request path. The decoder tried each provider sequentially, and a provider that did not own the token's kid would block on its rate limiter and perform a live HTTP refresh (bounded by max_wait, default 2m) before reporting "key not found" and letting the next provider try. Because providers have disjoint kid sets by design, a token owned by a later provider always missed earlier providers' caches and paid that penalty, scaling with the number of providers listed before the owner. Add a non-blocking, refresh-free cache pre-pass to keyFuncWrapper: when the token carries a kid, consult each provider's in-memory storage directly. If a provider already holds the kid (the steady state for known/rotated keys), validate against it and return without ever touching another provider's rate limiter or slow endpoint. Tokens without a kid, and kids no provider has cached, fall through to the existing sequential loop, so genuinely-unknown kids keep their current behavior and HTTP refresh volume is unchanged. The per-entry audience/algorithm validation is extracted into tryEntry so the pre-pass and the fallback loop share identical semantics. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-15T23:05:10Z

Walkthrough

Adds a storage field to keyFuncEntry and wires it for all provider types. The JWT Keyfunc gains a non-blocking cache pre-pass: for tokens with a kid, it calls KeyRead on each provider's raw storage and skips providers that don't cache that key, falling back to the existing sequential loop only when needed. Per-provider audience and algorithm validation is extracted into a new tryEntry helper. Four timing-sensitive integration subtests validate the new behavior across multi-provider configurations.

Changes

Non-blocking KID cache pre-pass for multi-provider JWKS

Layer / File(s)	Summary
`keyFuncEntry` storage field, Keyfunc pre-pass, and `tryEntry` helper `router/pkg/authentication/jwks_token_decoder.go`	Adds `storage jwkset.Storage` to `keyFuncEntry` and populates it for JWKS URL and secret-derived providers. Rewrites the JWT `Keyfunc` to do a non-blocking `KeyRead` pre-pass per provider, skipping those that don't cache the token's `kid` before any blocking refresh. Extracts `tryEntry` to centralize audience, algorithm, and signing-key validation used by both the pre-pass and the sequential fallback.
Multi-provider `RefreshUnknownKID` timing tests `router-tests/security/authentication_test.go`	Adds four subtests: owning provider listed second (no blocking), owning provider listed first with slow second provider (fast path), genuinely unknown `kid` (blocking refresh asserted), and owning provider listed third with two slow prior providers (delay must not scale with position).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title directly summarizes the main change: avoiding blocking token validation across JWKS providers, which is the core performance fix described in the PR objectives.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

router-tests/security/authentication_test.go (1)
1672-1673: 💤 Low value

Consider tightening the timing assertion for stronger test confidence.

With both servers set to 1-second respond times and sequential fallback execution, the expected elapsed time is ~2 seconds. The current assertion >= 700ms is loose enough that it would pass even if only one provider was queried. A tighter bound (e.g., >= 1500*time.Millisecond) would provide stronger evidence that both providers were actually invoked.

That said, the http.StatusUnauthorized check does confirm the overall behavior, so this is optional.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@router-tests/security/authentication_test.go` around lines 1672 - 1673, The
timing assertion in the require.GreaterOrEqual call for the elapsed duration is
too loose with a 700 millisecond threshold. Since both servers are configured
with 1-second response times and execute sequentially, the expected total
elapsed time should be around 2 seconds, meaning the assertion should be
tightened to approximately 1500 milliseconds or higher to properly verify that
both providers are actually being invoked and not just one. Update the
700*time.Millisecond value in the GreaterOrEqual assertion to a tighter bound
that better reflects the expected sequential execution of multiple providers.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@router-tests/security/authentication_test.go`:
- Around line 1672-1673: The timing assertion in the require.GreaterOrEqual call
for the elapsed duration is too loose with a 700 millisecond threshold. Since
both servers are configured with 1-second response times and execute
sequentially, the expected total elapsed time should be around 2 seconds,
meaning the assertion should be tightened to approximately 1500 milliseconds or
higher to properly verify that both providers are actually being invoked and not
just one. Update the 700*time.Millisecond value in the GreaterOrEqual assertion
to a tighter bound that better reflects the expected sequential execution of
multiple providers.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3d10faf4-4085-4c03-9633-35ccd7028101

📥 Commits

Reviewing files that changed from the base of the PR and between 1658315 and 8231222.

📒 Files selected for processing (2)

router-tests/security/authentication_test.go
router/pkg/authentication/jwks_token_decoder.go

ebachle requested a review from a team as a code owner June 15, 2026 23:04

github-actions Bot added the router label Jun 15, 2026

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(router): avoid blocking token validation across JWKS providers#2974

fix(router): avoid blocking token validation across JWKS providers#2974
ebachle wants to merge 1 commit into
wundergraph:mainfrom
ebachle:fix/jwks-multi-provider-unknown-kid-blocking

ebachle commented Jun 15, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

ebachle commented Jun 15, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why it happened

The fix

How to test

Notes

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ebachle commented Jun 15, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading