feat(auth): proactive OAuth token refresh with jitter to reduce concurrent refresh spikes#2859
Open
Bartok9 wants to merge 2 commits into
Open
Conversation
…rrent refresh spikes
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Refresh OAuth tokens proactively at ~80% of their lifetime with a small random jitter, instead of only reactively once they've already expired. This reduces the "thundering herd" of simultaneous token refreshes that occurs when a fleet of OAuth-backed MCP connectors is provisioned around the same time.
The production problem
When many MCP clients each hold an OAuth connection and were provisioned (or last refreshed) in roughly the same window, their access tokens all expire inside the same narrow window too. Today refresh only fires after
is_token_valid()returnsFalse(i.e. after hard expiry), so all of those clients try to refresh at nearly the same moment.For a large fleet that produces a synchronized burst of
grant_type=refresh_tokenrequests against the authorization server — contention, rate-limit (429) responses, and spurious auth failures, all clustered into the same ~60s window. The herd then re-synchronizes on the new tokens and the spike repeats on the next cycle.The design
Add a per-connection proactive refresh point that sits before hard expiry and is individually jittered, so a fleet desynchronizes naturally:
refresh_fraction= 0.8 by default → refresh once 80% of the lifetime has elapsed, leaving headroom before hard expiry.jitter∈ [0, 30s] by default, always subtracted so it can only pull the refresh point earlier — it can never push past hard expiry. Each connector draws its own jitter, so refreshes spread out across the window rather than bunching up.New pieces:
calculate_token_refresh_time(expires_in, *, refresh_fraction=0.8, max_jitter_seconds=30.0, jitter=None)insrc/mcp/shared/auth_utils.py— pure, deterministic-testable (injectjitterto bypass the RNG), returnsNonewhenexpires_inisNone.OAuthContext.token_refresh_timefield, set alongsidetoken_expiry_timeinupdate_token_expiryand cleared inclear_tokens.OAuthContext.should_refresh_token()—Truewhen we hold refreshable tokens and we're past the jittered proactive-refresh point, even if the token is still technically valid.async_auth_flowPhase 1 now refreshes when the token is hard-invalid ORshould_refresh_token()isTrue(whilecan_refresh_token()), keeping the existing re-check / lock structure intact.is_token_valid()is deliberately unchanged — it still gates whether a token is usable at all (hard validity). Proactive refresh is layered on top.Edge cases handled
expires_in is None→token_refresh_timeisNone;should_refresh_token()returnsFalseand behavior degrades to the existing reactive path.expires_insmaller thanmax_jitter_seconds): jitter is clamped to the available(refresh_at - now)window so the result never goes negative or beforenow.(now, hard_expiry].expires_in(some servers return it as a string): handled viaint()likecalculate_token_expiry.Backward compatibility
Fully backward compatible. No public signatures change; defaults preserve the current behavior shape (proactive refresh is strictly an improvement, not a breaking change). Clients that never got an
expires_inkeep the old reactive behavior exactly.Test coverage
tests/shared/test_auth_utils.py— 9 new tests forcalculate_token_refresh_time: normal TTL within the jitter window and strictly before hard expiry,None→None, stringexpires_in, deterministic injected jitter, jitter ordering (more jitter → earlier), never-past-hard-expiry across many TTLs, tiny-TTL no-negative, zero-TTL collapse, custom fraction.tests/client/test_auth.py—should_refresh_token()predicate (hard-valid-but-past-window → True; fresh → False; no refresh time → False; no refresh token → False), plus twoasync_auth_flowintegration tests: one proving a proactive refresh request is yielded while the token is still hard-valid, and one proving a fresh token is used directly with no refresh.All
tests/client/test_auth.py+tests/shared/test_auth_utils.pypass (128 passed, 1 xfailed).uv run ruff check,uv run ruff format --check, anduv run pyrightare clean on all touched files.Relationship to #2858
This is complementary to and independent of #2858. That PR addresses concurrency/locking of refresh (narrowing the
anyio.Lockscope + single-flightrefresh_lockto fix aRuntimeError). This PR is purely about when a refresh fires (proactive + jittered), not how it's locked. They touch different concerns and compose cleanly; if #2858 lands first this will need only a trivial rebase.Credit
Motivated by production feedback from @Ben-Home (CorpusIQ) on #2847. Refs #2847, #2858.