feat(auth): proactive OAuth token refresh with jitter to reduce concurrent refresh spikes by Bartok9 · Pull Request #2859 · modelcontextprotocol/python-sdk

Bartok9 · 2026-06-13T15:52:17Z

Summary

Refresh OAuth tokens proactively at ~80% of their lifetime with a small random jitter, instead of only reactively once they've already expired. This reduces the "thundering herd" of simultaneous token refreshes that occurs when a fleet of OAuth-backed MCP connectors is provisioned around the same time.

The production problem

When many MCP clients each hold an OAuth connection and were provisioned (or last refreshed) in roughly the same window, their access tokens all expire inside the same narrow window too. Today refresh only fires after is_token_valid() returns False (i.e. after hard expiry), so all of those clients try to refresh at nearly the same moment.

For a large fleet, that produces a synchronized burst of grant_type=refresh_token requests against the authorization server, causing contention, rate-limit (429) responses, and spurious auth failures, all clustered into the same ~60s window. The herd then re-synchronizes on the new tokens, and the spike repeats on the next cycle.

The design

Add a per-connection proactive refresh point that sits before hard expiry and is individually jittered, so a fleet desynchronizes naturally:

refresh_at = now + expires_in * refresh_fraction - jitter

refresh_fraction = 0.8 by default → refresh once 80% of the lifetime has elapsed, leaving headroom before hard expiry.
jitter ∈ [0, 30s] by default, always subtracted so it can only pull the refresh point earlier — it can never push past hard expiry. Each connector draws its own jitter, so refreshes spread out across the window rather than bunching up.
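
As a rough illustration of the formula above, the sketch below compares refresh offsets for a fleet whose tokens were all issued at the same instant. The helper name refresh_offset, the fleet size, and the one-hour TTL are assumptions made for illustration and are not code from this PR. The RNG is seeded only so the run is repeatable.

```python
import random


def refresh_offset(expires_in: float, refresh_fraction: float = 0.8, jitter: float = 0.0) -> float:
    """Seconds after issuance at which a proactive refresh would fire."""
    return expires_in * refresh_fraction - jitter


rng = random.Random(0)  # seeded only to make this sketch repeatable
expires_in = 3600.0     # hypothetical one-hour token lifetime
fleet_size = 1000       # hypothetical number of connectors

# Reactive: every client waits for hard expiry, so all offsets coincide.
reactive = [expires_in for _ in range(fleet_size)]

# Proactive with jitter: each client draws its own jitter in [0, 30] seconds.
proactive = [
    refresh_offset(expires_in, jitter=rng.uniform(0.0, 30.0))
    for _ in range(fleet_size)
]

print(len(set(reactive)))               # 1 distinct refresh instant
print(min(proactive), max(proactive))   # both fall between 2850 and 2880 seconds
```

With these defaults every refresh lands in the 30-second band that ends at 80% of the lifetime, which is 720 seconds before hard expiry for a one-hour token.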

New pieces:

calculate_token_refresh_time(expires_in, *, refresh_fraction=0.8, max_jitter_seconds=30.0, jitter=None) in src/mcp/shared/auth_utils.py — pure, deterministic-testable (inject jitter to bypass the RNG), returns None when expires_in is None.
OAuthContext.token_refresh_time field, set alongside token_expiry_time in update_token_expiry and cleared in clear_tokens.
OAuthContext.should_refresh_token() — True when we hold refreshable tokens and we're past the jittered proactive-refresh point, even if the token is still technically valid.
async_auth_flow Phase 1 now refreshes when the token is hard-invalid OR should_refresh_token() is True (while can_refresh_token()), keeping the existing re-check / lock structure intact.

is_token_valid() is deliberately unchanged — it still gates whether a token is usable at all (hard validity). Proactive refresh is layered on top.
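
The layering described above can be sketched roughly as follows. This is not the PR's implementation: SketchContext is a hypothetical stand-in for OAuthContext, phase1_needs_refresh is a hypothetical helper, and the explicit now argument is added here only so the example is deterministic (the methods named in the PR take no such argument).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchContext:
    """Hypothetical stand-in for OAuthContext with only the fields used here."""

    has_refresh_token: bool
    token_expiry_time: Optional[float]   # hard expiry, absolute seconds
    token_refresh_time: Optional[float]  # jittered proactive point, absolute seconds

    def is_token_valid(self, now: float) -> bool:
        # Hard validity only (simplified); proactive refresh does not change it.
        return self.token_expiry_time is None or now <= self.token_expiry_time

    def can_refresh_token(self) -> bool:
        return self.has_refresh_token

    def should_refresh_token(self, now: float) -> bool:
        # False without a refresh token or without a known refresh point.
        if not self.can_refresh_token() or self.token_refresh_time is None:
            return False
        return now >= self.token_refresh_time


def phase1_needs_refresh(ctx: SketchContext, now: float) -> bool:
    """Refresh when hard-invalid OR past the proactive point, if refreshable."""
    hard_invalid = not ctx.is_token_valid(now)
    return ctx.can_refresh_token() and (hard_invalid or ctx.should_refresh_token(now))


ctx = SketchContext(has_refresh_token=True, token_expiry_time=1000.0, token_refresh_time=800.0)
print(phase1_needs_refresh(ctx, now=700.0))  # False: fresh token, used directly
print(phase1_needs_refresh(ctx, now=850.0))  # True: past the proactive point
print(ctx.is_token_valid(now=850.0))         # True: still hard-valid at that moment
```

Keeping the two predicates separate means a token past its proactive point is still usable if the refresh attempt is delayed.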

Edge cases handled

expires_in is None → token_refresh_time is None; should_refresh_token() returns False and behavior degrades to the existing reactive path.
Tiny TTLs (e.g. expires_in smaller than max_jitter_seconds): jitter is clamped to the available (refresh_at - now) window so the result never goes negative or before now.
Never past hard expiry: the result is always clamped into [now, hard_expiry], so a zero TTL collapses to now.
String expires_in (some servers return it as a string): handled via int() like calculate_token_expiry.
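
One way the edge cases above could fit together is sketched below. It is a minimal sketch under stated assumptions, not the function added in src/mcp/shared/auth_utils.py: the name sketch_refresh_time and the now parameter are hypothetical (now is added only for deterministic examples), and the exact clamping order in the PR may differ.

```python
import random
import time
from typing import Optional, Union


def sketch_refresh_time(
    expires_in: Union[int, str, None],
    *,
    refresh_fraction: float = 0.8,
    max_jitter_seconds: float = 30.0,
    jitter: Optional[float] = None,
    now: Optional[float] = None,  # hypothetical parameter, for determinism only
) -> Optional[float]:
    """Return an absolute proactive-refresh time, or None if the TTL is unknown."""
    if expires_in is None:
        return None  # caller falls back to the reactive path

    ttl = max(0, int(expires_in))  # int() also accepts strings such as "3600"
    current = time.time() if now is None else now
    offset = ttl * refresh_fraction

    if jitter is None:
        jitter = random.uniform(0.0, max_jitter_seconds)
    # Clamp jitter to the available window so the result never precedes `current`.
    jitter = min(max(jitter, 0.0), offset)

    refresh_at = current + offset - jitter
    # Final guard: never before `current`, never past hard expiry.
    return min(max(refresh_at, current), current + ttl)


print(sketch_refresh_time(None))                              # None
print(sketch_refresh_time("3600", jitter=12.0, now=1000.0))   # 3868.0
print(sketch_refresh_time(10, jitter=30.0, now=1000.0))       # 1000.0 (tiny TTL)
```

Injecting jitter bypasses the RNG, which is what makes the ordering and clamping properties straightforward to assert in tests.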

Backward compatibility

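Backward compatible at the API level. No existing public signatures change; the new function, field, and method are purely additive. The one observable behavior change is that clients with a known expires_in now refresh earlier (at roughly 80% of the lifetime, minus jitter) instead of after hard expiry. Clients that never got an expires_in keep the old reactive behavior exactly.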

Test coverage

tests/shared/test_auth_utils.py — 9 new tests for calculate_token_refresh_time: normal TTL within the jitter window and strictly before hard expiry, None → None, string expires_in, deterministic injected jitter, jitter ordering (more jitter → earlier), never-past-hard-expiry across many TTLs, tiny-TTL no-negative, zero-TTL collapse, custom fraction.
tests/client/test_auth.py — should_refresh_token() predicate (hard-valid-but-past-window → True; fresh → False; no refresh time → False; no refresh token → False), plus two async_auth_flow integration tests: one proving a proactive refresh request is yielded while the token is still hard-valid, and one proving a fresh token is used directly with no refresh.

All tests in tests/client/test_auth.py and tests/shared/test_auth_utils.py pass (128 passed, 1 xfailed). uv run ruff check, uv run ruff format --check, and uv run pyright are clean on all touched files.

Relationship to #2858

This is complementary to and independent of #2858. That PR addresses concurrency/locking of refresh (narrowing the anyio.Lock scope + single-flight refresh_lock to fix a RuntimeError). This PR is purely about when a refresh fires (proactive + jittered), not how it's locked. They touch different concerns and compose cleanly; if #2858 lands first this will need only a trivial rebase.

Credit

Motivated by production feedback from @Ben-Home (CorpusIQ) on #2847. Refs #2847, #2858.

feat(auth): proactive OAuth token refresh with jitter to reduce concurrent refresh spikes

b6da2c6

Bartok9 mentioned this pull request Jun 13, 2026

Bug: anyio.Lock in async_auth_flow causes RuntimeError under concurrent OAuth MCP connections #2847

Open

fix(auth): drop stale no-cover pragmas on jitter clamp guards now exercised by tests

f8699ea

Bartok9 wants to merge 2 commits into modelcontextprotocol:main from Bartok9:feat/oauth-proactive-refresh-jitter
