Skip to content

test: retry live RBAC per-server access#5482

Open
lucarlig wants to merge 3 commits into
mainfrom
user/luca/retry-live-rbac-server-access
Open

test: retry live RBAC per-server access#5482
lucarlig wants to merge 3 commits into
mainfrom
user/luca/retry-live-rbac-server-access

Conversation

@lucarlig

@lucarlig lucarlig commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Bug-fix PR

Summary

The live RBAC per-server MCP allow-path tests fail when they exercise a freshly-created user/server before ContextForge has published dataplane config to Redis. This PR fixes that at both ends:

  • makes the dataplane publisher snapshot interval configurable (DATAPLANE_PUBLISHER_INTERVAL_SECONDS, default 60s, unchanged), so functional test stacks can run a short interval;
  • makes the two allow-path checks wait for publisher convergence with a deadline that covers one full publish interval plus slack (default 75s, tunable via MCP_E2E_PUBLISHER_SYNC_DEADLINE).

Reproduction Steps

Run the live RBAC MCP transport tests against the split control-plane/dataplane stack. The failures were seen in:

  • test_public_token_accesses_public_server
  • test_team_member_accesses_team_server

Both tests create fixtures, then immediately call the per-server MCP endpoint.

Root Cause

The dataplane publisher is a periodic full-snapshot loop with a hardcoded 60s interval (REDIS_PUBLISHER_TIME in mcpgateway/services/dataplane_publisher.py); there is no event-driven publish. A user/server created between snapshots is invisible to the dataplane until the next cycle, so convergence latency is uniform 0-60s. A short fixed retry (the previous revision of this PR used 5 x 100ms) cannot cover that window: verified against a live split stack, the tests still failed 3 runs out of 3, and a hand-widened 30s retry still missed when the fixture ran just after a publish cycle.

Fix Description

  • mcpgateway/config.py: new dataplane_publisher_interval_seconds setting (default 60, ge=1).
  • mcpgateway/services/dataplane_publisher.py: interval read from settings; UserConfig key TTL derived as interval + 10, preserving the existing 60/70 relationship at the default.
  • .env.example: document the new variable.
  • tests/live_gateway/mcp/test_mcp_rbac_transport.py: allow-path helper now retries until a deadline (default 75s = one default publish interval + slack; 1s between attempts) instead of a fixed attempt count, and re-raises the original client error on timeout. MCP_E2E_PUBLISHER_SYNC_DEADLINE lets stacks with a short publisher interval keep the wait proportionally short.

Default-config deployments are behaviorally unchanged.

Reviewability

  • This PR has one clear purpose
  • The linked issue is not labeled triage
  • Unrelated bugs or improvements are tracked in separate issues/PRs
  • Tests are included with the code they validate

Verification

Check Command Status
Ruff lint uv run ruff check mcpgateway/config.py mcpgateway/services/dataplane_publisher.py tests/live_gateway/mcp/test_mcp_rbac_transport.py Pass
Ruff format uv run ruff format --check (same files) Pass
Publisher unit tests uv run pytest tests/unit/mcpgateway/services/test_dataplane_publisher.py 21 passed
Affected test collection uv run pytest tests/live_gateway/mcp/test_mcp_rbac_transport.py::TestMcpPerServerEndpoint --collect-only -q Pass
Live split stack, DATAPLANE_PUBLISHER_INTERVAL_SECONDS=2 both allow-path tests Pass (see verification comment)

MCP Compliance (if relevant)

  • Matches current MCP spec
  • No breaking change to MCP clients

Checklist

  • Code formatted (make black isort pre-commit)
  • No secrets/credentials committed

Signed-off-by: lucarlig <luca.carlig@ibm.com>
lucarlig added 2 commits July 2, 2026 18:44
The dataplane publisher interval was a hardcoded 60s module constant,
so runtime-created users and servers stayed invisible to the dataplane
for up to a full minute. Expose it as
DATAPLANE_PUBLISHER_INTERVAL_SECONDS (default 60, unchanged) and derive
the UserConfig key TTL as interval + 10, preserving the current 60/70
relationship. Functional test stacks can now run a short interval while
load benchmarks keep the default.

Signed-off-by: lucarlig <luca.carlig@ibm.com>
A 5 x 100ms retry window cannot cover the dataplane publisher snapshot
cycle (default 60s), so the allow-path tests still failed on the split
stack. Switch the helper to a deadline-based wait that covers one full
publish interval plus slack by default (75s), tunable via
MCP_E2E_PUBLISHER_SYNC_DEADLINE for stacks that run a short
DATAPLANE_PUBLISHER_INTERVAL_SECONDS.

Signed-off-by: lucarlig <luca.carlig@ibm.com>
@lucarlig

lucarlig commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

Verified the updated approach against a live split control-plane/dataplane stack (contextforge-gateway-rs:0.1.0, control plane built from this branch):

  • DATAPLANE_PUBLISHER_INTERVAL_SECONDS=2 on the gateway: Redis UserConfig keys refresh every ~2s with TTL 12 (interval + 10), confirming the setting flows through.
  • TestMcpPerServerEndpoint (all 4 tests, including both previously-flaky allow-path checks) with MCP_E2E_PUBLISHER_SYNC_DEADLINE=10: 4 passed, four consecutive runs.
  • Full test_mcp_rbac_transport.py suite: the only failure is the pre-existing test_admin_sees_public_and_team_via_http visibility issue, which reproduces identically on the stock stack and is unrelated to this change.
  • Default config (60s interval / 75s deadline) is behaviorally unchanged; publisher unit tests pass (21).

One note for anyone running the split stack: with a short publish interval the dataplane's per-subject config cache (CONTEXTFORGE_GATEWAY_RS_USER_CONFIG_CACHE_EXPIRY_SECONDS) is worth setting to 0 in test environments — its sliding TTL can pin a snapshot taken before the fixture servers were associated, and steady retries keep renewing it. That's a dataplane-side setting, not part of this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant