feat(AGX1-272): dual-write agent_api_keys to spark-authz behind FGAC flag#248
feat(AGX1-272): dual-write agent_api_keys to spark-authz behind FGAC flag#248dm36 wants to merge 6 commits into
Conversation
Closes the parent_agent cascade gap surfaced on scaleapi/agentex#354. The api_key dual-write (AGX1-272, PR #248) currently calls grant() which writes the owner edge in SpiceDB but NOT the parent_agent edge. The agent_api_key schema requires `read = ... & parent_agent->read & ...`, so every downstream read/update fails closed without that edge. This PR adds register_resource/deregister_resource (Port + adapter + service) and swaps the api_keys use case from grant→register_resource with parent=AgentexResource.agent(agent_id). Now the owner edge and parent_agent edge are written atomically. Stack: - scaleapi/scaleapi#144657 — sgp-authz 0.7.0 (parent_resource kwarg). - scaleapi/agentex#355 — agentex-auth Port + adapter + HTTP routes. - #248 — original AGX1-272 dual-write (this stacks on it). - THIS PR — extends #248 to use the parent-aware path. Changes: - Port: abstract register_resource(resource, parent=None) and deregister_resource(resource). - Adapter proxy: POST /v1/authz/register and /v1/authz/deregister. - Service: mirror existing grant/revoke pattern (principal_context override, _bypass support, parent in log line for cascade debugging). - Use case: swap grant→register_resource passing parent=agent; swap revoke→deregister_resource. except Exception wrappers preserved (fail-closed on register, best-effort on deregister). - Tests: rename mocks to register_resource/deregister_resource; assert the parent edge is passed correctly. Test plan: - pytest agentex/tests/integration/services/test_agent_api_key_service_dual_write.py → 8 / 8 pass. - New test ``test_create_api_key_calls_grant_when_flag_on`` asserts parent.type == AgentexResourceType.agent and parent.selector == agent.id. Other resource types' grant→register_resource swap is out of scope. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…AGENT_API_KEYS_DUAL_WRITE flag Mirrors the AGX1-274 task dual-write pattern (PR #246) for agent_api_keys. - Adds creator_user_id / creator_service_account_id / spark_authz_zedtoken columns to agent_api_keys, with CHECK constraint and concurrent indexes. - On create, when FGAC_AGENT_API_KEYS_DUAL_WRITE is enabled for the caller's account, calls authorization_service.grant(AgentexResource.api_key(id)) BEFORE the Postgres write. Grant failure aborts the create. - On delete, best-effort revoke after the Postgres delete. Failures are logged but do not block the delete. - Adds AgentexResourceType.api_key and AgentexResource.api_key(...) factory. - Creates src/utils/feature_flags.py with both FGAC_TASKS_DUAL_WRITE and FGAC_AGENT_API_KEYS_DUAL_WRITE (file does not exist on main yet; if PR #246 lands first this becomes a rebase concern). Structural divergence from tasks: agent_api_keys have no service layer, so the dual-write logic lives in AgentAPIKeysUseCase rather than a separate service. This keeps the call site simple and avoids inventing a new layer. Route layer (read-side auth checks) is out of scope; that's PR B (AGX1-273). agentex-auth spark_mapping.py update is a sibling-repo concern. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the parent_agent cascade gap surfaced on scaleapi/agentex#354. The api_key dual-write (AGX1-272, PR #248) currently calls grant() which writes the owner edge in SpiceDB but NOT the parent_agent edge. The agent_api_key schema requires `read = ... & parent_agent->read & ...`, so every downstream read/update fails closed without that edge. This PR adds register_resource/deregister_resource (Port + adapter + service) and swaps the api_keys use case from grant→register_resource with parent=AgentexResource.agent(agent_id). Now the owner edge and parent_agent edge are written atomically. Stack: - scaleapi/scaleapi#144657 — sgp-authz 0.7.0 (parent_resource kwarg). - scaleapi/agentex#355 — agentex-auth Port + adapter + HTTP routes. - #248 — original AGX1-272 dual-write (this stacks on it). - THIS PR — extends #248 to use the parent-aware path. Changes: - Port: abstract register_resource(resource, parent=None) and deregister_resource(resource). - Adapter proxy: POST /v1/authz/register and /v1/authz/deregister. - Service: mirror existing grant/revoke pattern (principal_context override, _bypass support, parent in log line for cascade debugging). - Use case: swap grant→register_resource passing parent=agent; swap revoke→deregister_resource. except Exception wrappers preserved (fail-closed on register, best-effort on deregister). - Tests: rename mocks to register_resource/deregister_resource; assert the parent edge is passed correctly. Test plan: - pytest agentex/tests/integration/services/test_agent_api_key_service_dual_write.py → 8 / 8 pass. - New test ``test_create_api_key_calls_grant_when_flag_on`` asserts parent.type == AgentexResourceType.agent and parent.selector == agent.id. Other resource types' grant→register_resource swap is out of scope. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
9668f1a to
e72df68
Compare
…L_WRITE
Per team discussion: rather than maintain a parallel env-var flag system
in scale-agentex, route api_key dual-write flag checks through
egp-api-backend's existing flag service. One source of truth across services,
single flip surface for ops, fewer per-env env-var allowlists to keep in sync.
Changes:
- EnvVarKeys.EGP_API_BACKEND_URL — new env var for the egp-api-backend
base URL. Used by the new HTTP-backed flag provider.
- FeatureFlagProvider rewritten as an HTTP client of egp-api-backend's
GET /feature-flag/{id} endpoint:
* Forwards x-api-key / x-user-id / x-service-account-id /
x-selected-account-id from the caller's principal_context so the
endpoint's REQUIRE_IDENTITY_AND_OPTIONAL_ACCOUNT policy admits the
request.
* Coerces the response's `value` field to bool.
* Fails closed to False on any error (config missing, no identity,
non-2xx, transport failure, JSON parse failure) — the legacy
no-Spark code path is the safe default.
* `is_enabled` is now async (HTTP call). Signature is
`is_enabled(name, *, principal_context, account_id)`.
- AgentAPIKeysUseCase: both call sites now await is_enabled and pass
principal_context. _deregister grabs principal_context from
self.authorization_service.
- Test fixtures: mock FeatureFlagProvider directly (Mock with
is_enabled = AsyncMock(return_value=flag_on)) so dual-write tests stay
hermetic. The pre-existing FeatureFlagProvider() no-arg constructions
in test_agents_api_keys_use_case.py and integration_client.py now pass
egp_api_backend_url=None (provider returns False without it, matching
the prior "flag never enabled in unit tests" behavior).
Out of scope:
- Migrating Asher's FGAC_TASKS_DUAL_WRITE flag check off env vars.
That's task-team-owned and we leave their existing pattern alone per
the team discussion (new-work-only).
- Caching the flag response. Each is_enabled is a fresh HTTP call.
Egp-api-backend's flag endpoint is fast and the caller paths are
already crossing the network for the actual register/deregister, so
one extra round-trip is acceptable for now. Add caching later if
load profiling shows it matters.
Test plan:
- uv run pytest agentex/tests/integration/services/test_agent_api_key_service_dual_write.py — 8/8 pass.
- Existing 4 unrelated test_agents_api_keys_use_case.py docker-fixture
errors predate this commit (verified via `git stash`).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| FGAC_AGENT_API_KEYS_DUAL_WRITE = "fgac-agent-api-keys-dual-write" | ||
|
|
||
|
|
||
| class FeatureFlagProvider: |
There was a problem hiding this comment.
FeatureFlagProvider couples OSS scale-agentex to egp-api-backend. Can we not add feature flag checks in agentex-auth/ inside the register implementation?
| None, | ||
| description="Service identity ID of the service account that created this API key", | ||
| ) | ||
| spark_authz_zedtoken: str | None = Field( |
There was a problem hiding this comment.
spark_authz_zedtoken is spark-authz specific token and is leaking a proprietary concept into the OSS Postgres schema/entity.
| description="The type of the API key (either internal or external)", | ||
| ) | ||
| api_key: str = Field(..., description="The API key") | ||
| creator_user_id: str | None = Field( |
There was a problem hiding this comment.
Are we using creator_user_id / creator_service_account_id anywhere? Is it really something we want to persist in the db?
Three rounds of Harvey's review feedback together pointed at the same concern: scale-agentex (OSS) should not be coupled to egp-api-backend (proprietary) feature-flag service or Spark-specific schema concepts. Changes: - Drop FeatureFlagProvider and the egp-api-backend HTTP query entirely. scale-agentex now calls register_resource/deregister_resource unconditionally. agentex-auth's per-account routing (FGAC_AGENTEX_AUTH_SPARK, scaleapi/agentex#353) decides whether the call lands on Spark AuthZ or falls back to legacy SGP for the caller's account. - Drop the creator_user_id / creator_service_account_id columns from the agent_api_keys table. They were only used as a runtime guard inside the auth-registration helper. Moved that check inline. - Drop the spark_authz_zedtoken column. The name leaked SpiceDB-specific terminology into the OSS schema, and the column was always None. - Delete the Alembic migration entirely. migration_history.txt re-pointed to head=6c942325c828. - Rename helper methods: _register_api_key_in_spark_authz -> _register_api_key_in_auth, same for deregister. Test plan: 6/6 dual-write integration tests pass. Pre-existing docker-fixture errors in test_agents_api_keys_use_case.py unrelated. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| async def delete(self, id: str) -> None: | ||
| return await self.agent_api_key_repo.delete(id=id) | ||
| async def delete(self, id: str, account_id: str | None = None) -> None: | ||
| await self.agent_api_key_repo.delete(id=id) |
There was a problem hiding this comment.
do we also need to check if the api key exists (self.agent_api_key_repo.get(id=id)) before deleting + calling _deregister_api_key_from_auth, like the other two delete_<> functions added here?
Per @jenniechung review: delete(id) used to call deregister unconditionally, which wastes an auth round-trip when the id doesn't exist. The two delete_by_* methods already use pre-fetch + `if existing is not None`; delete() now matches that pattern via try/except ItemDoesNotExist. Also tightened the unconditional-register comment from 5 lines to 2. Tests: - Fixture now defaults repo.get() to a stub entity so existing delete tests flow through the deregister path; new helper `_stub_api_key`. - New test_delete_api_key_skips_deregister_when_row_does_not_exist asserts deregister.assert_not_called() when repo.get raises ItemDoesNotExist. - 7/7 dual-write tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t fixture Per reviewer: the agent_api_keys_use_case fixture had `grant` and `revoke` set up as AsyncMocks but not `register_resource` / `deregister_resource`. The deregister path silently swallowed the resulting TypeError via its `except Exception` handler, so delete tests passed without exercising the auth-side call. Matches the integration_client.py fixture shape. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| raise HTTPException(status_code=409, detail=error_msg) | ||
|
|
||
| new_api_key = request.api_key or secrets.token_hex(32) | ||
| account_id = getattr(authorization_service.principal_context, "account_id", None) |
There was a problem hiding this comment.
Now that FF is moved to agentex-auth, do we still need to thread account_id here? Suggest dropping the account_id extraction/threading entirely and letting agentex-auth derive the account_id from the principal. This keeps the SGP-specific field name out of the OSS repo. The only place I see you're actually using account_id is in log extract, which IMO should be dropped as well.
| api_key_type=api_key_type, | ||
| api_key=api_key, | ||
| ) | ||
| return await self.agent_api_key_repo.create(item=agent_api_key) |
There was a problem hiding this comment.
Should we follow the egp-api-backend's pattern and add compensating deregister if repo.create fails?
| existing = await self.agent_api_key_repo.get(id=id) | ||
| except ItemDoesNotExist: | ||
| existing = None | ||
| await self.agent_api_key_repo.delete(id=id) |
There was a problem hiding this comment.
nit: if ItemDoesNotExist, shouldn't we skip repo.delete as well and just return?
Summary
Wires
register_resource/deregister_resourceintoAgentAPIKeysUseCase.create/deleteso api_keys get a SpiceDB tuple (withparent_agentedge) on create and have that tuple removed on delete. Companion to the route-migration PR (#252).scale-agentex is intentionally decoupled from egp-api-backend. The dual-write call is unconditional from this repo. agentex-auth's per-account routing (scaleapi/agentex#353,
FGAC_AGENTEX_AUTH_SPARK) decides whether the call lands on Spark AuthZ or falls back to the legacy SGP backend for the caller's account.Linear ticket: https://linear.app/scale-epd/issue/AGX1-272/fgac-agent-api-key-dual-write-registerderegister-on-createdelete
Cross-repo stack
parent_resourcekwarg/v1/authz/register,/v1/authz/deregister) +api_keymappingFGAC_AGENTEX_AUTH_SPARKflagregister_resource(parent=agent)Changes
Port + adapter
src/adapters/authorization/port.py— abstractregister_resource(resource, parent=None)andderegister_resource(resource)onAuthorizationGateway.src/adapters/authorization/adapter_agentex_authz_proxy.py— POST/v1/authz/registerand/v1/authz/deregisterto agentex-auth (endpoints exposed by #354).Service layer
src/domain/services/authorization_service.py—register_resource/deregister_resourceservice methods mirror the grant/revoke pattern (principal_contextoverride,_bypasssupport, parent identity in log line).Use case
src/domain/use_cases/agent_api_keys_use_case.py:createcalls_register_api_key_in_authunconditionally;parent=AgentexResource.agent(agent_id)is load-bearing for the SpiceDB cascade.delete(anddelete_by_*variants) call_deregister_api_key_from_authafter the Postgres row is gone.principal_contextat call time — no persisted creator columns.Tests
tests/integration/services/test_agent_api_key_service_dual_write.py— 6 cases. Mocks the authorization service; asserts the call sequencing inside the use case.tests/integration/fixtures/integration_client.py+tests/unit/use_cases/test_agents_api_keys_use_case.pyupdated for the new constructor signature.Test plan
test_create_api_key_calls_register_resource_with_parentassertsparent=AgentexResource.agent(agent.id)is passed on every register call.test_agents_api_keys_use_case.pypredate this PR (verified viagit stash).Diff scope
10 files changed, 55 insertions(+), 368 deletions(-)— net deletion. Earlier iterations had ~370 lines of OSS-coupling that Harvey's review pruned.Out of scope / follow-ups
spark_authz_zedtoken).task,build,deployment,schedule). Each owns its own follow-up.Linked: AGX1-272.
Greptile Summary
Wires
register_resource/deregister_resourceintoAgentAPIKeysUseCase.createanddelete, so every api_key gets a SpiceDB tuple (with theparent_agentedge) on creation and has it removed on deletion. Registration is fail-closed (raises, blocking the Postgres write); deregistration is best-effort (logs, never blocks the delete). Internal/identity-less paths are skipped with a warning rather than failing.port.py,adapter_agentex_authz_proxy.py): new abstract methods and their HTTP implementations follow the exact pattern of the pre-existinggrant/revokepair.authorization_service.py):register_resource/deregister_resourcemirrorgrant/revoke— bypass check, principal override via sentinel..., then gateway delegation.agent_api_keys_use_case.py): pre-generatesapi_key_idso the same ID is used for both the SpiceDB tuple and the Postgres row; all three delete variants call deregister after the row is gone.Confidence Score: 5/5
Safe to merge — the dual-write is decoupled from egp-api-backend's feature flag, fail-closed on register and best-effort on deregister semantics are correctly implemented, and the SpiceDB parent-agent edge is tested end-to-end.
All three delete variants are wired correctly, the pre-generated api_key_id ensures the same ID lands in both SpiceDB and Postgres, and the bypass / no-creator guards are complementary rather than conflicting. The integration test suite pins the parent-edge contract and the key ordering invariants.
No files require special attention. The only observation is a test-coverage gap in the _principal helper — it always sets service_account_id=None, leaving the service-account-only creation path untested.
Important Files Changed
Sequence Diagram
sequenceDiagram participant Route as API Route participant UC as AgentAPIKeysUseCase participant AuthSvc as AuthorizationService participant GW as AgentexAuthorizationProxy participant DB as Postgres Note over Route,DB: CREATE flow (fail-closed) Route->>UC: create(agent_id, api_key, account_id) UC->>DB: get agent (validate exists) UC->>UC: generate api_key_id UC->>UC: _register_api_key_in_auth() UC->>UC: check principal user_id / service_account_id alt no creator identity UC-->>UC: skip + warning log else creator present UC->>AuthSvc: "register_resource(resource=api_key, parent=agent)" AuthSvc->>AuthSvc: _bypass() check alt bypass AuthSvc-->>UC: return (no-op) else enabled AuthSvc->>GW: register_resource(principal, resource, parent) GW->>GW: POST /v1/authz/register alt SpiceDB error GW-->>UC: raise (DB write blocked) end end end UC->>DB: repo.create(api_key_entity) UC-->>Route: AgentAPIKeyEntity Note over Route,DB: DELETE flow (best-effort) Route->>UC: delete(id, account_id) UC->>DB: repo.get(id) UC->>DB: repo.delete(id) alt row existed UC->>AuthSvc: "deregister_resource(resource=api_key)" AuthSvc->>GW: POST /v1/authz/deregister alt failure GW-->>UC: raise UC-->>UC: catch + warning log end end UC-->>Route: NoneComments Outside Diff (1)
agentex/tests/integration/services/test_agent_api_key_service_dual_write.py, line 703-727 (link)The suite covers
grantfailure preventing the DB write, but not the reverse:grantsucceeding followed byrepo.createraising (e.g., a duplicate key or transient DB error). In that case a Spark tuple exists for anapi_key_idthat never lands in Postgres. Adding a test for this case would make the documented known-limitation explicit and guard against accidental regression if compensating logic is added later.Prompt To Fix With AI
Prompt To Fix All With AI
Reviews (6): Last reviewed commit: "test(AGX1-272): add AsyncMock for regist..." | Re-trigger Greptile