feat: 010 scalability improvements#301
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
This PR implements the first slice of feature 010 horizontal scaling correctness for the Guardian server by moving previously per-process coordination state (auth sessions/challenges and canonicalization ownership) into shared Postgres-backed coordination, adding per-replica rate-limit partitioning keyed by an infra-sourced max replica capacity, and documenting the required HA configuration.
Changes:
- Introduces a
coordinationsubsystem with in-memory (filesystem/dev) and Postgres (prod) implementations for sessions, challenges, and canonicalization leadership. - Updates canonicalization to run under a shared lease with renewal + fencing checks, and adds periodic sweeps for expired sessions/challenges.
- Wires
GUARDIAN_MAX_REPLICASthrough Terraform and documents HA/runbook guidance + prod-stage startup guards.
Reviewed changes
Copilot reviewed 47 out of 48 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| speckit/features/010-horizontal-scaling/spec.md | New feature spec describing HA correctness goals, scenarios, and requirements. |
| speckit/features/010-horizontal-scaling/data-model.md | Defines Postgres tables + concurrency approach (advisory lock) for replica-safe coordination. |
| speckit/features/010-horizontal-scaling/contracts/db-schema.md | Migration contract for new coordination tables + required atomic operations. |
| speckit/features/010-horizontal-scaling/contracts/coordination-traits.md | Contract for new internal coordination traits and their selection rules. |
| speckit/features/010-horizontal-scaling/contracts/config-contract.md | Contract for env vars and prod-stage startup guards/log line for HA mode. |
| packages/miden-multisig-client/test-results/.last-run.json | Adds a test-results artifact file under the TS multisig client. |
| infra/variables.tf | Adds guardian_max_replicas Terraform variable and validation. |
| infra/ecs.tf | Injects GUARDIAN_MAX_REPLICAS into the ECS task definition environment. |
| infra/data.tf | Computes effective_guardian_max_replicas (clamped to autoscaling max capacity). |
| docs/SERVER_AWS_DEPLOY.md | Adds a section explaining HA behavior for the prod profile and links the runbook. |
| docs/runbooks/horizontal-scaling.md | New operator runbook for multi-replica HA deployments and trade-offs/failure modes. |
| docs/CONFIGURATION.md | Documents GUARDIAN_MAX_REPLICAS, prod-stage guards, and HA behavior notes. |
| crates/server/src/storage/postgres.rs | Serializes Diesel migrations across replicas using a Postgres advisory lock. |
| crates/server/src/schema.rs | Adds Diesel schema definitions for new coordination tables. |
| crates/server/src/middleware/rate_limit.rs | Partitions rate limits by GUARDIAN_MAX_REPLICAS + adds tests and warnings. |
| crates/server/src/main.rs | Threads coordination handles from storage builder into server builder. |
| crates/server/src/lib.rs | Exposes new config and coordination modules. |
| crates/server/src/jobs/canonicalization/worker.rs | Acquires/renews a shared lease and cancels work if leadership is lost. |
| crates/server/src/jobs/canonicalization/processor.rs | Adds fencing checks and cooperative cancellation checkpoints during canonicalization. |
| crates/server/src/evm/session.rs | Moves EVM auth challenges/sessions to coordination stores; adds sweep support. |
| crates/server/src/evm/mod.rs | Allows constructing EVM app state with injected session/challenge stores. |
| crates/server/src/dashboard/types.rs | Removes now-obsolete in-memory session/challenge record types. |
| crates/server/src/dashboard/state.rs | Moves operator challenges/sessions to coordination stores; adds prod cursor-secret guard and sweeps. |
| crates/server/src/coordination/session_store.rs | New SessionStore trait + in-memory implementation. |
| crates/server/src/coordination/postgres/session_store.rs | New Postgres SessionStore implementation backed by auth_sessions. |
| crates/server/src/coordination/postgres/mod.rs | New Postgres coordination module utilities and exports. |
| crates/server/src/coordination/postgres/lease.rs | New Postgres lease elector implementation backed by worker_leases. |
| crates/server/src/coordination/postgres/challenge_store.rs | New Postgres ChallengeStore implementation backed by auth_challenges. |
| crates/server/src/coordination/mod.rs | New coordination handle wiring + backend-derived coordination mode selection. |
| crates/server/src/coordination/leader.rs | New LeaderElector trait + AlwaysLeader in-memory implementation. |
| crates/server/src/coordination/challenge_store.rs | New ChallengeStore trait + in-memory implementation + payload encoding/decoding. |
| crates/server/src/config/stage.rs | Centralizes GUARDIAN_ENV=prod detection used by startup guards. |
| crates/server/src/config/mod.rs | New config module root. |
| crates/server/src/builder/storage.rs | Returns CoordinationHandles from storage builder (Postgres vs in-memory). |
| crates/server/src/builder/startup.rs | Logs a single “coordination” line describing resolved mode/backend/stage/max_replicas/secret source. |
| crates/server/src/builder/mod.rs | Wires coordination into dashboard/EVM state construction and adds prod guard for rate-limit partitioning to zero. |
| crates/server/src/builder/handle.rs | Starts canonicalization with a leader handle and adds a periodic session/challenge sweep worker. |
| crates/server/src/api/evm.rs | Adjusts EVM logout to handle possible revoke errors (currently logs + continues). |
| crates/server/src/api/dashboard.rs | Adjusts operator logout to handle possible revoke errors (currently logs + continues). |
| crates/server/src/ack/mod.rs | Reuses the new config::stage::is_prod() helper for ACK secret selection. |
| crates/server/migrations/2026-06-24-000001_worker_leases/up.sql | New migration creating worker_leases. |
| crates/server/migrations/2026-06-24-000001_worker_leases/down.sql | New migration dropping worker_leases. |
| crates/server/migrations/2026-06-23-000002_auth_challenges/up.sql | New migration creating auth_challenges + indexes. |
| crates/server/migrations/2026-06-23-000002_auth_challenges/down.sql | New migration dropping auth_challenges + indexes. |
| crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql | New migration creating auth_sessions + indexes. |
| crates/server/migrations/2026-06-23-000001_auth_sessions/down.sql | New migration dropping auth_sessions + indexes. |
| crates/server/Cargo.toml | Adds tokio-util dependency for cancellation support in canonicalization. |
| Cargo.lock | Updates lockfile for new dependency. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let token = extract_cookie(&headers, state.dashboard.cookie_name()); | ||
| state | ||
| if let Err(error) = state | ||
| .dashboard | ||
| .logout(token.as_deref(), state.clock.now()) | ||
| .await; | ||
| .await | ||
| { | ||
| tracing::warn!(auth_event = "logout_failed", %error, "Operator logout revoke failed"); | ||
| } |
| let token = extract_cookie(&headers, state.evm.sessions.cookie_name()); | ||
| state | ||
| if let Err(error) = state | ||
| .evm | ||
| .sessions | ||
| .logout(token.as_deref(), state.clock.now()) | ||
| .await; | ||
| .await | ||
| { | ||
| tracing::warn!(auth_event = "evm_logout_failed", %error, "EVM logout revoke failed"); | ||
| } |
| if !cfg!(test) { | ||
| tracing::warn!( | ||
| "dashboard cursor secret not configured; generating ephemeral per-process \ | ||
| secret. Multi-replica deployments must set \ | ||
| GUARDIAN_DASHBOARD_CURSOR_SECRET to a stable shared 32-byte hex value." | ||
| ); |
| variable "guardian_max_replicas" { | ||
| description = <<-EOT | ||
| Optional override for GUARDIAN_MAX_REPLICAS, the maximum replica capacity the | ||
| server divides global rate limits by. Defaults to the effective autoscaling | ||
| max capacity. Drives rate-limit partitioning only (coordination mode is | ||
| backend-derived). A value below the real max lets the aggregate exceed the | ||
| global limit, so an explicit override must be >= the autoscaling max. | ||
| EOT | ||
| type = number | ||
| default = null |
| { | ||
| "status": "passed", | ||
| "failedTests": [] | ||
| } No newline at end of file |
| validation { | ||
| condition = var.guardian_max_replicas == null || var.guardian_max_replicas >= 1 | ||
| error_message = "guardian_max_replicas must be >= 1 when set." | ||
| } |
There was a problem hiding this comment.
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
crates/server/src/jobs/canonicalization/processor.rs (2)
359-400: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy liftAvoid partial canonicalization when the lease drops mid-sequence.
submit_state(...)commits before the second lease check at Line 400. If the lease is lost after Line 359 but before Line 400, the processor aborts with the account state advanced while the delta remains non-canonical. This needs an atomic, fenced write path for state + canonical delta + metadata, or an explicit idempotent recovery strategy.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/server/src/jobs/canonicalization/processor.rs` around lines 359 - 400, The canonicalization flow in `processor.rs` can leave state partially applied because `submit_state`, auth syncing, and the canonical delta update happen before the final lease validation. Update the `canonicalization` path around `ensure_lease_held`, `submit_state`, and `update_auth` so the state write and delta canonicalization are fenced atomically under a single lease check, or add an idempotent recovery mechanism that safely reconciles `updated_state`, `canonical_delta`, and metadata if the lease is lost mid-sequence. Ensure the fix keeps `submit_state`, `update_auth`, and the final `DeltaStatus::canonical` transition consistent with one another.
266-280: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winFence metadata writes as well as storage writes.
set_has_pending_candidate(...)andupdate_auth(...)are state-mutating writes but are not guarded immediately before execution. A replica can pass an earlier fence, lose the lease, then still mutate metadata. Addensure_lease_held(&delta).await?directly before each metadata mutation, or move these writes into a single fenced storage operation.Also applies to: 383-386, 409-412
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/server/src/jobs/canonicalization/processor.rs` around lines 266 - 280, The discard path in processor.rs mutates metadata after a storage delete without re-checking the lease, so a replica can stale-write state after losing ownership. In the discard flow around ensure_lease_held, call ensure_lease_held(&delta).await? immediately before set_has_pending_candidate, and apply the same fencing to every other state mutation in this file, especially update_auth in the other affected branches, or consolidate those writes into one fenced operation.
🧹 Nitpick comments (1)
crates/server/src/dashboard/state.rs (1)
871-901: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick winMake the digest-keying test observe the inserted store key.
This test would still pass if the implementation regressed to storing and looking up plaintext tokens, because it only checks
session_digest(token) != tokenand then authenticates through the same production path. Inject a small recordingSessionStoreand assertinsert(...)receivessession_digest(token), never the raw token bytes.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/server/src/dashboard/state.rs` around lines 871 - 901, The digest-keying test currently only proves the digest differs from the token and that authentication succeeds, so it can miss a regression where plaintext tokens are stored. Update the test around session_is_keyed_by_digest_not_plaintext_token to use a small recording SessionStore (or equivalent test double) so you can inspect the key passed into insert and confirm it matches crate::secret::session_digest(token) rather than the raw token bytes. Keep the existing round-trip check, but add a direct assertion on the recorded inserted key to verify DashboardState::verify stores only the digest.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql`:
- Around line 7-16: The auth_sessions table is still keyed only by token_digest,
so realm-scoped sessions can collide across operator and EVM namespaces. Update
the migration to make the primary key composite on (realm, token_digest), then
mirror that change in crates/server/src/schema.rs and any code or SQL that
assumes a single-column key, such as auth session insert/delete/load paths, so
Diesel and the database agree on the new realm-scoped identifier.
In `@crates/server/src/api/dashboard.rs`:
- Around line 183-188: The logout flow in the dashboard API currently logs
revoke failures and still reports success, so it should fail closed instead.
Update the handler in the logout path that calls state.dashboard.logout(...) to
return an error response when revocation fails instead of continuing to a
success payload, and make sure the response contract reflects this change. If
the documented responses change, update the utoipa annotations on the relevant
endpoint and regenerate the committed OpenAPI specs with the project’s
gen-openapi command.
In `@crates/server/src/api/evm.rs`:
- Around line 183-190: The EVM logout handler currently swallows failures from
state.evm.sessions.logout and still returns success; restore the previous
fail-closed behavior by propagating that error out of the logout path instead of
only logging it with tracing::warn!(auth_event = "evm_logout_failed", ...).
Update the logout endpoint logic in evm.rs so a revoke failure results in an
error response from the handler, and if you intentionally keep success responses
then also update the OpenAPI response docs/specs to match the new HTTP contract.
In `@crates/server/src/builder/mod.rs`:
- Around line 455-505: The Postgres coordination guard in the builder is too
late because it only runs inside the dashboard branch, allowing a manual
Postgres build with a custom dashboard to still proceed with AlwaysLeader and
per-process state. Move the missing-coordination check in the builder flow
immediately after `let coordination = self.coordination;` in
`crates/server/src/builder/mod.rs`, before `coordination_mode`, `leader`, or
`dashboard` are derived, so any Postgres build without coordination handles
fails closed regardless of how `dashboard` is provided.
In `@crates/server/src/config/stage.rs`:
- Around line 9-11: `is_prod` currently checks ENV_GUARDIAN_ENV directly, so
values like "prod " are treated as non-production and bypass startup guards.
Update the logic in `stage.rs` to trim the environment value before comparing it
in `eq_ignore_ascii_case(PROD_ENV)`, keeping the existing handling for
`VarError::NotPresent` and other errors unchanged.
In `@crates/server/src/coordination/postgres/challenge_store.rs`:
- Around line 76-121: The issue is that concurrent issue() calls in
ChallengeStore::issue can bypass the per-principal cap because each transaction
sees stale outstanding counts before commit. Add a transaction-scoped lock keyed
by (realm, principal) at the start of the transaction in
PostgresChallengeStore::issue, before the INSERT/DELETE cleanup sequence, so
only one issuance for the same principal runs at a time. Use the existing
transaction block around conn.transaction and place the lock before the
auth_challenges statements that enforce max_outstanding.
In `@crates/server/src/coordination/postgres/lease.rs`:
- Around line 55-69: The lease flow in acquire/release is resetting fence_token
instead of keeping it strictly increasing, which breaks monotonic fencing across
reacquire. Update the lease handling in the try_acquire logic and the release
method so a holder’s reacquisition always advances fence_token, and avoid
deleting the worker_leases row on release if that would reset the counter. Use
the existing worker_leases upsert and fence_token update logic as the place to
enforce the monotonic increment on every acquisition path.
In `@crates/server/src/middleware/rate_limit.rs`:
- Around line 463-465: The env-mutation lock is not being applied consistently
across the rate-limit test suite, so `test_rate_limit_layer_from_env` can still
race with other tests. Update that test to acquire `ENV_LOCK` before touching
shared environment state, and make sure it also clears `GUARDIAN_MAX_REPLICAS`
along with the existing rate-limit env vars. Review the other env-mutating tests
in `rate_limit.rs` and wrap them with the same `ENV_LOCK` pattern so all
shared-process-environment changes are serialized.
In `@crates/server/src/storage/postgres.rs`:
- Around line 45-51: The migration lock acquisition in the postgres startup path
uses pg_advisory_lock and can block forever, so update the lock logic in the
connection/setup flow around the advisory lock call to fail fast instead of
waiting indefinitely. Use pg_try_advisory_lock with a retry deadline/backoff, or
apply a statement/lock timeout before attempting the lock, and keep the existing
error propagation through the migration lock acquisition code so startup can
recover when the lock is held too long.
In `@docs/CONFIGURATION.md`:
- Line 288: The Multi-replica (HA) configuration row overstates what the prod
Terraform profile currently provides; update the wording in the documentation
table to reflect only the variables actually injected by the ECS task
definition. Use the existing configuration entry for the HA quick-start to make
it clear that `GUARDIAN_MAX_REPLICAS` is supplied, while
`GUARDIAN_DASHBOARD_CURSOR_SECRET` must be set separately and is not wired by
the prod Terraform profile.
In `@docs/SERVER_AWS_DEPLOY.md`:
- Around line 556-558: Update the Terraform/runtime docs to use the exact
`effective_guardian_max_replicas` terminology and clarify the
`GUARDIAN_MAX_REPLICAS` contract. In the `SERVER_AWS_DEPLOY` guidance, explain
that `GUARDIAN_MAX_REPLICAS` comes from the effective value (derived from
`effective_server_autoscaling_max_capacity`, can be overridden, and is clamped
upward) rather than implying a direct autoscaling-max-only mapping, so the
wording matches the deployed behavior.
In `@infra/variables.tf`:
- Around line 305-318: The guardian_max_replicas variable currently accepts
fractional numbers, but infra/ecs.tf stringifies it and
RateLimitConfig::from_env later parses it as u32, so non-integers can fall back
to 1 and break rate-limit partitioning. Update the validation on
guardian_max_replicas in variables.tf to require whole numbers (and keep the
existing >= 1 rule), so only integer values are allowed. Use the existing
guardian_max_replicas variable block and its validation stanza as the place to
enforce this.
In `@speckit/features/010-horizontal-scaling/contracts/config-contract.md`:
- Around line 40-42: The `effective_guardian_max_replicas` example in the config
contract is missing the upward clamp and should match the implemented behavior.
Update the expression near `local.effective_guardian_max_replicas` so it
preserves the explicit `var.guardian_max_replicas` override only when it is at
least the autoscaling max capacity, otherwise falls back to
`local.effective_server_autoscaling_max_capacity`; this keeps the documented
contract aligned with the runbook and the actual validation logic.
In `@speckit/features/010-horizontal-scaling/data-model.md`:
- Line 14: The document has a heading level jump because the “Migration
concurrency (multi-replica startup) — REQUIRED” section uses a third-level
heading directly under the H1. Update that heading in the data-model markdown to
the correct level so the outline stays consistent and markdownlint MD001 is
satisfied.
In `@speckit/features/010-horizontal-scaling/spec.md`:
- Around line 413-417: The shared-state assumption in the horizontal-scaling
spec incorrectly includes rate-limit counters, which conflicts with FR-010 and
the later dependency text. Update the sentence in the spec so the
Postgres-backed shared store is described for sessions, challenges, leadership,
and cursors only, and remove any mention of rate-limit counters from that shared
coordination/state store assumption.
---
Outside diff comments:
In `@crates/server/src/jobs/canonicalization/processor.rs`:
- Around line 359-400: The canonicalization flow in `processor.rs` can leave
state partially applied because `submit_state`, auth syncing, and the canonical
delta update happen before the final lease validation. Update the
`canonicalization` path around `ensure_lease_held`, `submit_state`, and
`update_auth` so the state write and delta canonicalization are fenced
atomically under a single lease check, or add an idempotent recovery mechanism
that safely reconciles `updated_state`, `canonical_delta`, and metadata if the
lease is lost mid-sequence. Ensure the fix keeps `submit_state`, `update_auth`,
and the final `DeltaStatus::canonical` transition consistent with one another.
- Around line 266-280: The discard path in processor.rs mutates metadata after a
storage delete without re-checking the lease, so a replica can stale-write state
after losing ownership. In the discard flow around ensure_lease_held, call
ensure_lease_held(&delta).await? immediately before set_has_pending_candidate,
and apply the same fencing to every other state mutation in this file,
especially update_auth in the other affected branches, or consolidate those
writes into one fenced operation.
---
Nitpick comments:
In `@crates/server/src/dashboard/state.rs`:
- Around line 871-901: The digest-keying test currently only proves the digest
differs from the token and that authentication succeeds, so it can miss a
regression where plaintext tokens are stored. Update the test around
session_is_keyed_by_digest_not_plaintext_token to use a small recording
SessionStore (or equivalent test double) so you can inspect the key passed into
insert and confirm it matches crate::secret::session_digest(token) rather than
the raw token bytes. Keep the existing round-trip check, but add a direct
assertion on the recorded inserted key to verify DashboardState::verify stores
only the digest.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 58f01332-ef5e-48a4-a152-748af84dfb23
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (47)
crates/server/Cargo.tomlcrates/server/migrations/2026-06-23-000001_auth_sessions/down.sqlcrates/server/migrations/2026-06-23-000001_auth_sessions/up.sqlcrates/server/migrations/2026-06-23-000002_auth_challenges/down.sqlcrates/server/migrations/2026-06-23-000002_auth_challenges/up.sqlcrates/server/migrations/2026-06-24-000001_worker_leases/down.sqlcrates/server/migrations/2026-06-24-000001_worker_leases/up.sqlcrates/server/src/ack/mod.rscrates/server/src/api/dashboard.rscrates/server/src/api/evm.rscrates/server/src/builder/handle.rscrates/server/src/builder/mod.rscrates/server/src/builder/startup.rscrates/server/src/builder/storage.rscrates/server/src/config/mod.rscrates/server/src/config/stage.rscrates/server/src/coordination/challenge_store.rscrates/server/src/coordination/leader.rscrates/server/src/coordination/mod.rscrates/server/src/coordination/postgres/challenge_store.rscrates/server/src/coordination/postgres/lease.rscrates/server/src/coordination/postgres/mod.rscrates/server/src/coordination/postgres/session_store.rscrates/server/src/coordination/session_store.rscrates/server/src/dashboard/state.rscrates/server/src/dashboard/types.rscrates/server/src/evm/mod.rscrates/server/src/evm/session.rscrates/server/src/jobs/canonicalization/processor.rscrates/server/src/jobs/canonicalization/worker.rscrates/server/src/lib.rscrates/server/src/main.rscrates/server/src/middleware/rate_limit.rscrates/server/src/schema.rscrates/server/src/storage/postgres.rsdocs/CONFIGURATION.mddocs/SERVER_AWS_DEPLOY.mddocs/runbooks/horizontal-scaling.mdinfra/data.tfinfra/ecs.tfinfra/variables.tfpackages/miden-multisig-client/test-results/.last-run.jsonspeckit/features/010-horizontal-scaling/contracts/config-contract.mdspeckit/features/010-horizontal-scaling/contracts/coordination-traits.mdspeckit/features/010-horizontal-scaling/contracts/db-schema.mdspeckit/features/010-horizontal-scaling/data-model.mdspeckit/features/010-horizontal-scaling/spec.md
💤 Files with no reviewable changes (1)
- crates/server/src/dashboard/types.rs
| CREATE TABLE auth_sessions ( | ||
| token_digest BYTEA PRIMARY KEY, | ||
| realm TEXT NOT NULL, | ||
| subject JSONB NOT NULL, | ||
| issued_at TIMESTAMPTZ NOT NULL, | ||
| expires_at TIMESTAMPTZ NOT NULL, | ||
| -- Set on logout; the row is kept until natural expiry so the revocation | ||
| -- is honored fleet-wide for as long as the token would have been valid. | ||
| revoked_at TIMESTAMPTZ NULL | ||
| ); |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift
Make the session key realm-scoped in the database.
The table comment says operator and EVM sessions are isolated by realm, but the actual key is still token_digest alone here, and crates/server/src/schema.rs:111-120 mirrors that shape. If both realms ever mint the same token bytes, one insert will fail even though the contract says they should coexist. Make the primary key (realm, token_digest) so the DB enforces the namespace boundary instead of relying on token randomness.
Suggested direction
CREATE TABLE auth_sessions (
- token_digest BYTEA PRIMARY KEY,
- realm TEXT NOT NULL,
+ realm TEXT NOT NULL,
+ token_digest BYTEA NOT NULL,
subject JSONB NOT NULL,
issued_at TIMESTAMPTZ NOT NULL,
expires_at TIMESTAMPTZ NOT NULL,
-- Set on logout; the row is kept until natural expiry so the revocation
-- is honored fleet-wide for as long as the token would have been valid.
- revoked_at TIMESTAMPTZ NULL
+ revoked_at TIMESTAMPTZ NULL,
+ PRIMARY KEY (realm, token_digest)
);This also needs the matching Diesel schema and any SQL built around the current single-column key updated in the same PR.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| CREATE TABLE auth_sessions ( | |
| token_digest BYTEA PRIMARY KEY, | |
| realm TEXT NOT NULL, | |
| subject JSONB NOT NULL, | |
| issued_at TIMESTAMPTZ NOT NULL, | |
| expires_at TIMESTAMPTZ NOT NULL, | |
| -- Set on logout; the row is kept until natural expiry so the revocation | |
| -- is honored fleet-wide for as long as the token would have been valid. | |
| revoked_at TIMESTAMPTZ NULL | |
| ); | |
| CREATE TABLE auth_sessions ( | |
| realm TEXT NOT NULL, | |
| token_digest BYTEA NOT NULL, | |
| subject JSONB NOT NULL, | |
| issued_at TIMESTAMPTZ NOT NULL, | |
| expires_at TIMESTAMPTZ NOT NULL, | |
| -- Set on logout; the row is kept until natural expiry so the revocation | |
| -- is honored fleet-wide for as long as the token would have been valid. | |
| revoked_at TIMESTAMPTZ NULL, | |
| PRIMARY KEY (realm, token_digest) | |
| ); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql` around lines
7 - 16, The auth_sessions table is still keyed only by token_digest, so
realm-scoped sessions can collide across operator and EVM namespaces. Update the
migration to make the primary key composite on (realm, token_digest), then
mirror that change in crates/server/src/schema.rs and any code or SQL that
assumes a single-column key, such as auth session insert/delete/load paths, so
Diesel and the database agree on the new realm-scoped identifier.
| if let Err(error) = state | ||
| .dashboard | ||
| .logout(token.as_deref(), state.clock.now()) | ||
| .await; | ||
| .await | ||
| { | ||
| tracing::warn!(auth_event = "logout_failed", %error, "Operator logout revoke failed"); |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Fail closed when session revocation fails.
Returning success: true after a store revoke error can leave the server-side operator session valid until expiry, which is especially risky with shared Postgres sessions. Return an error response instead of reporting a successful logout. If this changes the documented response set, update the utoipa annotations and regenerated specs. As per coding guidelines, crates/server/src/api/**/*.rs: “After any HTTP contract change, regenerate committed specs with cargo run --features evm --bin gen-openapi -- docs.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/server/src/api/dashboard.rs` around lines 183 - 188, The logout flow
in the dashboard API currently logs revoke failures and still reports success,
so it should fail closed instead. Update the handler in the logout path that
calls state.dashboard.logout(...) to return an error response when revocation
fails instead of continuing to a success payload, and make sure the response
contract reflects this change. If the documented responses change, update the
utoipa annotations on the relevant endpoint and regenerate the committed OpenAPI
specs with the project’s gen-openapi command.
Source: Coding guidelines
| if let Err(error) = state | ||
| .evm | ||
| .sessions | ||
| .logout(token.as_deref(), state.clock.now()) | ||
| .await; | ||
| .await | ||
| { | ||
| tracing::warn!(auth_event = "evm_logout_failed", %error, "EVM logout revoke failed"); | ||
| } |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Propagate EVM logout revoke failures.
This now reports logout success even if the backing session store failed to revoke the token, leaving the session usable until expiry. Keep the previous fail-closed behavior.
Suggested fix
- if let Err(error) = state
- .evm
- .sessions
- .logout(token.as_deref(), state.clock.now())
- .await
- {
- tracing::warn!(auth_event = "evm_logout_failed", %error, "EVM logout revoke failed");
- }
+ state
+ .evm
+ .sessions
+ .logout(token.as_deref(), state.clock.now())
+ .await?;If the intended HTTP contract is to always return success, update the OpenAPI response documentation accordingly. As per coding guidelines, crates/server/src/api/**/*.rs: “After any HTTP contract change, regenerate committed specs with cargo run --features evm --bin gen-openapi -- docs.”
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if let Err(error) = state | |
| .evm | |
| .sessions | |
| .logout(token.as_deref(), state.clock.now()) | |
| .await; | |
| .await | |
| { | |
| tracing::warn!(auth_event = "evm_logout_failed", %error, "EVM logout revoke failed"); | |
| } | |
| state | |
| .evm | |
| .sessions | |
| .logout(token.as_deref(), state.clock.now()) | |
| .await?; |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/server/src/api/evm.rs` around lines 183 - 190, The EVM logout handler
currently swallows failures from state.evm.sessions.logout and still returns
success; restore the previous fail-closed behavior by propagating that error out
of the logout path instead of only logging it with tracing::warn!(auth_event =
"evm_logout_failed", ...). Update the logout endpoint logic in evm.rs so a
revoke failure results in an error response from the handler, and if you
intentionally keep success responses then also update the OpenAPI response
docs/specs to match the new HTTP contract.
Source: Coding guidelines
| let coordination = self.coordination; | ||
| let coordination_mode = coordination | ||
| .as_ref() | ||
| .map(|handles| handles.mode) | ||
| .unwrap_or(crate::coordination::CoordinationMode::SingleProcess); | ||
| let leader: Arc<dyn crate::coordination::LeaderElector> = coordination | ||
| .as_ref() | ||
| .map(|handles| handles.leader.clone()) | ||
| .unwrap_or_else(|| { | ||
| Arc::new(crate::coordination::AlwaysLeader::new( | ||
| crate::coordination::CANONICALIZATION_LEASE, | ||
| "single-process", | ||
| )) | ||
| }); | ||
| let dashboard = match self.dashboard { | ||
| Some(dashboard) => dashboard, | ||
| None => Arc::new(DashboardState::from_env_for_network(network_type).await?), | ||
| None => match coordination.as_ref() { | ||
| Some(handles) => Arc::new( | ||
| DashboardState::from_env_for_network_with_stores( | ||
| network_type, | ||
| handles.operator_sessions.clone(), | ||
| handles.operator_challenges.clone(), | ||
| ) | ||
| .await?, | ||
| ), | ||
| // Fail closed: the Postgres backend must not silently fall back to | ||
| // per-process dashboard state. Coordination handles are populated | ||
| // by StorageMetadataBuilder::build(); their absence here means a | ||
| // manual builder skipped them. | ||
| None if storage.kind() == crate::storage::StorageType::Postgres => { | ||
| return Err( | ||
| "Postgres storage requires coordination handles for shared dashboard \ | ||
| sessions/challenges; call .coordination(...) (populated by \ | ||
| StorageMetadataBuilder::build())" | ||
| .to_string(), | ||
| ); | ||
| } | ||
| None => Arc::new(DashboardState::from_env_for_network(network_type).await?), | ||
| }, | ||
| }; | ||
| #[cfg(feature = "evm")] | ||
| let evm = Arc::new(EvmAppState::from_env().await?); | ||
| let evm = { | ||
| let sessions = match coordination.as_ref() { | ||
| Some(handles) => crate::evm::EvmSessionState::new( | ||
| handles.evm_sessions.clone(), | ||
| handles.evm_challenges.clone(), | ||
| ), | ||
| None => crate::evm::EvmSessionState::default(), | ||
| }; | ||
| Arc::new(EvmAppState::from_env_with_sessions(sessions).await?) | ||
| }; |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Fail closed before accepting any Postgres build without coordination.
Line 484 only rejects missing coordination when self.dashboard is None. A manual Postgres builder that supplies a custom dashboard still falls through with AlwaysLeader, so multiple replicas can run canonicalization as leaders and use non-shared session state. Move the Postgres/coordination check immediately after let coordination = self.coordination;.
Proposed fix
let ack = self.ack.ok_or("AckRegistry not set. Use .ack(...)")?;
let coordination = self.coordination;
+ if coordination.is_none() && storage.kind() == crate::storage::StorageType::Postgres {
+ return Err(
+ "Postgres storage requires coordination handles for shared sessions/challenges \
+ and canonicalization leadership; call .coordination(...) (populated by \
+ StorageMetadataBuilder::build())"
+ .to_string(),
+ );
+ }
let coordination_mode = coordination
.as_ref()
.map(|handles| handles.mode)
.unwrap_or(crate::coordination::CoordinationMode::SingleProcess);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let coordination = self.coordination; | |
| let coordination_mode = coordination | |
| .as_ref() | |
| .map(|handles| handles.mode) | |
| .unwrap_or(crate::coordination::CoordinationMode::SingleProcess); | |
| let leader: Arc<dyn crate::coordination::LeaderElector> = coordination | |
| .as_ref() | |
| .map(|handles| handles.leader.clone()) | |
| .unwrap_or_else(|| { | |
| Arc::new(crate::coordination::AlwaysLeader::new( | |
| crate::coordination::CANONICALIZATION_LEASE, | |
| "single-process", | |
| )) | |
| }); | |
| let dashboard = match self.dashboard { | |
| Some(dashboard) => dashboard, | |
| None => Arc::new(DashboardState::from_env_for_network(network_type).await?), | |
| None => match coordination.as_ref() { | |
| Some(handles) => Arc::new( | |
| DashboardState::from_env_for_network_with_stores( | |
| network_type, | |
| handles.operator_sessions.clone(), | |
| handles.operator_challenges.clone(), | |
| ) | |
| .await?, | |
| ), | |
| // Fail closed: the Postgres backend must not silently fall back to | |
| // per-process dashboard state. Coordination handles are populated | |
| // by StorageMetadataBuilder::build(); their absence here means a | |
| // manual builder skipped them. | |
| None if storage.kind() == crate::storage::StorageType::Postgres => { | |
| return Err( | |
| "Postgres storage requires coordination handles for shared dashboard \ | |
| sessions/challenges; call .coordination(...) (populated by \ | |
| StorageMetadataBuilder::build())" | |
| .to_string(), | |
| ); | |
| } | |
| None => Arc::new(DashboardState::from_env_for_network(network_type).await?), | |
| }, | |
| }; | |
| #[cfg(feature = "evm")] | |
| let evm = Arc::new(EvmAppState::from_env().await?); | |
| let evm = { | |
| let sessions = match coordination.as_ref() { | |
| Some(handles) => crate::evm::EvmSessionState::new( | |
| handles.evm_sessions.clone(), | |
| handles.evm_challenges.clone(), | |
| ), | |
| None => crate::evm::EvmSessionState::default(), | |
| }; | |
| Arc::new(EvmAppState::from_env_with_sessions(sessions).await?) | |
| }; | |
| let coordination = self.coordination; | |
| if coordination.is_none() && storage.kind() == crate::storage::StorageType::Postgres { | |
| return Err( | |
| "Postgres storage requires coordination handles for shared sessions/challenges \ | |
| and canonicalization leadership; call .coordination(...) (populated by \ | |
| StorageMetadataBuilder::build())" | |
| .to_string(), | |
| ); | |
| } | |
| let coordination_mode = coordination | |
| .as_ref() | |
| .map(|handles| handles.mode) | |
| .unwrap_or(crate::coordination::CoordinationMode::SingleProcess); | |
| let leader: Arc<dyn crate::coordination::LeaderElector> = coordination | |
| .as_ref() | |
| .map(|handles| handles.leader.clone()) | |
| .unwrap_or_else(|| { | |
| Arc::new(crate::coordination::AlwaysLeader::new( | |
| crate::coordination::CANONICALIZATION_LEASE, | |
| "single-process", | |
| )) | |
| }); | |
| let dashboard = match self.dashboard { | |
| Some(dashboard) => dashboard, | |
| None => match coordination.as_ref() { | |
| Some(handles) => Arc::new( | |
| DashboardState::from_env_for_network_with_stores( | |
| network_type, | |
| handles.operator_sessions.clone(), | |
| handles.operator_challenges.clone(), | |
| ) | |
| .await?, | |
| ), | |
| // Fail closed: the Postgres backend must not silently fall back to | |
| // per-process dashboard state. Coordination handles are populated | |
| // by StorageMetadataBuilder::build(); their absence here means a | |
| // manual builder skipped them. | |
| None if storage.kind() == crate::storage::StorageType::Postgres => { | |
| return Err( | |
| "Postgres storage requires coordination handles for shared dashboard \ | |
| sessions/challenges; call .coordination(...) (populated by \ | |
| StorageMetadataBuilder::build())" | |
| .to_string(), | |
| ); | |
| } | |
| None => Arc::new(DashboardState::from_env_for_network(network_type).await?), | |
| }, | |
| }; | |
| #[cfg(feature = "evm")] | |
| let evm = { | |
| let sessions = match coordination.as_ref() { | |
| Some(handles) => crate::evm::EvmSessionState::new( | |
| handles.evm_sessions.clone(), | |
| handles.evm_challenges.clone(), | |
| ), | |
| None => crate::evm::EvmSessionState::default(), | |
| }; | |
| Arc::new(EvmAppState::from_env_with_sessions(sessions).await?) | |
| }; |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/server/src/builder/mod.rs` around lines 455 - 505, The Postgres
coordination guard in the builder is too late because it only runs inside the
dashboard branch, allowing a manual Postgres build with a custom dashboard to
still proceed with AlwaysLeader and per-process state. Move the
missing-coordination check in the builder flow immediately after `let
coordination = self.coordination;` in `crates/server/src/builder/mod.rs`, before
`coordination_mode`, `leader`, or `dashboard` are derived, so any Postgres build
without coordination handles fails closed regardless of how `dashboard` is
provided.
| match std::env::var(ENV_GUARDIAN_ENV) { | ||
| Ok(value) => Ok(value.eq_ignore_ascii_case(PROD_ENV)), | ||
| Err(std::env::VarError::NotPresent) => Ok(false), |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Trim GUARDIAN_ENV before applying production guards.
GUARDIAN_ENV="prod " currently evaluates as non-prod, which can skip prod-only startup guards such as rejecting filesystem-backed deployments.
Proposed fix
- Ok(value) => Ok(value.eq_ignore_ascii_case(PROD_ENV)),
+ Ok(value) => Ok(value.trim().eq_ignore_ascii_case(PROD_ENV)),📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| match std::env::var(ENV_GUARDIAN_ENV) { | |
| Ok(value) => Ok(value.eq_ignore_ascii_case(PROD_ENV)), | |
| Err(std::env::VarError::NotPresent) => Ok(false), | |
| match std::env::var(ENV_GUARDIAN_ENV) { | |
| Ok(value) => Ok(value.trim().eq_ignore_ascii_case(PROD_ENV)), | |
| Err(std::env::VarError::NotPresent) => Ok(false), |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/server/src/config/stage.rs` around lines 9 - 11, `is_prod` currently
checks ENV_GUARDIAN_ENV directly, so values like "prod " are treated as
non-production and bypass startup guards. Update the logic in `stage.rs` to trim
the environment value before comparing it in `eq_ignore_ascii_case(PROD_ENV)`,
keeping the existing handling for `VarError::NotPresent` and other errors
unchanged.
| a time. Terraform also sets `GUARDIAN_MAX_REPLICAS` from | ||
| `effective_server_autoscaling_max_capacity` (`max(desired, 6)`) so rate limits | ||
| are partitioned across the fleet, and requires `GUARDIAN_DASHBOARD_CURSOR_SECRET`. |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Use effective_guardian_max_replicas wording to match actual Terraform behavior.
This currently implies a direct mapping from autoscaling max only. Please document that GUARDIAN_MAX_REPLICAS is set from the effective value (default from autoscaling max, but overrideable and clamped upward), so operator guidance matches the deployed contract.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/SERVER_AWS_DEPLOY.md` around lines 556 - 558, Update the
Terraform/runtime docs to use the exact `effective_guardian_max_replicas`
terminology and clarify the `GUARDIAN_MAX_REPLICAS` contract. In the
`SERVER_AWS_DEPLOY` guidance, explain that `GUARDIAN_MAX_REPLICAS` comes from
the effective value (derived from `effective_server_autoscaling_max_capacity`,
can be overridden, and is clamped upward) rather than implying a direct
autoscaling-max-only mapping, so the wording matches the deployed behavior.
| variable "guardian_max_replicas" { | ||
| description = <<-EOT | ||
| Optional override for GUARDIAN_MAX_REPLICAS, the maximum replica capacity the | ||
| server divides global rate limits by. Defaults to the effective autoscaling | ||
| max capacity. Drives rate-limit partitioning only (coordination mode is | ||
| backend-derived). A value below the real max lets the aggregate exceed the | ||
| global limit, so an explicit override must be >= the autoscaling max. | ||
| EOT | ||
| type = number | ||
| default = null | ||
|
|
||
| validation { | ||
| condition = var.guardian_max_replicas == null || var.guardian_max_replicas >= 1 | ||
| error_message = "guardian_max_replicas must be >= 1 when set." |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Restrict guardian_max_replicas to whole numbers.
type = number currently allows values like 2.5. That gets stringified in infra/ecs.tf and then RateLimitConfig::from_env parses it as u32; a fractional value fails parsing and silently falls back to 1, which disables the intended fleet-wide partitioning and makes the aggregate limit too loose.
Suggested fix
variable "guardian_max_replicas" {
@@
validation {
- condition = var.guardian_max_replicas == null || var.guardian_max_replicas >= 1
- error_message = "guardian_max_replicas must be >= 1 when set."
+ condition = var.guardian_max_replicas == null || (
+ var.guardian_max_replicas >= 1 &&
+ floor(var.guardian_max_replicas) == var.guardian_max_replicas
+ )
+ error_message = "guardian_max_replicas must be a whole number >= 1 when set."
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| variable "guardian_max_replicas" { | |
| description = <<-EOT | |
| Optional override for GUARDIAN_MAX_REPLICAS, the maximum replica capacity the | |
| server divides global rate limits by. Defaults to the effective autoscaling | |
| max capacity. Drives rate-limit partitioning only (coordination mode is | |
| backend-derived). A value below the real max lets the aggregate exceed the | |
| global limit, so an explicit override must be >= the autoscaling max. | |
| EOT | |
| type = number | |
| default = null | |
| validation { | |
| condition = var.guardian_max_replicas == null || var.guardian_max_replicas >= 1 | |
| error_message = "guardian_max_replicas must be >= 1 when set." | |
| variable "guardian_max_replicas" { | |
| description = <<-EOT | |
| Optional override for GUARDIAN_MAX_REPLICAS, the maximum replica capacity the | |
| server divides global rate limits by. Defaults to the effective autoscaling | |
| max capacity. Drives rate-limit partitioning only (coordination mode is | |
| backend-derived). A value below the real max lets the aggregate exceed the | |
| global limit, so an explicit override must be >= the autoscaling max. | |
| EOT | |
| type = number | |
| default = null | |
| validation { | |
| condition = var.guardian_max_replicas == null || ( | |
| var.guardian_max_replicas >= 1 && | |
| floor(var.guardian_max_replicas) == var.guardian_max_replicas | |
| ) | |
| error_message = "guardian_max_replicas must be a whole number >= 1 when set." | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@infra/variables.tf` around lines 305 - 318, The guardian_max_replicas
variable currently accepts fractional numbers, but infra/ecs.tf stringifies it
and RateLimitConfig::from_env later parses it as u32, so non-integers can fall
back to 1 and break rate-limit partitioning. Update the validation on
guardian_max_replicas in variables.tf to require whole numbers (and keep the
existing >= 1 rule), so only integer values are allowed. Use the existing
guardian_max_replicas variable block and its validation stanza as the place to
enforce this.
| where `local.effective_guardian_max_replicas = var.guardian_max_replicas != null ? var.guardian_max_replicas : local.effective_server_autoscaling_max_capacity` | ||
| (new `var.guardian_max_replicas` defaults to `null`, i.e. derive from max | ||
| capacity; operators may override). |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
The effective_guardian_max_replicas formula here is missing the upward clamp.
Please update this expression to reflect the implemented contract (explicit override cannot go below autoscaling max). Right now the example allows a too-low override unless separately validated, which conflicts with the runbook behavior.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@speckit/features/010-horizontal-scaling/contracts/config-contract.md` around
lines 40 - 42, The `effective_guardian_max_replicas` example in the config
contract is missing the upward clamp and should match the implemented behavior.
Update the expression near `local.effective_guardian_max_replicas` so it
preserves the explicit `var.guardian_max_replicas` override only when it is at
least the autoscaling max capacity, otherwise falls back to
`local.effective_server_autoscaling_max_capacity`; this keeps the documented
contract aligned with the runbook and the actual validation logic.
| All timestamps are `TIMESTAMPTZ` and all expiry comparisons use the database | ||
| clock (`now()`), giving a single authoritative clock across replicas. | ||
|
|
||
| ### Migration concurrency (multi-replica startup) — REQUIRED |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Fix the heading level jump.
This starts at ### directly under the H1 section, which trips markdownlint MD001 and makes the outline inconsistent.
Suggested fix
-### Migration concurrency (multi-replica startup) — REQUIRED
+## Migration concurrency (multi-replica startup) — REQUIRED📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ### Migration concurrency (multi-replica startup) — REQUIRED | |
| ## Migration concurrency (multi-replica startup) — REQUIRED |
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 14-14: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@speckit/features/010-horizontal-scaling/data-model.md` at line 14, The
document has a heading level jump because the “Migration concurrency
(multi-replica startup) — REQUIRED” section uses a third-level heading directly
under the H1. Update that heading in the data-model markdown to the correct
level so the outline stays consistent and markdownlint MD001 is satisfied.
Source: Linters/SAST tools
| - The shipped production image is built with the Postgres storage backend, so a | ||
| shared relational database is available to replicas and is the natural shared | ||
| coordination/state store for sessions, challenges, leadership, cursors, and | ||
| rate-limit counters. No new infrastructure component (e.g. a separate cache or | ||
| queue) is assumed to be mandatory; if one is proposed it will be justified in |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Remove rate-limit counters from this assumption.
This sentence conflicts with FR-010 and the later dependency text, which explicitly say rate limiting stays per-process and does not add a shared-store dependency. Leaving both statements in the spec makes the contract ambiguous.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@speckit/features/010-horizontal-scaling/spec.md` around lines 413 - 417, The
shared-state assumption in the horizontal-scaling spec incorrectly includes
rate-limit counters, which conflicts with FR-010 and the later dependency text.
Update the sentence in the spec so the Postgres-backed shared store is described
for sessions, challenges, leadership, and cursors only, and remove any mention
of rate-limit counters from that shared coordination/state store assumption.
Codecov Report❌ Patch coverage is ❌ Your patch status has failed because the patch coverage (72.76%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #301 +/- ##
==========================================
- Coverage 76.88% 76.47% -0.42%
==========================================
Files 153 159 +6
Lines 27382 28274 +892
==========================================
+ Hits 21054 21622 +568
- Misses 6328 6652 +324 Continue to review full report in Codecov by Harness.
🚀 New features to boost your workflow:
|
- Attribute cursor-secret enforcement to the Compose ${VAR:?} expansion rather
than a server-side prod guard (the server cursor secret is optional on main;
the hard requirement lands with #301).
- Production compose: require an explicit GUARDIAN_VERSION (drop the :latest
default) and bind the metrics port to loopback (127.0.0.1:9464).
- Track B smoke: drop the unconfirmed "storage encryption" log grep; rely on
ECDSA-signer-ready + clean startup.
- Remove the design-spec artifact from the PR (brainstorming doc, not repo
content).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary by CodeRabbit
New Features
Bug Fixes
Documentation