feat: 010 scalability improvements by zeljkoX · Pull Request #301 · OpenZeppelin/guardian

zeljkoX · 2026-06-24T10:54:59Z

Summary by CodeRabbit

New Features
- Added horizontal-scaling support for shared sessions, login challenges, and worker coordination across replicas.
- Introduced fleet-wide rate limiting with per-replica partitioning and new startup visibility into coordination mode.
Bug Fixes
- Logout flows now continue to succeed even if session revocation hits a temporary backend error.
- Production startup now fails fast when required shared settings are missing or the filesystem backend is used.
Documentation
- Updated configuration, deployment, and ops runbooks for multi-replica production setups.

coderabbitai · 2026-06-24T10:55:59Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8d1145e5-746a-42ca-ba01-86cb4e02a207

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

✅ Review completed - (🔄 Check again to review again)

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch 010-scalability-improvements

_{Comment @coderabbitai help to get the list of available commands.}

Copilot

Pull request overview

This PR implements the first slice of feature 010 horizontal scaling correctness for the Guardian server by moving previously per-process coordination state (auth sessions/challenges and canonicalization ownership) into shared Postgres-backed coordination, adding per-replica rate-limit partitioning keyed by an infra-sourced max replica capacity, and documenting the required HA configuration.

Changes:

Introduces a coordination subsystem with in-memory (filesystem/dev) and Postgres (prod) implementations for sessions, challenges, and canonicalization leadership.
Updates canonicalization to run under a shared lease with renewal + fencing checks, and adds periodic sweeps for expired sessions/challenges.
Wires GUARDIAN_MAX_REPLICAS through Terraform and documents HA/runbook guidance + prod-stage startup guards.

Reviewed changes

Copilot reviewed 47 out of 48 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
speckit/features/010-horizontal-scaling/spec.md	New feature spec describing HA correctness goals, scenarios, and requirements.
speckit/features/010-horizontal-scaling/data-model.md	Defines Postgres tables + concurrency approach (advisory lock) for replica-safe coordination.
speckit/features/010-horizontal-scaling/contracts/db-schema.md	Migration contract for new coordination tables + required atomic operations.
speckit/features/010-horizontal-scaling/contracts/coordination-traits.md	Contract for new internal coordination traits and their selection rules.
speckit/features/010-horizontal-scaling/contracts/config-contract.md	Contract for env vars and prod-stage startup guards/log line for HA mode.
packages/miden-multisig-client/test-results/.last-run.json	Adds a test-results artifact file under the TS multisig client.
infra/variables.tf	Adds `guardian_max_replicas` Terraform variable and validation.
infra/ecs.tf	Injects `GUARDIAN_MAX_REPLICAS` into the ECS task definition environment.
infra/data.tf	Computes `effective_guardian_max_replicas` (clamped to autoscaling max capacity).
docs/SERVER_AWS_DEPLOY.md	Adds a section explaining HA behavior for the prod profile and links the runbook.
docs/runbooks/horizontal-scaling.md	New operator runbook for multi-replica HA deployments and trade-offs/failure modes.
docs/CONFIGURATION.md	Documents `GUARDIAN_MAX_REPLICAS`, prod-stage guards, and HA behavior notes.
crates/server/src/storage/postgres.rs	Serializes Diesel migrations across replicas using a Postgres advisory lock.
crates/server/src/schema.rs	Adds Diesel schema definitions for new coordination tables.
crates/server/src/middleware/rate_limit.rs	Partitions rate limits by `GUARDIAN_MAX_REPLICAS` + adds tests and warnings.
crates/server/src/main.rs	Threads coordination handles from storage builder into server builder.
crates/server/src/lib.rs	Exposes new `config` and `coordination` modules.
crates/server/src/jobs/canonicalization/worker.rs	Acquires/renews a shared lease and cancels work if leadership is lost.
crates/server/src/jobs/canonicalization/processor.rs	Adds fencing checks and cooperative cancellation checkpoints during canonicalization.
crates/server/src/evm/session.rs	Moves EVM auth challenges/sessions to coordination stores; adds sweep support.
crates/server/src/evm/mod.rs	Allows constructing EVM app state with injected session/challenge stores.
crates/server/src/dashboard/types.rs	Removes now-obsolete in-memory session/challenge record types.
crates/server/src/dashboard/state.rs	Moves operator challenges/sessions to coordination stores; adds prod cursor-secret guard and sweeps.
crates/server/src/coordination/session_store.rs	New `SessionStore` trait + in-memory implementation.
crates/server/src/coordination/postgres/session_store.rs	New Postgres `SessionStore` implementation backed by `auth_sessions`.
crates/server/src/coordination/postgres/mod.rs	New Postgres coordination module utilities and exports.
crates/server/src/coordination/postgres/lease.rs	New Postgres lease elector implementation backed by `worker_leases`.
crates/server/src/coordination/postgres/challenge_store.rs	New Postgres `ChallengeStore` implementation backed by `auth_challenges`.
crates/server/src/coordination/mod.rs	New coordination handle wiring + backend-derived coordination mode selection.
crates/server/src/coordination/leader.rs	New `LeaderElector` trait + `AlwaysLeader` in-memory implementation.
crates/server/src/coordination/challenge_store.rs	New `ChallengeStore` trait + in-memory implementation + payload encoding/decoding.
crates/server/src/config/stage.rs	Centralizes `GUARDIAN_ENV=prod` detection used by startup guards.
crates/server/src/config/mod.rs	New config module root.
crates/server/src/builder/storage.rs	Returns `CoordinationHandles` from storage builder (Postgres vs in-memory).
crates/server/src/builder/startup.rs	Logs a single “coordination” line describing resolved mode/backend/stage/max_replicas/secret source.
crates/server/src/builder/mod.rs	Wires coordination into dashboard/EVM state construction and adds prod guard for rate-limit partitioning to zero.
crates/server/src/builder/handle.rs	Starts canonicalization with a leader handle and adds a periodic session/challenge sweep worker.
crates/server/src/api/evm.rs	Adjusts EVM logout to handle possible revoke errors (currently logs + continues).
crates/server/src/api/dashboard.rs	Adjusts operator logout to handle possible revoke errors (currently logs + continues).
crates/server/src/ack/mod.rs	Reuses the new `config::stage::is_prod()` helper for ACK secret selection.
crates/server/migrations/2026-06-24-000001_worker_leases/up.sql	New migration creating `worker_leases`.
crates/server/migrations/2026-06-24-000001_worker_leases/down.sql	New migration dropping `worker_leases`.
crates/server/migrations/2026-06-23-000002_auth_challenges/up.sql	New migration creating `auth_challenges` + indexes.
crates/server/migrations/2026-06-23-000002_auth_challenges/down.sql	New migration dropping `auth_challenges` + indexes.
crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql	New migration creating `auth_sessions` + indexes.
crates/server/migrations/2026-06-23-000001_auth_sessions/down.sql	New migration dropping `auth_sessions` + indexes.
crates/server/Cargo.toml	Adds `tokio-util` dependency for cancellation support in canonicalization.
Cargo.lock	Updates lockfile for new dependency.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

    let token = extract_cookie(&headers, state.dashboard.cookie_name());
-    state
+    if let Err(error) = state
        .dashboard
        .logout(token.as_deref(), state.clock.now())
-        .await;
+        .await
+    {
+        tracing::warn!(auth_event = "logout_failed", %error, "Operator logout revoke failed");
+    }


    let token = extract_cookie(&headers, state.evm.sessions.cookie_name());
-    state
+    if let Err(error) = state
        .evm
        .sessions
        .logout(token.as_deref(), state.clock.now())
-        .await;
+        .await
+    {
+        tracing::warn!(auth_event = "evm_logout_failed", %error, "EVM logout revoke failed");
+    }


+                if !cfg!(test) {
+                    tracing::warn!(
+                        "dashboard cursor secret not configured; generating ephemeral per-process \
+                         secret. Multi-replica deployments must set \
+                         GUARDIAN_DASHBOARD_CURSOR_SECRET to a stable shared 32-byte hex value."
+                    );


+variable "guardian_max_replicas" {
+  description = <<-EOT
+    Optional override for GUARDIAN_MAX_REPLICAS, the maximum replica capacity the
+    server divides global rate limits by. Defaults to the effective autoscaling
+    max capacity. Drives rate-limit partitioning only (coordination mode is
+    backend-derived). A value below the real max lets the aggregate exceed the
+    global limit, so an explicit override must be >= the autoscaling max.
+  EOT
+  type        = number
+  default     = null


+{
+  "status": "passed",
+  "failedTests": []
+}


+  validation {
+    condition     = var.guardian_max_replicas == null || var.guardian_max_replicas >= 1
+    error_message = "guardian_max_replicas must be >= 1 when set."
+  }


coderabbitai

Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

crates/server/src/jobs/canonicalization/processor.rs (2)
359-400: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Avoid partial canonicalization when the lease drops mid-sequence.

submit_state(...) commits before the second lease check at Line 400. If the lease is lost after Line 359 but before Line 400, the processor aborts with the account state advanced while the delta remains non-canonical. This needs an atomic, fenced write path for state + canonical delta + metadata, or an explicit idempotent recovery strategy.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/server/src/jobs/canonicalization/processor.rs` around lines 359 - 400,
The canonicalization flow in `processor.rs` can leave state partially applied
because `submit_state`, auth syncing, and the canonical delta update happen
before the final lease validation. Update the `canonicalization` path around
`ensure_lease_held`, `submit_state`, and `update_auth` so the state write and
delta canonicalization are fenced atomically under a single lease check, or add
an idempotent recovery mechanism that safely reconciles `updated_state`,
`canonical_delta`, and metadata if the lease is lost mid-sequence. Ensure the
fix keeps `submit_state`, `update_auth`, and the final `DeltaStatus::canonical`
transition consistent with one another.
266-280: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Fence metadata writes as well as storage writes.

set_has_pending_candidate(...) and update_auth(...) are state-mutating writes but are not guarded immediately before execution. A replica can pass an earlier fence, lose the lease, then still mutate metadata. Add ensure_lease_held(&delta).await? directly before each metadata mutation, or move these writes into a single fenced storage operation.

Also applies to: 383-386, 409-412
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/server/src/jobs/canonicalization/processor.rs` around lines 266 - 280,
The discard path in processor.rs mutates metadata after a storage delete without
re-checking the lease, so a replica can stale-write state after losing
ownership. In the discard flow around ensure_lease_held, call
ensure_lease_held(&delta).await? immediately before set_has_pending_candidate,
and apply the same fencing to every other state mutation in this file,
especially update_auth in the other affected branches, or consolidate those
writes into one fenced operation.

🧹 Nitpick comments (1)

crates/server/src/dashboard/state.rs (1)
871-901: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Make the digest-keying test observe the inserted store key.

This test would still pass if the implementation regressed to storing and looking up plaintext tokens, because it only checks session_digest(token) != token and then authenticates through the same production path. Inject a small recording SessionStore and assert insert(...) receives session_digest(token), never the raw token bytes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/server/src/dashboard/state.rs` around lines 871 - 901, The
digest-keying test currently only proves the digest differs from the token and
that authentication succeeds, so it can miss a regression where plaintext tokens
are stored. Update the test around
session_is_keyed_by_digest_not_plaintext_token to use a small recording
SessionStore (or equivalent test double) so you can inspect the key passed into
insert and confirm it matches crate::secret::session_digest(token) rather than
the raw token bytes. Keep the existing round-trip check, but add a direct
assertion on the recorded inserted key to verify DashboardState::verify stores
only the digest.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql`:
- Around line 7-16: The auth_sessions table is still keyed only by token_digest,
so realm-scoped sessions can collide across operator and EVM namespaces. Update
the migration to make the primary key composite on (realm, token_digest), then
mirror that change in crates/server/src/schema.rs and any code or SQL that
assumes a single-column key, such as auth session insert/delete/load paths, so
Diesel and the database agree on the new realm-scoped identifier.

In `@crates/server/src/api/dashboard.rs`:
- Around line 183-188: The logout flow in the dashboard API currently logs
revoke failures and still reports success, so it should fail closed instead.
Update the handler in the logout path that calls state.dashboard.logout(...) to
return an error response when revocation fails instead of continuing to a
success payload, and make sure the response contract reflects this change. If
the documented responses change, update the utoipa annotations on the relevant
endpoint and regenerate the committed OpenAPI specs with the project’s
gen-openapi command.

In `@crates/server/src/api/evm.rs`:
- Around line 183-190: The EVM logout handler currently swallows failures from
state.evm.sessions.logout and still returns success; restore the previous
fail-closed behavior by propagating that error out of the logout path instead of
only logging it with tracing::warn!(auth_event = "evm_logout_failed", ...).
Update the logout endpoint logic in evm.rs so a revoke failure results in an
error response from the handler, and if you intentionally keep success responses
then also update the OpenAPI response docs/specs to match the new HTTP contract.

In `@crates/server/src/builder/mod.rs`:
- Around line 455-505: The Postgres coordination guard in the builder is too
late because it only runs inside the dashboard branch, allowing a manual
Postgres build with a custom dashboard to still proceed with AlwaysLeader and
per-process state. Move the missing-coordination check in the builder flow
immediately after `let coordination = self.coordination;` in
`crates/server/src/builder/mod.rs`, before `coordination_mode`, `leader`, or
`dashboard` are derived, so any Postgres build without coordination handles
fails closed regardless of how `dashboard` is provided.

In `@crates/server/src/config/stage.rs`:
- Around line 9-11: `is_prod` currently checks ENV_GUARDIAN_ENV directly, so
values like "prod " are treated as non-production and bypass startup guards.
Update the logic in `stage.rs` to trim the environment value before comparing it
in `eq_ignore_ascii_case(PROD_ENV)`, keeping the existing handling for
`VarError::NotPresent` and other errors unchanged.

In `@crates/server/src/coordination/postgres/challenge_store.rs`:
- Around line 76-121: The issue is that concurrent issue() calls in
ChallengeStore::issue can bypass the per-principal cap because each transaction
sees stale outstanding counts before commit. Add a transaction-scoped lock keyed
by (realm, principal) at the start of the transaction in
PostgresChallengeStore::issue, before the INSERT/DELETE cleanup sequence, so
only one issuance for the same principal runs at a time. Use the existing
transaction block around conn.transaction and place the lock before the
auth_challenges statements that enforce max_outstanding.

In `@crates/server/src/coordination/postgres/lease.rs`:
- Around line 55-69: The lease flow in acquire/release is resetting fence_token
instead of keeping it strictly increasing, which breaks monotonic fencing across
reacquire. Update the lease handling in the try_acquire logic and the release
method so a holder’s reacquisition always advances fence_token, and avoid
deleting the worker_leases row on release if that would reset the counter. Use
the existing worker_leases upsert and fence_token update logic as the place to
enforce the monotonic increment on every acquisition path.

In `@crates/server/src/middleware/rate_limit.rs`:
- Around line 463-465: The env-mutation lock is not being applied consistently
across the rate-limit test suite, so `test_rate_limit_layer_from_env` can still
race with other tests. Update that test to acquire `ENV_LOCK` before touching
shared environment state, and make sure it also clears `GUARDIAN_MAX_REPLICAS`
along with the existing rate-limit env vars. Review the other env-mutating tests
in `rate_limit.rs` and wrap them with the same `ENV_LOCK` pattern so all
shared-process-environment changes are serialized.

In `@crates/server/src/storage/postgres.rs`:
- Around line 45-51: The migration lock acquisition in the postgres startup path
uses pg_advisory_lock and can block forever, so update the lock logic in the
connection/setup flow around the advisory lock call to fail fast instead of
waiting indefinitely. Use pg_try_advisory_lock with a retry deadline/backoff, or
apply a statement/lock timeout before attempting the lock, and keep the existing
error propagation through the migration lock acquisition code so startup can
recover when the lock is held too long.

In `@docs/CONFIGURATION.md`:
- Line 288: The Multi-replica (HA) configuration row overstates what the prod
Terraform profile currently provides; update the wording in the documentation
table to reflect only the variables actually injected by the ECS task
definition. Use the existing configuration entry for the HA quick-start to make
it clear that `GUARDIAN_MAX_REPLICAS` is supplied, while
`GUARDIAN_DASHBOARD_CURSOR_SECRET` must be set separately and is not wired by
the prod Terraform profile.

In `@docs/SERVER_AWS_DEPLOY.md`:
- Around line 556-558: Update the Terraform/runtime docs to use the exact
`effective_guardian_max_replicas` terminology and clarify the
`GUARDIAN_MAX_REPLICAS` contract. In the `SERVER_AWS_DEPLOY` guidance, explain
that `GUARDIAN_MAX_REPLICAS` comes from the effective value (derived from
`effective_server_autoscaling_max_capacity`, can be overridden, and is clamped
upward) rather than implying a direct autoscaling-max-only mapping, so the
wording matches the deployed behavior.

In `@infra/variables.tf`:
- Around line 305-318: The guardian_max_replicas variable currently accepts
fractional numbers, but infra/ecs.tf stringifies it and
RateLimitConfig::from_env later parses it as u32, so non-integers can fall back
to 1 and break rate-limit partitioning. Update the validation on
guardian_max_replicas in variables.tf to require whole numbers (and keep the
existing >= 1 rule), so only integer values are allowed. Use the existing
guardian_max_replicas variable block and its validation stanza as the place to
enforce this.

In `@speckit/features/010-horizontal-scaling/contracts/config-contract.md`:
- Around line 40-42: The `effective_guardian_max_replicas` example in the config
contract is missing the upward clamp and should match the implemented behavior.
Update the expression near `local.effective_guardian_max_replicas` so it
preserves the explicit `var.guardian_max_replicas` override only when it is at
least the autoscaling max capacity, otherwise falls back to
`local.effective_server_autoscaling_max_capacity`; this keeps the documented
contract aligned with the runbook and the actual validation logic.

In `@speckit/features/010-horizontal-scaling/data-model.md`:
- Line 14: The document has a heading level jump because the “Migration
concurrency (multi-replica startup) — REQUIRED” section uses a third-level
heading directly under the H1. Update that heading in the data-model markdown to
the correct level so the outline stays consistent and markdownlint MD001 is
satisfied.

In `@speckit/features/010-horizontal-scaling/spec.md`:
- Around line 413-417: The shared-state assumption in the horizontal-scaling
spec incorrectly includes rate-limit counters, which conflicts with FR-010 and
the later dependency text. Update the sentence in the spec so the
Postgres-backed shared store is described for sessions, challenges, leadership,
and cursors only, and remove any mention of rate-limit counters from that shared
coordination/state store assumption.

---

Outside diff comments:
In `@crates/server/src/jobs/canonicalization/processor.rs`:
- Around line 359-400: The canonicalization flow in `processor.rs` can leave
state partially applied because `submit_state`, auth syncing, and the canonical
delta update happen before the final lease validation. Update the
`canonicalization` path around `ensure_lease_held`, `submit_state`, and
`update_auth` so the state write and delta canonicalization are fenced
atomically under a single lease check, or add an idempotent recovery mechanism
that safely reconciles `updated_state`, `canonical_delta`, and metadata if the
lease is lost mid-sequence. Ensure the fix keeps `submit_state`, `update_auth`,
and the final `DeltaStatus::canonical` transition consistent with one another.
- Around line 266-280: The discard path in processor.rs mutates metadata after a
storage delete without re-checking the lease, so a replica can stale-write state
after losing ownership. In the discard flow around ensure_lease_held, call
ensure_lease_held(&delta).await? immediately before set_has_pending_candidate,
and apply the same fencing to every other state mutation in this file,
especially update_auth in the other affected branches, or consolidate those
writes into one fenced operation.

---

Nitpick comments:
In `@crates/server/src/dashboard/state.rs`:
- Around line 871-901: The digest-keying test currently only proves the digest
differs from the token and that authentication succeeds, so it can miss a
regression where plaintext tokens are stored. Update the test around
session_is_keyed_by_digest_not_plaintext_token to use a small recording
SessionStore (or equivalent test double) so you can inspect the key passed into
insert and confirm it matches crate::secret::session_digest(token) rather than
the raw token bytes. Keep the existing round-trip check, but add a direct
assertion on the recorded inserted key to verify DashboardState::verify stores
only the digest.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 58f01332-ef5e-48a4-a152-748af84dfb23

📥 Commits

Reviewing files that changed from the base of the PR and between 57a43d1 and 7e4de78.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (47)

crates/server/Cargo.toml
crates/server/migrations/2026-06-23-000001_auth_sessions/down.sql
crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql
crates/server/migrations/2026-06-23-000002_auth_challenges/down.sql
crates/server/migrations/2026-06-23-000002_auth_challenges/up.sql
crates/server/migrations/2026-06-24-000001_worker_leases/down.sql
crates/server/migrations/2026-06-24-000001_worker_leases/up.sql
crates/server/src/ack/mod.rs
crates/server/src/api/dashboard.rs
crates/server/src/api/evm.rs
crates/server/src/builder/handle.rs
crates/server/src/builder/mod.rs
crates/server/src/builder/startup.rs
crates/server/src/builder/storage.rs
crates/server/src/config/mod.rs
crates/server/src/config/stage.rs
crates/server/src/coordination/challenge_store.rs
crates/server/src/coordination/leader.rs
crates/server/src/coordination/mod.rs
crates/server/src/coordination/postgres/challenge_store.rs
crates/server/src/coordination/postgres/lease.rs
crates/server/src/coordination/postgres/mod.rs
crates/server/src/coordination/postgres/session_store.rs
crates/server/src/coordination/session_store.rs
crates/server/src/dashboard/state.rs
crates/server/src/dashboard/types.rs
crates/server/src/evm/mod.rs
crates/server/src/evm/session.rs
crates/server/src/jobs/canonicalization/processor.rs
crates/server/src/jobs/canonicalization/worker.rs
crates/server/src/lib.rs
crates/server/src/main.rs
crates/server/src/middleware/rate_limit.rs
crates/server/src/schema.rs
crates/server/src/storage/postgres.rs
docs/CONFIGURATION.md
docs/SERVER_AWS_DEPLOY.md
docs/runbooks/horizontal-scaling.md
infra/data.tf
infra/ecs.tf
infra/variables.tf
packages/miden-multisig-client/test-results/.last-run.json
speckit/features/010-horizontal-scaling/contracts/config-contract.md
speckit/features/010-horizontal-scaling/contracts/coordination-traits.md
speckit/features/010-horizontal-scaling/contracts/db-schema.md
speckit/features/010-horizontal-scaling/data-model.md
speckit/features/010-horizontal-scaling/spec.md

💤 Files with no reviewable changes (1)

crates/server/src/dashboard/types.rs

coderabbitai · 2026-06-24T11:10:31Z

+CREATE TABLE auth_sessions (
+    token_digest BYTEA       PRIMARY KEY,
+    realm        TEXT        NOT NULL,
+    subject      JSONB       NOT NULL,
+    issued_at    TIMESTAMPTZ NOT NULL,
+    expires_at   TIMESTAMPTZ NOT NULL,
+    -- Set on logout; the row is kept until natural expiry so the revocation
+    -- is honored fleet-wide for as long as the token would have been valid.
+    revoked_at   TIMESTAMPTZ NULL
+);


🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Make the session key realm-scoped in the database.

The table comment says operator and EVM sessions are isolated by realm, but the actual key is still token_digest alone here, and crates/server/src/schema.rs:111-120 mirrors that shape. If both realms ever mint the same token bytes, one insert will fail even though the contract says they should coexist. Make the primary key (realm, token_digest) so the DB enforces the namespace boundary instead of relying on token randomness.

Suggested direction

CREATE TABLE auth_sessions ( - token_digest BYTEA PRIMARY KEY, - realm TEXT NOT NULL, + realm TEXT NOT NULL, + token_digest BYTEA NOT NULL, subject JSONB NOT NULL, issued_at TIMESTAMPTZ NOT NULL, expires_at TIMESTAMPTZ NOT NULL, -- Set on logout; the row is kept until natural expiry so the revocation -- is honored fleet-wide for as long as the token would have been valid. - revoked_at TIMESTAMPTZ NULL + revoked_at TIMESTAMPTZ NULL, + PRIMARY KEY (realm, token_digest) );

This also needs the matching Diesel schema and any SQL built around the current single-column key updated in the same PR.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

CREATE TABLE auth_sessions (

token_digest BYTEA PRIMARY KEY,

realm TEXT NOT NULL,

subject JSONB NOT NULL,

issued_at TIMESTAMPTZ NOT NULL,

expires_at TIMESTAMPTZ NOT NULL,

-- Set on logout; the row is kept until natural expiry so the revocation

-- is honored fleet-wide for as long as the token would have been valid.

revoked_at TIMESTAMPTZ NULL

);

CREATE TABLE auth_sessions (

realm TEXT NOT NULL,

token_digest BYTEA NOT NULL,

subject JSONB NOT NULL,

issued_at TIMESTAMPTZ NOT NULL,

expires_at TIMESTAMPTZ NOT NULL,

-- Set on logout; the row is kept until natural expiry so the revocation

-- is honored fleet-wide for as long as the token would have been valid.

revoked_at TIMESTAMPTZ NULL,

PRIMARY KEY (realm, token_digest)

);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql` around lines 7 - 16, The auth_sessions table is still keyed only by token_digest, so realm-scoped sessions can collide across operator and EVM namespaces. Update the migration to make the primary key composite on (realm, token_digest), then mirror that change in crates/server/src/schema.rs and any code or SQL that assumes a single-column key, such as auth session insert/delete/load paths, so Diesel and the database agree on the new realm-scoped identifier.

coderabbitai · 2026-06-24T11:10:31Z

+    if let Err(error) = state
        .dashboard
        .logout(token.as_deref(), state.clock.now())
-        .await;
+        .await
+    {
+        tracing::warn!(auth_event = "logout_failed", %error, "Operator logout revoke failed");


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Fail closed when session revocation fails.

Returning success: true after a store revoke error can leave the server-side operator session valid until expiry, which is especially risky with shared Postgres sessions. Return an error response instead of reporting a successful logout. If this changes the documented response set, update the utoipa annotations and regenerated specs. As per coding guidelines, crates/server/src/api/**/*.rs: “After any HTTP contract change, regenerate committed specs with cargo run --features evm --bin gen-openapi -- docs.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/server/src/api/dashboard.rs` around lines 183 - 188, The logout flow in the dashboard API currently logs revoke failures and still reports success, so it should fail closed instead. Update the handler in the logout path that calls state.dashboard.logout(...) to return an error response when revocation fails instead of continuing to a success payload, and make sure the response contract reflects this change. If the documented responses change, update the utoipa annotations on the relevant endpoint and regenerate the committed OpenAPI specs with the project’s gen-openapi command.

Source: Coding guidelines

coderabbitai · 2026-06-24T11:10:31Z

+    if let Err(error) = state
        .evm
        .sessions
        .logout(token.as_deref(), state.clock.now())
-        .await;
+        .await
+    {
+        tracing::warn!(auth_event = "evm_logout_failed", %error, "EVM logout revoke failed");
+    }


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Propagate EVM logout revoke failures.

This now reports logout success even if the backing session store failed to revoke the token, leaving the session usable until expiry. Keep the previous fail-closed behavior.

Suggested fix

- if let Err(error) = state - .evm - .sessions - .logout(token.as_deref(), state.clock.now()) - .await - { - tracing::warn!(auth_event = "evm_logout_failed", %error, "EVM logout revoke failed"); - } + state + .evm + .sessions + .logout(token.as_deref(), state.clock.now()) + .await?;

If the intended HTTP contract is to always return success, update the OpenAPI response documentation accordingly. As per coding guidelines, crates/server/src/api/**/*.rs: “After any HTTP contract change, regenerate committed specs with cargo run --features evm --bin gen-openapi -- docs.”

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if let Err(error) = state

.evm

.sessions

.logout(token.as_deref(), state.clock.now())

.await;

.await

{

tracing::warn!(auth_event = "evm_logout_failed", %error, "EVM logout revoke failed");

}

state

.evm

.sessions

.logout(token.as_deref(), state.clock.now())

.await?;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/server/src/api/evm.rs` around lines 183 - 190, The EVM logout handler currently swallows failures from state.evm.sessions.logout and still returns success; restore the previous fail-closed behavior by propagating that error out of the logout path instead of only logging it with tracing::warn!(auth_event = "evm_logout_failed", ...). Update the logout endpoint logic in evm.rs so a revoke failure results in an error response from the handler, and if you intentionally keep success responses then also update the OpenAPI response docs/specs to match the new HTTP contract.

Source: Coding guidelines

coderabbitai · 2026-06-24T11:10:31Z

+        let coordination = self.coordination;
+        let coordination_mode = coordination
+            .as_ref()
+            .map(|handles| handles.mode)
+            .unwrap_or(crate::coordination::CoordinationMode::SingleProcess);
+        let leader: Arc<dyn crate::coordination::LeaderElector> = coordination
+            .as_ref()
+            .map(|handles| handles.leader.clone())
+            .unwrap_or_else(|| {
+                Arc::new(crate::coordination::AlwaysLeader::new(
+                    crate::coordination::CANONICALIZATION_LEASE,
+                    "single-process",
+                ))
+            });
        let dashboard = match self.dashboard {
            Some(dashboard) => dashboard,
-            None => Arc::new(DashboardState::from_env_for_network(network_type).await?),
+            None => match coordination.as_ref() {
+                Some(handles) => Arc::new(
+                    DashboardState::from_env_for_network_with_stores(
+                        network_type,
+                        handles.operator_sessions.clone(),
+                        handles.operator_challenges.clone(),
+                    )
+                    .await?,
+                ),
+                // Fail closed: the Postgres backend must not silently fall back to
+                // per-process dashboard state. Coordination handles are populated
+                // by StorageMetadataBuilder::build(); their absence here means a
+                // manual builder skipped them.
+                None if storage.kind() == crate::storage::StorageType::Postgres => {
+                    return Err(
+                        "Postgres storage requires coordination handles for shared dashboard \
+                         sessions/challenges; call .coordination(...) (populated by \
+                         StorageMetadataBuilder::build())"
+                            .to_string(),
+                    );
+                }
+                None => Arc::new(DashboardState::from_env_for_network(network_type).await?),
+            },
        };
        #[cfg(feature = "evm")]
-        let evm = Arc::new(EvmAppState::from_env().await?);
+        let evm = {
+            let sessions = match coordination.as_ref() {
+                Some(handles) => crate::evm::EvmSessionState::new(
+                    handles.evm_sessions.clone(),
+                    handles.evm_challenges.clone(),
+                ),
+                None => crate::evm::EvmSessionState::default(),
+            };
+            Arc::new(EvmAppState::from_env_with_sessions(sessions).await?)
+        };


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Fail closed before accepting any Postgres build without coordination.

Line 484 only rejects missing coordination when self.dashboard is None. A manual Postgres builder that supplies a custom dashboard still falls through with AlwaysLeader, so multiple replicas can run canonicalization as leaders and use non-shared session state. Move the Postgres/coordination check immediately after let coordination = self.coordination;.

Proposed fix

let ack = self.ack.ok_or("AckRegistry not set. Use .ack(...)")?; let coordination = self.coordination; + if coordination.is_none() && storage.kind() == crate::storage::StorageType::Postgres { + return Err( + "Postgres storage requires coordination handles for shared sessions/challenges \ + and canonicalization leadership; call .coordination(...) (populated by \ + StorageMetadataBuilder::build())" + .to_string(), + ); + } let coordination_mode = coordination .as_ref() .map(|handles| handles.mode) .unwrap_or(crate::coordination::CoordinationMode::SingleProcess);

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

let coordination = self.coordination;

let coordination_mode = coordination

.as_ref()

.map(|handles| handles.mode)

.unwrap_or(crate::coordination::CoordinationMode::SingleProcess);

let leader: Arc<dyn crate::coordination::LeaderElector> = coordination

.as_ref()

.map(|handles| handles.leader.clone())

.unwrap_or_else(|| {

Arc::new(crate::coordination::AlwaysLeader::new(

crate::coordination::CANONICALIZATION_LEASE,

"single-process",

))

});

let dashboard = match self.dashboard {

Some(dashboard) => dashboard,

None => Arc::new(DashboardState::from_env_for_network(network_type).await?),

None => match coordination.as_ref() {

Some(handles) => Arc::new(

DashboardState::from_env_for_network_with_stores(

network_type,

handles.operator_sessions.clone(),

handles.operator_challenges.clone(),

)

.await?,

),

// Fail closed: the Postgres backend must not silently fall back to

// per-process dashboard state. Coordination handles are populated

// by StorageMetadataBuilder::build(); their absence here means a

// manual builder skipped them.

None if storage.kind() == crate::storage::StorageType::Postgres => {

return Err(

"Postgres storage requires coordination handles for shared dashboard \

sessions/challenges; call .coordination(...) (populated by \

StorageMetadataBuilder::build())"

.to_string(),

);

}

None => Arc::new(DashboardState::from_env_for_network(network_type).await?),

},

};

#[cfg(feature = "evm")]

let evm = Arc::new(EvmAppState::from_env().await?);

let evm = {

let sessions = match coordination.as_ref() {

Some(handles) => crate::evm::EvmSessionState::new(

handles.evm_sessions.clone(),

handles.evm_challenges.clone(),

),

None => crate::evm::EvmSessionState::default(),

};

Arc::new(EvmAppState::from_env_with_sessions(sessions).await?)

};

let coordination = self.coordination;

if coordination.is_none() && storage.kind() == crate::storage::StorageType::Postgres {

return Err(

"Postgres storage requires coordination handles for shared sessions/challenges \

and canonicalization leadership; call .coordination(...) (populated by \

StorageMetadataBuilder::build())"

.to_string(),

);

}

let coordination_mode = coordination

.as_ref()

.map(|handles| handles.mode)

.unwrap_or(crate::coordination::CoordinationMode::SingleProcess);

let leader: Arc<dyn crate::coordination::LeaderElector> = coordination

.as_ref()

.map(|handles| handles.leader.clone())

.unwrap_or_else(|| {

Arc::new(crate::coordination::AlwaysLeader::new(

crate::coordination::CANONICALIZATION_LEASE,

"single-process",

))

});

let dashboard = match self.dashboard {

Some(dashboard) => dashboard,

None => match coordination.as_ref() {

Some(handles) => Arc::new(

DashboardState::from_env_for_network_with_stores(

network_type,

handles.operator_sessions.clone(),

handles.operator_challenges.clone(),

)

.await?,

),

// Fail closed: the Postgres backend must not silently fall back to

// per-process dashboard state. Coordination handles are populated

// by StorageMetadataBuilder::build(); their absence here means a

// manual builder skipped them.

None if storage.kind() == crate::storage::StorageType::Postgres => {

return Err(

"Postgres storage requires coordination handles for shared dashboard \

sessions/challenges; call .coordination(...) (populated by \

StorageMetadataBuilder::build())"

.to_string(),

);

}

None => Arc::new(DashboardState::from_env_for_network(network_type).await?),

},

};

#[cfg(feature = "evm")]

let evm = {

let sessions = match coordination.as_ref() {

Some(handles) => crate::evm::EvmSessionState::new(

handles.evm_sessions.clone(),

handles.evm_challenges.clone(),

),

None => crate::evm::EvmSessionState::default(),

};

Arc::new(EvmAppState::from_env_with_sessions(sessions).await?)

};

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/server/src/builder/mod.rs` around lines 455 - 505, The Postgres coordination guard in the builder is too late because it only runs inside the dashboard branch, allowing a manual Postgres build with a custom dashboard to still proceed with AlwaysLeader and per-process state. Move the missing-coordination check in the builder flow immediately after `let coordination = self.coordination;` in `crates/server/src/builder/mod.rs`, before `coordination_mode`, `leader`, or `dashboard` are derived, so any Postgres build without coordination handles fails closed regardless of how `dashboard` is provided.

coderabbitai · 2026-06-24T11:10:31Z

+    match std::env::var(ENV_GUARDIAN_ENV) {
+        Ok(value) => Ok(value.eq_ignore_ascii_case(PROD_ENV)),
+        Err(std::env::VarError::NotPresent) => Ok(false),


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Trim GUARDIAN_ENV before applying production guards.

GUARDIAN_ENV="prod " currently evaluates as non-prod, which can skip prod-only startup guards such as rejecting filesystem-backed deployments.

Proposed fix

- Ok(value) => Ok(value.eq_ignore_ascii_case(PROD_ENV)), + Ok(value) => Ok(value.trim().eq_ignore_ascii_case(PROD_ENV)),

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

match std::env::var(ENV_GUARDIAN_ENV) {

Ok(value) => Ok(value.eq_ignore_ascii_case(PROD_ENV)),

Err(std::env::VarError::NotPresent) => Ok(false),

match std::env::var(ENV_GUARDIAN_ENV) {

Ok(value) => Ok(value.trim().eq_ignore_ascii_case(PROD_ENV)),

Err(std::env::VarError::NotPresent) => Ok(false),

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/server/src/config/stage.rs` around lines 9 - 11, `is_prod` currently checks ENV_GUARDIAN_ENV directly, so values like "prod " are treated as non-production and bypass startup guards. Update the logic in `stage.rs` to trim the environment value before comparing it in `eq_ignore_ascii_case(PROD_ENV)`, keeping the existing handling for `VarError::NotPresent` and other errors unchanged.

coderabbitai · 2026-06-24T11:10:32Z

+a time. Terraform also sets `GUARDIAN_MAX_REPLICAS` from
+`effective_server_autoscaling_max_capacity` (`max(desired, 6)`) so rate limits
+are partitioned across the fleet, and requires `GUARDIAN_DASHBOARD_CURSOR_SECRET`.


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Use effective_guardian_max_replicas wording to match actual Terraform behavior.

This currently implies a direct mapping from autoscaling max only. Please document that GUARDIAN_MAX_REPLICAS is set from the effective value (default from autoscaling max, but overrideable and clamped upward), so operator guidance matches the deployed contract.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/SERVER_AWS_DEPLOY.md` around lines 556 - 558, Update the Terraform/runtime docs to use the exact `effective_guardian_max_replicas` terminology and clarify the `GUARDIAN_MAX_REPLICAS` contract. In the `SERVER_AWS_DEPLOY` guidance, explain that `GUARDIAN_MAX_REPLICAS` comes from the effective value (derived from `effective_server_autoscaling_max_capacity`, can be overridden, and is clamped upward) rather than implying a direct autoscaling-max-only mapping, so the wording matches the deployed behavior.

coderabbitai · 2026-06-24T11:10:32Z

+variable "guardian_max_replicas" {
+  description = <<-EOT
+    Optional override for GUARDIAN_MAX_REPLICAS, the maximum replica capacity the
+    server divides global rate limits by. Defaults to the effective autoscaling
+    max capacity. Drives rate-limit partitioning only (coordination mode is
+    backend-derived). A value below the real max lets the aggregate exceed the
+    global limit, so an explicit override must be >= the autoscaling max.
+  EOT
+  type        = number
+  default     = null
+
+  validation {
+    condition     = var.guardian_max_replicas == null || var.guardian_max_replicas >= 1
+    error_message = "guardian_max_replicas must be >= 1 when set."


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Restrict guardian_max_replicas to whole numbers.

type = number currently allows values like 2.5. That gets stringified in infra/ecs.tf and then RateLimitConfig::from_env parses it as u32; a fractional value fails parsing and silently falls back to 1, which disables the intended fleet-wide partitioning and makes the aggregate limit too loose.

Suggested fix

variable "guardian_max_replicas" { @@ validation { - condition = var.guardian_max_replicas == null || var.guardian_max_replicas >= 1 - error_message = "guardian_max_replicas must be >= 1 when set." + condition = var.guardian_max_replicas == null || ( + var.guardian_max_replicas >= 1 && + floor(var.guardian_max_replicas) == var.guardian_max_replicas + ) + error_message = "guardian_max_replicas must be a whole number >= 1 when set." } }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

variable "guardian_max_replicas" {

description = <<-EOT

Optional override for GUARDIAN_MAX_REPLICAS, the maximum replica capacity the

server divides global rate limits by. Defaults to the effective autoscaling

max capacity. Drives rate-limit partitioning only (coordination mode is

backend-derived). A value below the real max lets the aggregate exceed the

global limit, so an explicit override must be >= the autoscaling max.

EOT

type = number

default = null

validation {

condition = var.guardian_max_replicas == null || var.guardian_max_replicas >= 1

error_message = "guardian_max_replicas must be >= 1 when set."

variable "guardian_max_replicas" {

description = <<-EOT

Optional override for GUARDIAN_MAX_REPLICAS, the maximum replica capacity the

server divides global rate limits by. Defaults to the effective autoscaling

max capacity. Drives rate-limit partitioning only (coordination mode is

backend-derived). A value below the real max lets the aggregate exceed the

global limit, so an explicit override must be >= the autoscaling max.

EOT

type = number

default = null

validation {

condition = var.guardian_max_replicas == null || (

var.guardian_max_replicas >= 1 &&

floor(var.guardian_max_replicas) == var.guardian_max_replicas

)

error_message = "guardian_max_replicas must be a whole number >= 1 when set."

}

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infra/variables.tf` around lines 305 - 318, The guardian_max_replicas variable currently accepts fractional numbers, but infra/ecs.tf stringifies it and RateLimitConfig::from_env later parses it as u32, so non-integers can fall back to 1 and break rate-limit partitioning. Update the validation on guardian_max_replicas in variables.tf to require whole numbers (and keep the existing >= 1 rule), so only integer values are allowed. Use the existing guardian_max_replicas variable block and its validation stanza as the place to enforce this.

coderabbitai · 2026-06-24T11:10:32Z

+  where `local.effective_guardian_max_replicas = var.guardian_max_replicas != null ? var.guardian_max_replicas : local.effective_server_autoscaling_max_capacity`
+  (new `var.guardian_max_replicas` defaults to `null`, i.e. derive from max
+  capacity; operators may override).


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

The effective_guardian_max_replicas formula here is missing the upward clamp.

Please update this expression to reflect the implemented contract (explicit override cannot go below autoscaling max). Right now the example allows a too-low override unless separately validated, which conflicts with the runbook behavior.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@speckit/features/010-horizontal-scaling/contracts/config-contract.md` around lines 40 - 42, The `effective_guardian_max_replicas` example in the config contract is missing the upward clamp and should match the implemented behavior. Update the expression near `local.effective_guardian_max_replicas` so it preserves the explicit `var.guardian_max_replicas` override only when it is at least the autoscaling max capacity, otherwise falls back to `local.effective_server_autoscaling_max_capacity`; this keeps the documented contract aligned with the runbook and the actual validation logic.

coderabbitai · 2026-06-24T11:10:32Z

+All timestamps are `TIMESTAMPTZ` and all expiry comparisons use the database
+clock (`now()`), giving a single authoritative clock across replicas.
+
+### Migration concurrency (multi-replica startup) — REQUIRED


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Fix the heading level jump.

This starts at ### directly under the H1 section, which trips markdownlint MD001 and makes the outline inconsistent.

Suggested fix

-### Migration concurrency (multi-replica startup) — REQUIRED +## Migration concurrency (multi-replica startup) — REQUIRED

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

### Migration concurrency (multi-replica startup) — REQUIRED

## Migration concurrency (multi-replica startup) — REQUIRED

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 14-14: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@speckit/features/010-horizontal-scaling/data-model.md` at line 14, The document has a heading level jump because the “Migration concurrency (multi-replica startup) — REQUIRED” section uses a third-level heading directly under the H1. Update that heading in the data-model markdown to the correct level so the outline stays consistent and markdownlint MD001 is satisfied.

Source: Linters/SAST tools

coderabbitai · 2026-06-24T11:10:32Z

+- The shipped production image is built with the Postgres storage backend, so a
+  shared relational database is available to replicas and is the natural shared
+  coordination/state store for sessions, challenges, leadership, cursors, and
+  rate-limit counters. No new infrastructure component (e.g. a separate cache or
+  queue) is assumed to be mandatory; if one is proposed it will be justified in


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Remove rate-limit counters from this assumption.

This sentence conflicts with FR-010 and the later dependency text, which explicitly say rate limiting stays per-process and does not add a shared-store dependency. Leaving both statements in the spec makes the contract ambiguous.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@speckit/features/010-horizontal-scaling/spec.md` around lines 413 - 417, The shared-state assumption in the horizontal-scaling spec incorrectly includes rate-limit counters, which conflicts with FR-010 and the later dependency text. Update the sentence in the spec so the Postgres-backed shared store is described for sessions, challenges, leadership, and cursors only, and remove any mention of rate-limit counters from that shared coordination/state store assumption.

codecov-commenter · 2026-06-24T11:27:51Z

Codecov Report

❌ Patch coverage is 72.76657% with 189 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.47%. Comparing base (d39aa17) to head (7e4de78).
⚠️ Report is 4 commits behind head on main.

❌ Your patch status has failed because the patch coverage (72.76%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #301      +/-   ##
==========================================
- Coverage   76.88%   76.47%   -0.42%     
==========================================
  Files         153      159       +6     
  Lines       27382    28274     +892     
==========================================
+ Hits        21054    21622     +568     
- Misses       6328     6652     +324

Continue to review full report in Codecov by Harness.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d39aa17...7e4de78. Read the comment docs.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

- Attribute cursor-secret enforcement to the Compose ${VAR:?} expansion rather than a server-side prod guard (the server cursor secret is optional on main; the hard requirement lands with #301). - Production compose: require an explicit GUARDIAN_VERSION (drop the :latest default) and bind the metrics port to loopback (127.0.0.1:9464). - Track B smoke: drop the unconfirmed "storage encryption" log grep; rely on ECDSA-signer-ready + clean startup. - Remove the design-spec artifact from the PR (brainstorming doc, not repo content). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

zeljkoX added 4 commits June 22, 2026 08:28

feat: Scalability improvements

26d66d1

chore: Partial

67ae96e

feat: Scalability implementation

2bbd261

chore: Normalize spec files

7e4de78

zeljkoX requested a review from Copilot June 24, 2026 10:55

github-actions Bot added the cla: allowlist label Jun 24, 2026

Copilot started reviewing on behalf of zeljkoX June 24, 2026 10:55 View session

Copilot AI reviewed Jun 24, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

zeljkoX mentioned this pull request Jun 24, 2026

docs: Add production guide #302

Draft

	### Migration concurrency (multi-replica startup) — REQUIRED
	## Migration concurrency (multi-replica startup) — REQUIRED

Conversation

zeljkoX commented Jun 24, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented Jun 24, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

zeljkoX commented Jun 24, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 24, 2026 •

edited

Loading