Skip to content

feat: 010 scalability improvements#301

Draft
zeljkoX wants to merge 4 commits into
mainfrom
010-scalability-improvements
Draft

feat: 010 scalability improvements#301
zeljkoX wants to merge 4 commits into
mainfrom
010-scalability-improvements

Conversation

@zeljkoX

@zeljkoX zeljkoX commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

  • New Features

    • Added horizontal-scaling support for shared sessions, login challenges, and worker coordination across replicas.
    • Introduced fleet-wide rate limiting with per-replica partitioning and new startup visibility into coordination mode.
  • Bug Fixes

    • Logout flows now continue to succeed even if session revocation hits a temporary backend error.
    • Production startup now fails fast when required shared settings are missing or the filesystem backend is used.
  • Documentation

    • Updated configuration, deployment, and ops runbooks for multi-replica production setups.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8d1145e5-746a-42ca-ba01-86cb4e02a207

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • ✅ Review completed - (🔄 Check again to review again)
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 010-scalability-improvements

Comment @coderabbitai help to get the list of available commands.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements the first slice of feature 010 horizontal scaling correctness for the Guardian server by moving previously per-process coordination state (auth sessions/challenges and canonicalization ownership) into shared Postgres-backed coordination, adding per-replica rate-limit partitioning keyed by an infra-sourced max replica capacity, and documenting the required HA configuration.

Changes:

  • Introduces a coordination subsystem with in-memory (filesystem/dev) and Postgres (prod) implementations for sessions, challenges, and canonicalization leadership.
  • Updates canonicalization to run under a shared lease with renewal + fencing checks, and adds periodic sweeps for expired sessions/challenges.
  • Wires GUARDIAN_MAX_REPLICAS through Terraform and documents HA/runbook guidance + prod-stage startup guards.

Reviewed changes

Copilot reviewed 47 out of 48 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
speckit/features/010-horizontal-scaling/spec.md New feature spec describing HA correctness goals, scenarios, and requirements.
speckit/features/010-horizontal-scaling/data-model.md Defines Postgres tables + concurrency approach (advisory lock) for replica-safe coordination.
speckit/features/010-horizontal-scaling/contracts/db-schema.md Migration contract for new coordination tables + required atomic operations.
speckit/features/010-horizontal-scaling/contracts/coordination-traits.md Contract for new internal coordination traits and their selection rules.
speckit/features/010-horizontal-scaling/contracts/config-contract.md Contract for env vars and prod-stage startup guards/log line for HA mode.
packages/miden-multisig-client/test-results/.last-run.json Adds a test-results artifact file under the TS multisig client.
infra/variables.tf Adds guardian_max_replicas Terraform variable and validation.
infra/ecs.tf Injects GUARDIAN_MAX_REPLICAS into the ECS task definition environment.
infra/data.tf Computes effective_guardian_max_replicas (clamped to autoscaling max capacity).
docs/SERVER_AWS_DEPLOY.md Adds a section explaining HA behavior for the prod profile and links the runbook.
docs/runbooks/horizontal-scaling.md New operator runbook for multi-replica HA deployments and trade-offs/failure modes.
docs/CONFIGURATION.md Documents GUARDIAN_MAX_REPLICAS, prod-stage guards, and HA behavior notes.
crates/server/src/storage/postgres.rs Serializes Diesel migrations across replicas using a Postgres advisory lock.
crates/server/src/schema.rs Adds Diesel schema definitions for new coordination tables.
crates/server/src/middleware/rate_limit.rs Partitions rate limits by GUARDIAN_MAX_REPLICAS + adds tests and warnings.
crates/server/src/main.rs Threads coordination handles from storage builder into server builder.
crates/server/src/lib.rs Exposes new config and coordination modules.
crates/server/src/jobs/canonicalization/worker.rs Acquires/renews a shared lease and cancels work if leadership is lost.
crates/server/src/jobs/canonicalization/processor.rs Adds fencing checks and cooperative cancellation checkpoints during canonicalization.
crates/server/src/evm/session.rs Moves EVM auth challenges/sessions to coordination stores; adds sweep support.
crates/server/src/evm/mod.rs Allows constructing EVM app state with injected session/challenge stores.
crates/server/src/dashboard/types.rs Removes now-obsolete in-memory session/challenge record types.
crates/server/src/dashboard/state.rs Moves operator challenges/sessions to coordination stores; adds prod cursor-secret guard and sweeps.
crates/server/src/coordination/session_store.rs New SessionStore trait + in-memory implementation.
crates/server/src/coordination/postgres/session_store.rs New Postgres SessionStore implementation backed by auth_sessions.
crates/server/src/coordination/postgres/mod.rs New Postgres coordination module utilities and exports.
crates/server/src/coordination/postgres/lease.rs New Postgres lease elector implementation backed by worker_leases.
crates/server/src/coordination/postgres/challenge_store.rs New Postgres ChallengeStore implementation backed by auth_challenges.
crates/server/src/coordination/mod.rs New coordination handle wiring + backend-derived coordination mode selection.
crates/server/src/coordination/leader.rs New LeaderElector trait + AlwaysLeader in-memory implementation.
crates/server/src/coordination/challenge_store.rs New ChallengeStore trait + in-memory implementation + payload encoding/decoding.
crates/server/src/config/stage.rs Centralizes GUARDIAN_ENV=prod detection used by startup guards.
crates/server/src/config/mod.rs New config module root.
crates/server/src/builder/storage.rs Returns CoordinationHandles from storage builder (Postgres vs in-memory).
crates/server/src/builder/startup.rs Logs a single “coordination” line describing resolved mode/backend/stage/max_replicas/secret source.
crates/server/src/builder/mod.rs Wires coordination into dashboard/EVM state construction and adds prod guard for rate-limit partitioning to zero.
crates/server/src/builder/handle.rs Starts canonicalization with a leader handle and adds a periodic session/challenge sweep worker.
crates/server/src/api/evm.rs Adjusts EVM logout to handle possible revoke errors (currently logs + continues).
crates/server/src/api/dashboard.rs Adjusts operator logout to handle possible revoke errors (currently logs + continues).
crates/server/src/ack/mod.rs Reuses the new config::stage::is_prod() helper for ACK secret selection.
crates/server/migrations/2026-06-24-000001_worker_leases/up.sql New migration creating worker_leases.
crates/server/migrations/2026-06-24-000001_worker_leases/down.sql New migration dropping worker_leases.
crates/server/migrations/2026-06-23-000002_auth_challenges/up.sql New migration creating auth_challenges + indexes.
crates/server/migrations/2026-06-23-000002_auth_challenges/down.sql New migration dropping auth_challenges + indexes.
crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql New migration creating auth_sessions + indexes.
crates/server/migrations/2026-06-23-000001_auth_sessions/down.sql New migration dropping auth_sessions + indexes.
crates/server/Cargo.toml Adds tokio-util dependency for cancellation support in canonicalization.
Cargo.lock Updates lockfile for new dependency.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 182 to +189
let token = extract_cookie(&headers, state.dashboard.cookie_name());
state
if let Err(error) = state
.dashboard
.logout(token.as_deref(), state.clock.now())
.await;
.await
{
tracing::warn!(auth_event = "logout_failed", %error, "Operator logout revoke failed");
}
Comment on lines 182 to +190
let token = extract_cookie(&headers, state.evm.sessions.cookie_name());
state
if let Err(error) = state
.evm
.sessions
.logout(token.as_deref(), state.clock.now())
.await;
.await
{
tracing::warn!(auth_event = "evm_logout_failed", %error, "EVM logout revoke failed");
}
Comment on lines +426 to +431
if !cfg!(test) {
tracing::warn!(
"dashboard cursor secret not configured; generating ephemeral per-process \
secret. Multi-replica deployments must set \
GUARDIAN_DASHBOARD_CURSOR_SECRET to a stable shared 32-byte hex value."
);
Comment thread infra/variables.tf
Comment on lines +305 to +314
variable "guardian_max_replicas" {
description = <<-EOT
Optional override for GUARDIAN_MAX_REPLICAS, the maximum replica capacity the
server divides global rate limits by. Defaults to the effective autoscaling
max capacity. Drives rate-limit partitioning only (coordination mode is
backend-derived). A value below the real max lets the aggregate exceed the
global limit, so an explicit override must be >= the autoscaling max.
EOT
type = number
default = null
Comment on lines +1 to +4
{
"status": "passed",
"failedTests": []
} No newline at end of file
Comment thread infra/variables.tf
Comment on lines +316 to +319
validation {
condition = var.guardian_max_replicas == null || var.guardian_max_replicas >= 1
error_message = "guardian_max_replicas must be >= 1 when set."
}

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/server/src/jobs/canonicalization/processor.rs (2)

359-400: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Avoid partial canonicalization when the lease drops mid-sequence.

submit_state(...) commits before the second lease check at Line 400. If the lease is lost after Line 359 but before Line 400, the processor aborts with the account state advanced while the delta remains non-canonical. This needs an atomic, fenced write path for state + canonical delta + metadata, or an explicit idempotent recovery strategy.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/server/src/jobs/canonicalization/processor.rs` around lines 359 - 400,
The canonicalization flow in `processor.rs` can leave state partially applied
because `submit_state`, auth syncing, and the canonical delta update happen
before the final lease validation. Update the `canonicalization` path around
`ensure_lease_held`, `submit_state`, and `update_auth` so the state write and
delta canonicalization are fenced atomically under a single lease check, or add
an idempotent recovery mechanism that safely reconciles `updated_state`,
`canonical_delta`, and metadata if the lease is lost mid-sequence. Ensure the
fix keeps `submit_state`, `update_auth`, and the final `DeltaStatus::canonical`
transition consistent with one another.

266-280: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Fence metadata writes as well as storage writes.

set_has_pending_candidate(...) and update_auth(...) are state-mutating writes but are not guarded immediately before execution. A replica can pass an earlier fence, lose the lease, then still mutate metadata. Add ensure_lease_held(&delta).await? directly before each metadata mutation, or move these writes into a single fenced storage operation.

Also applies to: 383-386, 409-412

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/server/src/jobs/canonicalization/processor.rs` around lines 266 - 280,
The discard path in processor.rs mutates metadata after a storage delete without
re-checking the lease, so a replica can stale-write state after losing
ownership. In the discard flow around ensure_lease_held, call
ensure_lease_held(&delta).await? immediately before set_has_pending_candidate,
and apply the same fencing to every other state mutation in this file,
especially update_auth in the other affected branches, or consolidate those
writes into one fenced operation.
🧹 Nitpick comments (1)
crates/server/src/dashboard/state.rs (1)

871-901: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Make the digest-keying test observe the inserted store key.

This test would still pass if the implementation regressed to storing and looking up plaintext tokens, because it only checks session_digest(token) != token and then authenticates through the same production path. Inject a small recording SessionStore and assert insert(...) receives session_digest(token), never the raw token bytes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/server/src/dashboard/state.rs` around lines 871 - 901, The
digest-keying test currently only proves the digest differs from the token and
that authentication succeeds, so it can miss a regression where plaintext tokens
are stored. Update the test around
session_is_keyed_by_digest_not_plaintext_token to use a small recording
SessionStore (or equivalent test double) so you can inspect the key passed into
insert and confirm it matches crate::secret::session_digest(token) rather than
the raw token bytes. Keep the existing round-trip check, but add a direct
assertion on the recorded inserted key to verify DashboardState::verify stores
only the digest.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql`:
- Around line 7-16: The auth_sessions table is still keyed only by token_digest,
so realm-scoped sessions can collide across operator and EVM namespaces. Update
the migration to make the primary key composite on (realm, token_digest), then
mirror that change in crates/server/src/schema.rs and any code or SQL that
assumes a single-column key, such as auth session insert/delete/load paths, so
Diesel and the database agree on the new realm-scoped identifier.

In `@crates/server/src/api/dashboard.rs`:
- Around line 183-188: The logout flow in the dashboard API currently logs
revoke failures and still reports success, so it should fail closed instead.
Update the handler in the logout path that calls state.dashboard.logout(...) to
return an error response when revocation fails instead of continuing to a
success payload, and make sure the response contract reflects this change. If
the documented responses change, update the utoipa annotations on the relevant
endpoint and regenerate the committed OpenAPI specs with the project’s
gen-openapi command.

In `@crates/server/src/api/evm.rs`:
- Around line 183-190: The EVM logout handler currently swallows failures from
state.evm.sessions.logout and still returns success; restore the previous
fail-closed behavior by propagating that error out of the logout path instead of
only logging it with tracing::warn!(auth_event = "evm_logout_failed", ...).
Update the logout endpoint logic in evm.rs so a revoke failure results in an
error response from the handler, and if you intentionally keep success responses
then also update the OpenAPI response docs/specs to match the new HTTP contract.

In `@crates/server/src/builder/mod.rs`:
- Around line 455-505: The Postgres coordination guard in the builder is too
late because it only runs inside the dashboard branch, allowing a manual
Postgres build with a custom dashboard to still proceed with AlwaysLeader and
per-process state. Move the missing-coordination check in the builder flow
immediately after `let coordination = self.coordination;` in
`crates/server/src/builder/mod.rs`, before `coordination_mode`, `leader`, or
`dashboard` are derived, so any Postgres build without coordination handles
fails closed regardless of how `dashboard` is provided.

In `@crates/server/src/config/stage.rs`:
- Around line 9-11: `is_prod` currently checks ENV_GUARDIAN_ENV directly, so
values like "prod " are treated as non-production and bypass startup guards.
Update the logic in `stage.rs` to trim the environment value before comparing it
in `eq_ignore_ascii_case(PROD_ENV)`, keeping the existing handling for
`VarError::NotPresent` and other errors unchanged.

In `@crates/server/src/coordination/postgres/challenge_store.rs`:
- Around line 76-121: The issue is that concurrent issue() calls in
ChallengeStore::issue can bypass the per-principal cap because each transaction
sees stale outstanding counts before commit. Add a transaction-scoped lock keyed
by (realm, principal) at the start of the transaction in
PostgresChallengeStore::issue, before the INSERT/DELETE cleanup sequence, so
only one issuance for the same principal runs at a time. Use the existing
transaction block around conn.transaction and place the lock before the
auth_challenges statements that enforce max_outstanding.

In `@crates/server/src/coordination/postgres/lease.rs`:
- Around line 55-69: The lease flow in acquire/release is resetting fence_token
instead of keeping it strictly increasing, which breaks monotonic fencing across
reacquire. Update the lease handling in the try_acquire logic and the release
method so a holder’s reacquisition always advances fence_token, and avoid
deleting the worker_leases row on release if that would reset the counter. Use
the existing worker_leases upsert and fence_token update logic as the place to
enforce the monotonic increment on every acquisition path.

In `@crates/server/src/middleware/rate_limit.rs`:
- Around line 463-465: The env-mutation lock is not being applied consistently
across the rate-limit test suite, so `test_rate_limit_layer_from_env` can still
race with other tests. Update that test to acquire `ENV_LOCK` before touching
shared environment state, and make sure it also clears `GUARDIAN_MAX_REPLICAS`
along with the existing rate-limit env vars. Review the other env-mutating tests
in `rate_limit.rs` and wrap them with the same `ENV_LOCK` pattern so all
shared-process-environment changes are serialized.

In `@crates/server/src/storage/postgres.rs`:
- Around line 45-51: The migration lock acquisition in the postgres startup path
uses pg_advisory_lock and can block forever, so update the lock logic in the
connection/setup flow around the advisory lock call to fail fast instead of
waiting indefinitely. Use pg_try_advisory_lock with a retry deadline/backoff, or
apply a statement/lock timeout before attempting the lock, and keep the existing
error propagation through the migration lock acquisition code so startup can
recover when the lock is held too long.

In `@docs/CONFIGURATION.md`:
- Line 288: The Multi-replica (HA) configuration row overstates what the prod
Terraform profile currently provides; update the wording in the documentation
table to reflect only the variables actually injected by the ECS task
definition. Use the existing configuration entry for the HA quick-start to make
it clear that `GUARDIAN_MAX_REPLICAS` is supplied, while
`GUARDIAN_DASHBOARD_CURSOR_SECRET` must be set separately and is not wired by
the prod Terraform profile.

In `@docs/SERVER_AWS_DEPLOY.md`:
- Around line 556-558: Update the Terraform/runtime docs to use the exact
`effective_guardian_max_replicas` terminology and clarify the
`GUARDIAN_MAX_REPLICAS` contract. In the `SERVER_AWS_DEPLOY` guidance, explain
that `GUARDIAN_MAX_REPLICAS` comes from the effective value (derived from
`effective_server_autoscaling_max_capacity`, can be overridden, and is clamped
upward) rather than implying a direct autoscaling-max-only mapping, so the
wording matches the deployed behavior.

In `@infra/variables.tf`:
- Around line 305-318: The guardian_max_replicas variable currently accepts
fractional numbers, but infra/ecs.tf stringifies it and
RateLimitConfig::from_env later parses it as u32, so non-integers can fall back
to 1 and break rate-limit partitioning. Update the validation on
guardian_max_replicas in variables.tf to require whole numbers (and keep the
existing >= 1 rule), so only integer values are allowed. Use the existing
guardian_max_replicas variable block and its validation stanza as the place to
enforce this.

In `@speckit/features/010-horizontal-scaling/contracts/config-contract.md`:
- Around line 40-42: The `effective_guardian_max_replicas` example in the config
contract is missing the upward clamp and should match the implemented behavior.
Update the expression near `local.effective_guardian_max_replicas` so it
preserves the explicit `var.guardian_max_replicas` override only when it is at
least the autoscaling max capacity, otherwise falls back to
`local.effective_server_autoscaling_max_capacity`; this keeps the documented
contract aligned with the runbook and the actual validation logic.

In `@speckit/features/010-horizontal-scaling/data-model.md`:
- Line 14: The document has a heading level jump because the “Migration
concurrency (multi-replica startup) — REQUIRED” section uses a third-level
heading directly under the H1. Update that heading in the data-model markdown to
the correct level so the outline stays consistent and markdownlint MD001 is
satisfied.

In `@speckit/features/010-horizontal-scaling/spec.md`:
- Around line 413-417: The shared-state assumption in the horizontal-scaling
spec incorrectly includes rate-limit counters, which conflicts with FR-010 and
the later dependency text. Update the sentence in the spec so the
Postgres-backed shared store is described for sessions, challenges, leadership,
and cursors only, and remove any mention of rate-limit counters from that shared
coordination/state store assumption.

---

Outside diff comments:
In `@crates/server/src/jobs/canonicalization/processor.rs`:
- Around line 359-400: The canonicalization flow in `processor.rs` can leave
state partially applied because `submit_state`, auth syncing, and the canonical
delta update happen before the final lease validation. Update the
`canonicalization` path around `ensure_lease_held`, `submit_state`, and
`update_auth` so the state write and delta canonicalization are fenced
atomically under a single lease check, or add an idempotent recovery mechanism
that safely reconciles `updated_state`, `canonical_delta`, and metadata if the
lease is lost mid-sequence. Ensure the fix keeps `submit_state`, `update_auth`,
and the final `DeltaStatus::canonical` transition consistent with one another.
- Around line 266-280: The discard path in processor.rs mutates metadata after a
storage delete without re-checking the lease, so a replica can stale-write state
after losing ownership. In the discard flow around ensure_lease_held, call
ensure_lease_held(&delta).await? immediately before set_has_pending_candidate,
and apply the same fencing to every other state mutation in this file,
especially update_auth in the other affected branches, or consolidate those
writes into one fenced operation.

---

Nitpick comments:
In `@crates/server/src/dashboard/state.rs`:
- Around line 871-901: The digest-keying test currently only proves the digest
differs from the token and that authentication succeeds, so it can miss a
regression where plaintext tokens are stored. Update the test around
session_is_keyed_by_digest_not_plaintext_token to use a small recording
SessionStore (or equivalent test double) so you can inspect the key passed into
insert and confirm it matches crate::secret::session_digest(token) rather than
the raw token bytes. Keep the existing round-trip check, but add a direct
assertion on the recorded inserted key to verify DashboardState::verify stores
only the digest.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 58f01332-ef5e-48a4-a152-748af84dfb23

📥 Commits

Reviewing files that changed from the base of the PR and between 57a43d1 and 7e4de78.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (47)
  • crates/server/Cargo.toml
  • crates/server/migrations/2026-06-23-000001_auth_sessions/down.sql
  • crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql
  • crates/server/migrations/2026-06-23-000002_auth_challenges/down.sql
  • crates/server/migrations/2026-06-23-000002_auth_challenges/up.sql
  • crates/server/migrations/2026-06-24-000001_worker_leases/down.sql
  • crates/server/migrations/2026-06-24-000001_worker_leases/up.sql
  • crates/server/src/ack/mod.rs
  • crates/server/src/api/dashboard.rs
  • crates/server/src/api/evm.rs
  • crates/server/src/builder/handle.rs
  • crates/server/src/builder/mod.rs
  • crates/server/src/builder/startup.rs
  • crates/server/src/builder/storage.rs
  • crates/server/src/config/mod.rs
  • crates/server/src/config/stage.rs
  • crates/server/src/coordination/challenge_store.rs
  • crates/server/src/coordination/leader.rs
  • crates/server/src/coordination/mod.rs
  • crates/server/src/coordination/postgres/challenge_store.rs
  • crates/server/src/coordination/postgres/lease.rs
  • crates/server/src/coordination/postgres/mod.rs
  • crates/server/src/coordination/postgres/session_store.rs
  • crates/server/src/coordination/session_store.rs
  • crates/server/src/dashboard/state.rs
  • crates/server/src/dashboard/types.rs
  • crates/server/src/evm/mod.rs
  • crates/server/src/evm/session.rs
  • crates/server/src/jobs/canonicalization/processor.rs
  • crates/server/src/jobs/canonicalization/worker.rs
  • crates/server/src/lib.rs
  • crates/server/src/main.rs
  • crates/server/src/middleware/rate_limit.rs
  • crates/server/src/schema.rs
  • crates/server/src/storage/postgres.rs
  • docs/CONFIGURATION.md
  • docs/SERVER_AWS_DEPLOY.md
  • docs/runbooks/horizontal-scaling.md
  • infra/data.tf
  • infra/ecs.tf
  • infra/variables.tf
  • packages/miden-multisig-client/test-results/.last-run.json
  • speckit/features/010-horizontal-scaling/contracts/config-contract.md
  • speckit/features/010-horizontal-scaling/contracts/coordination-traits.md
  • speckit/features/010-horizontal-scaling/contracts/db-schema.md
  • speckit/features/010-horizontal-scaling/data-model.md
  • speckit/features/010-horizontal-scaling/spec.md
💤 Files with no reviewable changes (1)
  • crates/server/src/dashboard/types.rs

Comment on lines +7 to +16
CREATE TABLE auth_sessions (
token_digest BYTEA PRIMARY KEY,
realm TEXT NOT NULL,
subject JSONB NOT NULL,
issued_at TIMESTAMPTZ NOT NULL,
expires_at TIMESTAMPTZ NOT NULL,
-- Set on logout; the row is kept until natural expiry so the revocation
-- is honored fleet-wide for as long as the token would have been valid.
revoked_at TIMESTAMPTZ NULL
);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Make the session key realm-scoped in the database.

The table comment says operator and EVM sessions are isolated by realm, but the actual key is still token_digest alone here, and crates/server/src/schema.rs:111-120 mirrors that shape. If both realms ever mint the same token bytes, one insert will fail even though the contract says they should coexist. Make the primary key (realm, token_digest) so the DB enforces the namespace boundary instead of relying on token randomness.

Suggested direction
 CREATE TABLE auth_sessions (
-    token_digest BYTEA       PRIMARY KEY,
-    realm        TEXT        NOT NULL,
+    realm        TEXT        NOT NULL,
+    token_digest BYTEA       NOT NULL,
     subject      JSONB       NOT NULL,
     issued_at    TIMESTAMPTZ NOT NULL,
     expires_at   TIMESTAMPTZ NOT NULL,
     -- Set on logout; the row is kept until natural expiry so the revocation
     -- is honored fleet-wide for as long as the token would have been valid.
-    revoked_at   TIMESTAMPTZ NULL
+    revoked_at   TIMESTAMPTZ NULL,
+    PRIMARY KEY (realm, token_digest)
 );

This also needs the matching Diesel schema and any SQL built around the current single-column key updated in the same PR.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
CREATE TABLE auth_sessions (
token_digest BYTEA PRIMARY KEY,
realm TEXT NOT NULL,
subject JSONB NOT NULL,
issued_at TIMESTAMPTZ NOT NULL,
expires_at TIMESTAMPTZ NOT NULL,
-- Set on logout; the row is kept until natural expiry so the revocation
-- is honored fleet-wide for as long as the token would have been valid.
revoked_at TIMESTAMPTZ NULL
);
CREATE TABLE auth_sessions (
realm TEXT NOT NULL,
token_digest BYTEA NOT NULL,
subject JSONB NOT NULL,
issued_at TIMESTAMPTZ NOT NULL,
expires_at TIMESTAMPTZ NOT NULL,
-- Set on logout; the row is kept until natural expiry so the revocation
-- is honored fleet-wide for as long as the token would have been valid.
revoked_at TIMESTAMPTZ NULL,
PRIMARY KEY (realm, token_digest)
);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql` around lines
7 - 16, The auth_sessions table is still keyed only by token_digest, so
realm-scoped sessions can collide across operator and EVM namespaces. Update the
migration to make the primary key composite on (realm, token_digest), then
mirror that change in crates/server/src/schema.rs and any code or SQL that
assumes a single-column key, such as auth session insert/delete/load paths, so
Diesel and the database agree on the new realm-scoped identifier.

Comment on lines +183 to +188
if let Err(error) = state
.dashboard
.logout(token.as_deref(), state.clock.now())
.await;
.await
{
tracing::warn!(auth_event = "logout_failed", %error, "Operator logout revoke failed");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Fail closed when session revocation fails.

Returning success: true after a store revoke error can leave the server-side operator session valid until expiry, which is especially risky with shared Postgres sessions. Return an error response instead of reporting a successful logout. If this changes the documented response set, update the utoipa annotations and regenerated specs. As per coding guidelines, crates/server/src/api/**/*.rs: “After any HTTP contract change, regenerate committed specs with cargo run --features evm --bin gen-openapi -- docs.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/server/src/api/dashboard.rs` around lines 183 - 188, The logout flow
in the dashboard API currently logs revoke failures and still reports success,
so it should fail closed instead. Update the handler in the logout path that
calls state.dashboard.logout(...) to return an error response when revocation
fails instead of continuing to a success payload, and make sure the response
contract reflects this change. If the documented responses change, update the
utoipa annotations on the relevant endpoint and regenerate the committed OpenAPI
specs with the project’s gen-openapi command.

Source: Coding guidelines

Comment on lines +183 to +190
if let Err(error) = state
.evm
.sessions
.logout(token.as_deref(), state.clock.now())
.await;
.await
{
tracing::warn!(auth_event = "evm_logout_failed", %error, "EVM logout revoke failed");
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Propagate EVM logout revoke failures.

This now reports logout success even if the backing session store failed to revoke the token, leaving the session usable until expiry. Keep the previous fail-closed behavior.

Suggested fix
-    if let Err(error) = state
-        .evm
-        .sessions
-        .logout(token.as_deref(), state.clock.now())
-        .await
-    {
-        tracing::warn!(auth_event = "evm_logout_failed", %error, "EVM logout revoke failed");
-    }
+    state
+        .evm
+        .sessions
+        .logout(token.as_deref(), state.clock.now())
+        .await?;

If the intended HTTP contract is to always return success, update the OpenAPI response documentation accordingly. As per coding guidelines, crates/server/src/api/**/*.rs: “After any HTTP contract change, regenerate committed specs with cargo run --features evm --bin gen-openapi -- docs.”

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if let Err(error) = state
.evm
.sessions
.logout(token.as_deref(), state.clock.now())
.await;
.await
{
tracing::warn!(auth_event = "evm_logout_failed", %error, "EVM logout revoke failed");
}
state
.evm
.sessions
.logout(token.as_deref(), state.clock.now())
.await?;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/server/src/api/evm.rs` around lines 183 - 190, The EVM logout handler
currently swallows failures from state.evm.sessions.logout and still returns
success; restore the previous fail-closed behavior by propagating that error out
of the logout path instead of only logging it with tracing::warn!(auth_event =
"evm_logout_failed", ...). Update the logout endpoint logic in evm.rs so a
revoke failure results in an error response from the handler, and if you
intentionally keep success responses then also update the OpenAPI response
docs/specs to match the new HTTP contract.

Source: Coding guidelines

Comment on lines +455 to +505
let coordination = self.coordination;
let coordination_mode = coordination
.as_ref()
.map(|handles| handles.mode)
.unwrap_or(crate::coordination::CoordinationMode::SingleProcess);
let leader: Arc<dyn crate::coordination::LeaderElector> = coordination
.as_ref()
.map(|handles| handles.leader.clone())
.unwrap_or_else(|| {
Arc::new(crate::coordination::AlwaysLeader::new(
crate::coordination::CANONICALIZATION_LEASE,
"single-process",
))
});
let dashboard = match self.dashboard {
Some(dashboard) => dashboard,
None => Arc::new(DashboardState::from_env_for_network(network_type).await?),
None => match coordination.as_ref() {
Some(handles) => Arc::new(
DashboardState::from_env_for_network_with_stores(
network_type,
handles.operator_sessions.clone(),
handles.operator_challenges.clone(),
)
.await?,
),
// Fail closed: the Postgres backend must not silently fall back to
// per-process dashboard state. Coordination handles are populated
// by StorageMetadataBuilder::build(); their absence here means a
// manual builder skipped them.
None if storage.kind() == crate::storage::StorageType::Postgres => {
return Err(
"Postgres storage requires coordination handles for shared dashboard \
sessions/challenges; call .coordination(...) (populated by \
StorageMetadataBuilder::build())"
.to_string(),
);
}
None => Arc::new(DashboardState::from_env_for_network(network_type).await?),
},
};
#[cfg(feature = "evm")]
let evm = Arc::new(EvmAppState::from_env().await?);
let evm = {
let sessions = match coordination.as_ref() {
Some(handles) => crate::evm::EvmSessionState::new(
handles.evm_sessions.clone(),
handles.evm_challenges.clone(),
),
None => crate::evm::EvmSessionState::default(),
};
Arc::new(EvmAppState::from_env_with_sessions(sessions).await?)
};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Fail closed before accepting any Postgres build without coordination.

Line 484 only rejects missing coordination when self.dashboard is None. A manual Postgres builder that supplies a custom dashboard still falls through with AlwaysLeader, so multiple replicas can run canonicalization as leaders and use non-shared session state. Move the Postgres/coordination check immediately after let coordination = self.coordination;.

Proposed fix
         let ack = self.ack.ok_or("AckRegistry not set. Use .ack(...)")?;
         let coordination = self.coordination;
+        if coordination.is_none() && storage.kind() == crate::storage::StorageType::Postgres {
+            return Err(
+                "Postgres storage requires coordination handles for shared sessions/challenges \
+                 and canonicalization leadership; call .coordination(...) (populated by \
+                 StorageMetadataBuilder::build())"
+                    .to_string(),
+            );
+        }
         let coordination_mode = coordination
             .as_ref()
             .map(|handles| handles.mode)
             .unwrap_or(crate::coordination::CoordinationMode::SingleProcess);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let coordination = self.coordination;
let coordination_mode = coordination
.as_ref()
.map(|handles| handles.mode)
.unwrap_or(crate::coordination::CoordinationMode::SingleProcess);
let leader: Arc<dyn crate::coordination::LeaderElector> = coordination
.as_ref()
.map(|handles| handles.leader.clone())
.unwrap_or_else(|| {
Arc::new(crate::coordination::AlwaysLeader::new(
crate::coordination::CANONICALIZATION_LEASE,
"single-process",
))
});
let dashboard = match self.dashboard {
Some(dashboard) => dashboard,
None => Arc::new(DashboardState::from_env_for_network(network_type).await?),
None => match coordination.as_ref() {
Some(handles) => Arc::new(
DashboardState::from_env_for_network_with_stores(
network_type,
handles.operator_sessions.clone(),
handles.operator_challenges.clone(),
)
.await?,
),
// Fail closed: the Postgres backend must not silently fall back to
// per-process dashboard state. Coordination handles are populated
// by StorageMetadataBuilder::build(); their absence here means a
// manual builder skipped them.
None if storage.kind() == crate::storage::StorageType::Postgres => {
return Err(
"Postgres storage requires coordination handles for shared dashboard \
sessions/challenges; call .coordination(...) (populated by \
StorageMetadataBuilder::build())"
.to_string(),
);
}
None => Arc::new(DashboardState::from_env_for_network(network_type).await?),
},
};
#[cfg(feature = "evm")]
let evm = Arc::new(EvmAppState::from_env().await?);
let evm = {
let sessions = match coordination.as_ref() {
Some(handles) => crate::evm::EvmSessionState::new(
handles.evm_sessions.clone(),
handles.evm_challenges.clone(),
),
None => crate::evm::EvmSessionState::default(),
};
Arc::new(EvmAppState::from_env_with_sessions(sessions).await?)
};
let coordination = self.coordination;
if coordination.is_none() && storage.kind() == crate::storage::StorageType::Postgres {
return Err(
"Postgres storage requires coordination handles for shared sessions/challenges \
and canonicalization leadership; call .coordination(...) (populated by \
StorageMetadataBuilder::build())"
.to_string(),
);
}
let coordination_mode = coordination
.as_ref()
.map(|handles| handles.mode)
.unwrap_or(crate::coordination::CoordinationMode::SingleProcess);
let leader: Arc<dyn crate::coordination::LeaderElector> = coordination
.as_ref()
.map(|handles| handles.leader.clone())
.unwrap_or_else(|| {
Arc::new(crate::coordination::AlwaysLeader::new(
crate::coordination::CANONICALIZATION_LEASE,
"single-process",
))
});
let dashboard = match self.dashboard {
Some(dashboard) => dashboard,
None => match coordination.as_ref() {
Some(handles) => Arc::new(
DashboardState::from_env_for_network_with_stores(
network_type,
handles.operator_sessions.clone(),
handles.operator_challenges.clone(),
)
.await?,
),
// Fail closed: the Postgres backend must not silently fall back to
// per-process dashboard state. Coordination handles are populated
// by StorageMetadataBuilder::build(); their absence here means a
// manual builder skipped them.
None if storage.kind() == crate::storage::StorageType::Postgres => {
return Err(
"Postgres storage requires coordination handles for shared dashboard \
sessions/challenges; call .coordination(...) (populated by \
StorageMetadataBuilder::build())"
.to_string(),
);
}
None => Arc::new(DashboardState::from_env_for_network(network_type).await?),
},
};
#[cfg(feature = "evm")]
let evm = {
let sessions = match coordination.as_ref() {
Some(handles) => crate::evm::EvmSessionState::new(
handles.evm_sessions.clone(),
handles.evm_challenges.clone(),
),
None => crate::evm::EvmSessionState::default(),
};
Arc::new(EvmAppState::from_env_with_sessions(sessions).await?)
};
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/server/src/builder/mod.rs` around lines 455 - 505, The Postgres
coordination guard in the builder is too late because it only runs inside the
dashboard branch, allowing a manual Postgres build with a custom dashboard to
still proceed with AlwaysLeader and per-process state. Move the
missing-coordination check in the builder flow immediately after `let
coordination = self.coordination;` in `crates/server/src/builder/mod.rs`, before
`coordination_mode`, `leader`, or `dashboard` are derived, so any Postgres build
without coordination handles fails closed regardless of how `dashboard` is
provided.

Comment on lines +9 to +11
match std::env::var(ENV_GUARDIAN_ENV) {
Ok(value) => Ok(value.eq_ignore_ascii_case(PROD_ENV)),
Err(std::env::VarError::NotPresent) => Ok(false),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Trim GUARDIAN_ENV before applying production guards.

GUARDIAN_ENV="prod " currently evaluates as non-prod, which can skip prod-only startup guards such as rejecting filesystem-backed deployments.

Proposed fix
-        Ok(value) => Ok(value.eq_ignore_ascii_case(PROD_ENV)),
+        Ok(value) => Ok(value.trim().eq_ignore_ascii_case(PROD_ENV)),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
match std::env::var(ENV_GUARDIAN_ENV) {
Ok(value) => Ok(value.eq_ignore_ascii_case(PROD_ENV)),
Err(std::env::VarError::NotPresent) => Ok(false),
match std::env::var(ENV_GUARDIAN_ENV) {
Ok(value) => Ok(value.trim().eq_ignore_ascii_case(PROD_ENV)),
Err(std::env::VarError::NotPresent) => Ok(false),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/server/src/config/stage.rs` around lines 9 - 11, `is_prod` currently
checks ENV_GUARDIAN_ENV directly, so values like "prod " are treated as
non-production and bypass startup guards. Update the logic in `stage.rs` to trim
the environment value before comparing it in `eq_ignore_ascii_case(PROD_ENV)`,
keeping the existing handling for `VarError::NotPresent` and other errors
unchanged.

Comment thread docs/SERVER_AWS_DEPLOY.md
Comment on lines +556 to +558
a time. Terraform also sets `GUARDIAN_MAX_REPLICAS` from
`effective_server_autoscaling_max_capacity` (`max(desired, 6)`) so rate limits
are partitioned across the fleet, and requires `GUARDIAN_DASHBOARD_CURSOR_SECRET`.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Use effective_guardian_max_replicas wording to match actual Terraform behavior.

This currently implies a direct mapping from autoscaling max only. Please document that GUARDIAN_MAX_REPLICAS is set from the effective value (default from autoscaling max, but overrideable and clamped upward), so operator guidance matches the deployed contract.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/SERVER_AWS_DEPLOY.md` around lines 556 - 558, Update the
Terraform/runtime docs to use the exact `effective_guardian_max_replicas`
terminology and clarify the `GUARDIAN_MAX_REPLICAS` contract. In the
`SERVER_AWS_DEPLOY` guidance, explain that `GUARDIAN_MAX_REPLICAS` comes from
the effective value (derived from `effective_server_autoscaling_max_capacity`,
can be overridden, and is clamped upward) rather than implying a direct
autoscaling-max-only mapping, so the wording matches the deployed behavior.

Comment thread infra/variables.tf
Comment on lines +305 to +318
variable "guardian_max_replicas" {
description = <<-EOT
Optional override for GUARDIAN_MAX_REPLICAS, the maximum replica capacity the
server divides global rate limits by. Defaults to the effective autoscaling
max capacity. Drives rate-limit partitioning only (coordination mode is
backend-derived). A value below the real max lets the aggregate exceed the
global limit, so an explicit override must be >= the autoscaling max.
EOT
type = number
default = null

validation {
condition = var.guardian_max_replicas == null || var.guardian_max_replicas >= 1
error_message = "guardian_max_replicas must be >= 1 when set."

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Restrict guardian_max_replicas to whole numbers.

type = number currently allows values like 2.5. That gets stringified in infra/ecs.tf and then RateLimitConfig::from_env parses it as u32; a fractional value fails parsing and silently falls back to 1, which disables the intended fleet-wide partitioning and makes the aggregate limit too loose.

Suggested fix
 variable "guardian_max_replicas" {
@@
   validation {
-    condition     = var.guardian_max_replicas == null || var.guardian_max_replicas >= 1
-    error_message = "guardian_max_replicas must be >= 1 when set."
+    condition = var.guardian_max_replicas == null || (
+      var.guardian_max_replicas >= 1 &&
+      floor(var.guardian_max_replicas) == var.guardian_max_replicas
+    )
+    error_message = "guardian_max_replicas must be a whole number >= 1 when set."
   }
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
variable "guardian_max_replicas" {
description = <<-EOT
Optional override for GUARDIAN_MAX_REPLICAS, the maximum replica capacity the
server divides global rate limits by. Defaults to the effective autoscaling
max capacity. Drives rate-limit partitioning only (coordination mode is
backend-derived). A value below the real max lets the aggregate exceed the
global limit, so an explicit override must be >= the autoscaling max.
EOT
type = number
default = null
validation {
condition = var.guardian_max_replicas == null || var.guardian_max_replicas >= 1
error_message = "guardian_max_replicas must be >= 1 when set."
variable "guardian_max_replicas" {
description = <<-EOT
Optional override for GUARDIAN_MAX_REPLICAS, the maximum replica capacity the
server divides global rate limits by. Defaults to the effective autoscaling
max capacity. Drives rate-limit partitioning only (coordination mode is
backend-derived). A value below the real max lets the aggregate exceed the
global limit, so an explicit override must be >= the autoscaling max.
EOT
type = number
default = null
validation {
condition = var.guardian_max_replicas == null || (
var.guardian_max_replicas >= 1 &&
floor(var.guardian_max_replicas) == var.guardian_max_replicas
)
error_message = "guardian_max_replicas must be a whole number >= 1 when set."
}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/variables.tf` around lines 305 - 318, The guardian_max_replicas
variable currently accepts fractional numbers, but infra/ecs.tf stringifies it
and RateLimitConfig::from_env later parses it as u32, so non-integers can fall
back to 1 and break rate-limit partitioning. Update the validation on
guardian_max_replicas in variables.tf to require whole numbers (and keep the
existing >= 1 rule), so only integer values are allowed. Use the existing
guardian_max_replicas variable block and its validation stanza as the place to
enforce this.

Comment on lines +40 to +42
where `local.effective_guardian_max_replicas = var.guardian_max_replicas != null ? var.guardian_max_replicas : local.effective_server_autoscaling_max_capacity`
(new `var.guardian_max_replicas` defaults to `null`, i.e. derive from max
capacity; operators may override).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

The effective_guardian_max_replicas formula here is missing the upward clamp.

Please update this expression to reflect the implemented contract (explicit override cannot go below autoscaling max). Right now the example allows a too-low override unless separately validated, which conflicts with the runbook behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@speckit/features/010-horizontal-scaling/contracts/config-contract.md` around
lines 40 - 42, The `effective_guardian_max_replicas` example in the config
contract is missing the upward clamp and should match the implemented behavior.
Update the expression near `local.effective_guardian_max_replicas` so it
preserves the explicit `var.guardian_max_replicas` override only when it is at
least the autoscaling max capacity, otherwise falls back to
`local.effective_server_autoscaling_max_capacity`; this keeps the documented
contract aligned with the runbook and the actual validation logic.

All timestamps are `TIMESTAMPTZ` and all expiry comparisons use the database
clock (`now()`), giving a single authoritative clock across replicas.

### Migration concurrency (multi-replica startup) — REQUIRED

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Fix the heading level jump.

This starts at ### directly under the H1 section, which trips markdownlint MD001 and makes the outline inconsistent.

Suggested fix
-### Migration concurrency (multi-replica startup) — REQUIRED
+## Migration concurrency (multi-replica startup) — REQUIRED
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
### Migration concurrency (multi-replica startup) — REQUIRED
## Migration concurrency (multi-replica startup) — REQUIRED
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 14-14: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@speckit/features/010-horizontal-scaling/data-model.md` at line 14, The
document has a heading level jump because the “Migration concurrency
(multi-replica startup) — REQUIRED” section uses a third-level heading directly
under the H1. Update that heading in the data-model markdown to the correct
level so the outline stays consistent and markdownlint MD001 is satisfied.

Source: Linters/SAST tools

Comment on lines +413 to +417
- The shipped production image is built with the Postgres storage backend, so a
shared relational database is available to replicas and is the natural shared
coordination/state store for sessions, challenges, leadership, cursors, and
rate-limit counters. No new infrastructure component (e.g. a separate cache or
queue) is assumed to be mandatory; if one is proposed it will be justified in

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Remove rate-limit counters from this assumption.

This sentence conflicts with FR-010 and the later dependency text, which explicitly say rate limiting stays per-process and does not add a shared-store dependency. Leaving both statements in the spec makes the contract ambiguous.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@speckit/features/010-horizontal-scaling/spec.md` around lines 413 - 417, The
shared-state assumption in the horizontal-scaling spec incorrectly includes
rate-limit counters, which conflicts with FR-010 and the later dependency text.
Update the sentence in the spec so the Postgres-backed shared store is described
for sessions, challenges, leadership, and cursors only, and remove any mention
of rate-limit counters from that shared coordination/state store assumption.

@codecov-commenter

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 72.76657% with 189 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.47%. Comparing base (d39aa17) to head (7e4de78).
⚠️ Report is 4 commits behind head on main.

❌ Your patch status has failed because the patch coverage (72.76%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #301      +/-   ##
==========================================
- Coverage   76.88%   76.47%   -0.42%     
==========================================
  Files         153      159       +6     
  Lines       27382    28274     +892     
==========================================
+ Hits        21054    21622     +568     
- Misses       6328     6652     +324     

Continue to review full report in Codecov by Harness.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d39aa17...7e4de78. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

zeljkoX added a commit that referenced this pull request Jun 24, 2026
- Attribute cursor-secret enforcement to the Compose ${VAR:?} expansion rather
  than a server-side prod guard (the server cursor secret is optional on main;
  the hard requirement lands with #301).
- Production compose: require an explicit GUARDIAN_VERSION (drop the :latest
  default) and bind the metrics port to loopback (127.0.0.1:9464).
- Track B smoke: drop the unconfirmed "storage encryption" log grep; rely on
  ECDSA-signer-ready + clean startup.
- Remove the design-spec artifact from the PR (brainstorming doc, not repo
  content).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants