diff --git a/CHANGELOG.md b/CHANGELOG.md index 531f2b7d61..3194df5002 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Security - **`homecore` foundational state-machine review (ADR-127) — one real concurrency bug fixed (state-set TOCTOU dropping/reordering `state_changed` events) + two hardening fixes (entity_id memory-DoS, service-handler panic isolation), each pinned by a fails-on-old test; event-bus lag & lock discipline confirmed clean with evidence.** Beyond-SOTA security+concurrency review of the crate every other HOMECORE module builds on (state store `state.rs`, event bus `bus.rs`, service/entity registries, the `HomeCore` coordinator), un-covered by the ADR-154–159 sweep — a bug here is high-blast-radius. **HC-RACE-01 (state-set TOCTOU, the crux — race/lost-event).** `StateMachine::set` did `get()` (releasing the DashMap shard lock) → compute the next snapshot + the no-op/`last_changed` decision → `insert()` (re-acquiring the lock) → `send()`; the read-modify-write was **not atomic** w.r.t. a concurrent writer on the same entity, contradicting ADR-127 §2.1's promise that "the writer atomically replaces the map entry." A writer that read a **stale `old`** could mis-classify a genuine transition as a no-op and **silently drop its `state_changed` event** (a missed automation trigger) or fire an event whose `new_state` duplicated the previously delivered one (a spurious trigger for any automation keyed on `old_state != new_state`). **Fixed** by holding the shard write-lock across the whole read→decide→insert→fire sequence via `entry()`/`insert_entry()` — `tx.send` is non-blocking, non-async, and never re-enters the map, so firing under the shard lock cannot deadlock and keeps global event order in lock-step with global commit order. Pinned by `concurrent_set_fires_no_duplicate_adjacent_events` (4 writers toggling one entity A/B; asserts no two consecutive fired events carry an identical `new_state` — impossible under correct serialisation; an instrumented probe observed ~93k such duplicate-adjacent events across 200 trials on the racy code, **zero** on the fix; the test fails reliably on the first trial pre-fix). **HC-EID-LEN-01 (unbounded `entity_id`, memory-DoS).** `homecore-api/src/rest.rs` parses untrusted REST path segments straight through `EntityId::parse`; with no length cap an otherwise-valid id (`a.` + many MB of `[a-z0-9_]`) was accepted, and a `POST /api/states/` would persist it into the DashMap state store (permanent growth across distinct ids). **Fixed** by rejecting ids longer than `MAX_ENTITY_ID_LEN` (255, HA-compatible) up front in `parse()`, before any per-char scan, with a new `EntityIdError::TooLong` — fail-closed at the boundary type protects every caller (REST, registry deserialize, automation). Pinned by `entity_id_length_boundary` (exactly-MAX accepted; MAX+1 and a 4 MiB id rejected — oversized parses `Ok` on old code). **HC-SVC-PANIC-01 (service-handler panic not isolated).** `ServiceRegistry::call` already ran handlers **outside** the registry lock (the `Arc` is cloned out of the read guard first → no `RwLock` poisoning, no blocking of other callers — clean), but a panicking handler unwound through `call()` into the caller's task (the task driving the engine). **Hardened** by wrapping the handler future in `AssertUnwindSafe` + `catch_unwind`, converting a panic to `ServiceError::HandlerPanicked`; the registry stays fully usable (a sibling healthy service still returns, the bad service stays registered). Pinned by `panicking_handler_is_isolated_and_registry_survives` (unwinds through `call` on old code). **Dimensions confirmed clean (with evidence, no invented issues):** (1) **event-bus bounds / lag** (the homecore-api WS lag-DoS class) — both `StateMachine` and `EventBus` use **bounded** `tokio::sync::broadcast` (capacity 4,096); a slow subscriber gets a recoverable `Lagged(n)` (drop-oldest + re-sync) while `fire_*` is non-blocking and never waits on slow receivers, so a lagging subscriber **cannot block the publisher, grow the channel without bound, or kill a fast subscriber** (evidenced by `slow_subscriber_does_not_block_publisher_or_kill_the_bus` — fire 3× capacity at an idle subscriber, publisher unblocked, bus stays live, fresh fast subscriber receives, lagged one recovers); (2) **lock ordering / lock-across-await** (deadlock) — no code path holds two of `{state DashMap, registry RwLock, service RwLock}` simultaneously, so no inconsistent-ordering deadlock can exist; every `tokio::sync::RwLock` guard in `registry.rs`/`service.rs` is used in one synchronous statement and dropped before any `.await` (`call` explicitly scopes the read guard out before awaiting the handler); the only guard held across a send is the DashMap shard lock in `set`, across a **synchronous** broadcast send — safe; (3) **panic-on-input** — no reachable `unwrap`/`expect`/index in non-test code beyond the safe `send().unwrap_or(0)` and the dead-but-harmless `split_once(...).unwrap_or(...)` fallbacks on already-validated ids. `cargo test -p homecore --no-default-features`: **20 → 24 passed, 0 failed** (+4 pins). Workspace green; Python deterministic proof unchanged (`f8e76f21…46f7a`, bit-exact — `homecore` is off the signal proof path). Review notes appended to ADR-127 §9. +- **`homecore-migrate` security review (ADR-165 surfaces) — one real secret-leak fix; traversal / data-loss / panic / injection dimensions confirmed clean with evidence.** Beyond-SOTA review of the Home-Assistant `.storage`/`secrets.yaml`/`automations.yaml` migrator, the two sharp surfaces being secret handling (`secrets.rs`) and untrusted-file parsing. **Finding + fix (secret-leak, `secrets.rs`):** a malformed `secrets.yaml` whose offending scalar fails a typed-tag coercion (e.g. `port: !!int `) produced a `serde_yaml` error whose message **embeds the scalar verbatim** — `invalid value: string ""`. The old code wrapped that message into `MigrateError::YamlParse { source }`; the error propagates out of `read_secrets`, is `?`-returned by the `InspectSecrets` CLI path in `main.rs`, and printed to stderr by `anyhow` — **leaking a secret value despite the CLI's deliberate `` design** (`main.rs` only ever prints keys as ` = `). Fix: `secrets.yaml` parse failures now map to a dedicated redacting variant `MigrateError::SecretsParse { path, line, column }` carrying only the file path + a coarse location (from `serde_yaml::Error::location()`), never the scalar; other (non-secret) YAML files keep `YamlParse`. **Pinned** by `secrets::tests::malformed_secrets_error_never_contains_secret_value`, which asserts the rendered error **and its full `#[source]` chain** never contain the secret value and that the error is still the structured `SecretsParse` (fail-closed) — it **fails on the old `YamlParse` path** (observed leak: `... invalid value: string "s3cr3t_TOKEN_VALUE" ...`) and passes on the fix; plus `malformed_secrets_error_reports_location` (still locatable). **Confirmed clean with evidence:** *secret leakage elsewhere* — the only secret sink is the value map; `main.rs` redacts values, and the `MissingField`/`Io` paths surface only the path, never content. *Source mutation / data-loss* — **structurally impossible**: there is no `fs::write`/`fs::remove`/`fs::create`/`File::create`/`OpenOptions` anywhere in the crate; P1 reads source and writes nothing (`import-entities` is in-memory only), so re-runs are trivially idempotent and the HA source is never touched. *Path traversal* — CLI takes a `--config-dir`/`--storage` dir and joins **fixed** filenames (`secrets.yaml`, `core.entity_registry`, …); no user-controlled path component, no `..`/absolute escape beyond the user's own privileges. *Panic-on-input* — probed duplicate-key, bad-indent, tab/control-char, multi-doc, non-mapping-root, unterminated-flow, `!input` blueprint tags, deep nesting, anchors: **every** malformed/typed/truncated input **errors, never panics** (all production code is panic-free; every `unwrap`/`expect` is `#[cfg(test)]`). *Fail-closed versioning* — unknown storage `minor_version` hard-errors (no silent fallback to an older parser). *Injection* — no SQL/shell/path interpolation; the tool emits diagnostics only and persists nothing in P1. `homecore-migrate` **19 → 21** tests (`--no-default-features`), 0 failed. Behaviour otherwise unchanged; Python deterministic proof PASS, hash unchanged (`homecore-migrate` is off the signal proof path). - **`homecore-recorder` security review (ADR-132 surfaces) — two real bounding fixes; SQL-injection & NaN-index dimensions confirmed clean with evidence.** Beyond-SOTA review of the HA-compat state recorder (DB persistence + history + ruvector semantic search), the crux being its DB-backed SQL-injection surface. **Findings + fixes:** (1) **Memory-DoS — unbounded `get_state_history`.** The history query carried no `LIMIT`, so a wide `[since, until]` window over a high-frequency entity (a per-second sensor ≈ 86k rows/day) would load an unbounded row set into a single in-memory `Vec`. Added a hard `LIMIT MAX_HISTORY_ROWS` (1,000,000 — generous enough never to truncate a realistic history graph, bounded enough to cap the worst case); the sibling search paths were already `k`-bounded. (2) **Disk-DoS / documented-but-missing `purge`.** The README + HA-compat table advertised `Recorder::purge(older_than)` as a capability, but **no such method existed** — i.e. no retention path at all → unbounded disk growth. Implemented a **transactional** `purge` that deletes `states` + `events` strictly **older than** the cutoff (**exclusive** boundary — idempotent, no off-by-one; a row at the cutoff instant is kept) and **garbage-collects** orphaned `state_attributes` blobs (a dedup-shared blob is dropped only once its last referencing state is gone); all three deletes run in one transaction so a mid-purge failure rolls back cleanly (no states-deleted-but-events-kept corruption). **Confirmed clean with evidence:** SQL injection — **every** query in `db.rs` uses bound `?` parameters (no `format!`/string-concat of user data into SQL); the lone `format!` builds the LIKE *pattern*, which is itself bound as a parameter with `ESCAPE '\\'` and metacharacter escaping. Pinned: a state value `'; DROP TABLE states; --` is stored/queried **literally** (table survives), and a `%`/`_` in a search query matches **literally**, not as a wildcard. NaN-index poisoning (the calibration/vitals/geo class) — **structurally impossible** here: embeddings are SHA-256 → `i32` → `f32` (an `i32` cast to `f32` is always finite, never NaN/Inf), with an all-zero-digest norm guard; probed empty-index search, empty-string query, and `k=0` — all return `Ok(0)`, **no panic**. Fail-closed write path — a removal event yields `Ok(None)`, semantic-index failure is logged not propagated (best-effort, never blocks the durable SQLite write), and `EntityId` parsing failures fall back rather than panic. **6 new pinning tests** (SQL-injection literal-storage, LIKE-metacharacter literalness, history `LIMIT`, purge exclusive-boundary, purge attribute-GC-keeps-shared, purge old-events): `homecore-recorder` **19 → 25** (`--no-default-features`) / **25 → 31** (`--features ruvector`), 0 failed; the purge-boundary test is a true pin (fails deleting 2 rows under an inclusive cutoff, passes deleting 1 under the exclusive cutoff). Behaviour otherwise unchanged; Python deterministic proof unchanged (recorder is off the signal proof path). ### Added diff --git a/docs/adr/ADR-165-homecore-migrate-from-home-assistant.md b/docs/adr/ADR-165-homecore-migrate-from-home-assistant.md index 9c84176468..4f68b2311e 100644 --- a/docs/adr/ADR-165-homecore-migrate-from-home-assistant.md +++ b/docs/adr/ADR-165-homecore-migrate-from-home-assistant.md @@ -78,6 +78,23 @@ converts the entity registry; full conversion of the remaining artifacts is defe - `MigrateError` carries context (`path`, line/field) for I/O, JSON, YAML, missing-field, unsupported-schema-version, and entity-id parse failures (`src/lib.rs`). +- **Secret-leak hardening (security review, 2026-06).** `secrets.yaml` parse failures must + NOT use the generic `MigrateError::YamlParse { source }` variant: `serde_yaml`'s message + for a typed-tag coercion error (e.g. `port: !!int `) embeds the offending scalar + verbatim (`invalid value: string ""`), and that error propagates through + the `InspectSecrets` CLI path to stderr — leaking a secret value despite the CLI's + deliberate `` design. `read_secrets` now maps such failures to a dedicated + redacting variant `MigrateError::SecretsParse { path, line, column }` that carries only the + file path and a coarse location (`serde_yaml::Error::location()`), never the scalar content. + Pinned by `secrets::tests::malformed_secrets_error_never_contains_secret_value` (asserts the + rendered error **and its full `#[source]` chain** never contain the secret value). + **Review dimensions confirmed clean with evidence:** source is never mutated (no + `fs::write`/`remove`/`create` anywhere — P1 reads source, writes nothing); paths are + user-supplied dirs joined with fixed filenames (no `..`/absolute traversal beyond the + user's own privileges); malformed/typed/truncated `.storage` JSON and YAML **error, never + panic** (every production `unwrap`/`expect` is test-only); unknown schema `minor_version` + hard-errors fail-closed; no SQL/shell/path injection surface (the tool emits diagnostics + only, persists nothing in P1). ### 2.5 Deferred to P2+ (NOT built — honestly labelled) @@ -89,7 +106,9 @@ converts the entity registry; full conversion of the remaining artifacts is defe ### 2.6 Test evidence (as shipped) -- 19 tests (`cargo test -p homecore-migrate`), per the crate README badge. +- 21 tests (`cargo test -p homecore-migrate`) — 19 as originally shipped plus 2 added by the + 2026-06 security review (`secrets::tests::malformed_secrets_error_never_contains_secret_value`, + `malformed_secrets_error_reports_location`). ## 3. Consequences diff --git a/v2/crates/homecore-migrate/src/lib.rs b/v2/crates/homecore-migrate/src/lib.rs index 387eddb294..4eff111579 100644 --- a/v2/crates/homecore-migrate/src/lib.rs +++ b/v2/crates/homecore-migrate/src/lib.rs @@ -55,6 +55,25 @@ pub enum MigrateError { source: serde_yaml::Error, }, + /// Parse failure in a SECRET-bearing file (`secrets.yaml`). + /// + /// Unlike [`MigrateError::YamlParse`], this variant deliberately does NOT + /// embed the underlying `serde_yaml::Error` message — that message can quote + /// the offending scalar verbatim (e.g. a typed-tag coercion error renders + /// `invalid value: string ""`), which would leak a secret + /// into stderr/logs. We carry only the file path plus a coarse line/column + /// so the user can locate the problem without the value being printed. + /// (ADR-165 secret-handling rule: a secret value must never appear in output.) + #[error( + "secrets.yaml parse error in {path} (line {line}, column {column}): \ + malformed YAML (value content redacted)" + )] + SecretsParse { + path: String, + line: usize, + column: usize, + }, + /// Fired when the outer `{version, minor_version}` envelope version is /// known but the `minor_version` is not supported by any compiled parser. /// Per ADR-165 §6 Q5: hard error on unknown minor_version. diff --git a/v2/crates/homecore-migrate/src/secrets.rs b/v2/crates/homecore-migrate/src/secrets.rs index 9a26613d25..1c37997651 100644 --- a/v2/crates/homecore-migrate/src/secrets.rs +++ b/v2/crates/homecore-migrate/src/secrets.rs @@ -33,11 +33,19 @@ pub fn read_secrets(path: &Path) -> Result, MigrateError return Ok(HashMap::new()); } - let parsed: serde_yaml::Value = - serde_yaml::from_str(&raw).map_err(|e| MigrateError::YamlParse { + // SECURITY: do NOT use `MigrateError::YamlParse` here. serde_yaml error + // messages can quote the offending scalar verbatim (a typed-tag coercion + // error renders `invalid value: string ""`), and that + // message would be printed to stderr by the CLI — leaking a secret value. + // `MigrateError::SecretsParse` carries only the path + line/column. + let parsed: serde_yaml::Value = serde_yaml::from_str(&raw).map_err(|e| { + let loc = e.location(); + MigrateError::SecretsParse { path: path.display().to_string(), - source: e, - })?; + line: loc.as_ref().map_or(0, |l| l.line()), + column: loc.as_ref().map_or(0, |l| l.column()), + } + })?; let map = match parsed { serde_yaml::Value::Mapping(m) => m, @@ -94,6 +102,59 @@ mod tests { assert!(secrets.is_empty()); } + /// SECURITY regression (fails on the pre-fix `YamlParse` path): a malformed + /// `secrets.yaml` whose offending scalar is a secret value must NOT have that + /// value rendered in the returned error. serde_yaml's own error message for a + /// typed-tag coercion failure embeds the scalar verbatim + /// (`invalid value: string ""`); the old code wrapped that message + /// into `MigrateError::YamlParse { source }`, so `Display` leaked the secret. + #[test] + fn malformed_secrets_error_never_contains_secret_value() { + // `!!int` forces integer coercion of a string scalar; serde_yaml reports + // the scalar text in its message. The scalar here is a stand-in secret. + let yaml = "api_port: !!int s3cr3t_TOKEN_VALUE\n"; + let mut f = NamedTempFile::new().unwrap(); + f.write_all(yaml.as_bytes()).unwrap(); + + let err = read_secrets(f.path()).unwrap_err(); + let rendered = err.to_string(); + + // The secret VALUE must never appear in the error output... + assert!( + !rendered.contains("s3cr3t_TOKEN_VALUE"), + "secret value leaked into error: {rendered}" + ); + // ...and the full chain (with #[source]) must also be clean, since the + // CLI/anyhow prints the source chain too. + let mut source = std::error::Error::source(&err); + while let Some(s) = source { + assert!( + !s.to_string().contains("s3cr3t_TOKEN_VALUE"), + "secret value leaked into error source chain: {s}" + ); + source = s.source(); + } + + // It should still be a structured, locatable error (fail-closed). + assert!( + matches!(err, MigrateError::SecretsParse { .. }), + "expected SecretsParse, got: {err:?}" + ); + } + + /// A secret KEY name is non-sensitive context and is fine to surface, but the + /// redacting error must still help the user locate the problem (line/column). + #[test] + fn malformed_secrets_error_reports_location() { + let yaml = "api_port: !!int notanumber\n"; + let mut f = NamedTempFile::new().unwrap(); + f.write_all(yaml.as_bytes()).unwrap(); + let err = read_secrets(f.path()).unwrap_err(); + let rendered = err.to_string(); + assert!(rendered.contains("line"), "should report a line: {rendered}"); + assert!(rendered.contains("redacted"), "should signal redaction: {rendered}"); + } + #[test] fn secret_count_is_correct() { let yaml = "a: 1\nb: 2\nc: 3\n";