Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Security
- **`homecore` foundational state-machine review (ADR-127) — one real concurrency bug fixed (state-set TOCTOU dropping/reordering `state_changed` events) + two hardening fixes (entity_id memory-DoS, service-handler panic isolation), each pinned by a fails-on-old test; event-bus lag & lock discipline confirmed clean with evidence.** Beyond-SOTA security+concurrency review of the crate every other HOMECORE module builds on (state store `state.rs`, event bus `bus.rs`, service/entity registries, the `HomeCore` coordinator), un-covered by the ADR-154–159 sweep — a bug here is high-blast-radius. **HC-RACE-01 (state-set TOCTOU, the crux — race/lost-event).** `StateMachine::set` did `get()` (releasing the DashMap shard lock) → compute the next snapshot + the no-op/`last_changed` decision → `insert()` (re-acquiring the lock) → `send()`; the read-modify-write was **not atomic** w.r.t. a concurrent writer on the same entity, contradicting ADR-127 §2.1's promise that "the writer atomically replaces the map entry." A writer that read a **stale `old`** could mis-classify a genuine transition as a no-op and **silently drop its `state_changed` event** (a missed automation trigger) or fire an event whose `new_state` duplicated the previously delivered one (a spurious trigger for any automation keyed on `old_state != new_state`). **Fixed** by holding the shard write-lock across the whole read→decide→insert→fire sequence via `entry()`/`insert_entry()` — `tx.send` is non-blocking, non-async, and never re-enters the map, so firing under the shard lock cannot deadlock and keeps global event order in lock-step with global commit order. Pinned by `concurrent_set_fires_no_duplicate_adjacent_events` (4 writers toggling one entity A/B; asserts no two consecutive fired events carry an identical `new_state` — impossible under correct serialisation; an instrumented probe observed ~93k such duplicate-adjacent events across 200 trials on the racy code, **zero** on the fix; the test fails reliably on the first trial pre-fix). **HC-EID-LEN-01 (unbounded `entity_id`, memory-DoS).** `homecore-api/src/rest.rs` parses untrusted REST path segments straight through `EntityId::parse`; with no length cap an otherwise-valid id (`a.` + many MB of `[a-z0-9_]`) was accepted, and a `POST /api/states/<giant>` would persist it into the DashMap state store (permanent growth across distinct ids). **Fixed** by rejecting ids longer than `MAX_ENTITY_ID_LEN` (255, HA-compatible) up front in `parse()`, before any per-char scan, with a new `EntityIdError::TooLong` — fail-closed at the boundary type protects every caller (REST, registry deserialize, automation). Pinned by `entity_id_length_boundary` (exactly-MAX accepted; MAX+1 and a 4 MiB id rejected — oversized parses `Ok` on old code). **HC-SVC-PANIC-01 (service-handler panic not isolated).** `ServiceRegistry::call` already ran handlers **outside** the registry lock (the `Arc<dyn ServiceHandler>` is cloned out of the read guard first → no `RwLock` poisoning, no blocking of other callers — clean), but a panicking handler unwound through `call()` into the caller's task (the task driving the engine). **Hardened** by wrapping the handler future in `AssertUnwindSafe` + `catch_unwind`, converting a panic to `ServiceError::HandlerPanicked`; the registry stays fully usable (a sibling healthy service still returns, the bad service stays registered). Pinned by `panicking_handler_is_isolated_and_registry_survives` (unwinds through `call` on old code). **Dimensions confirmed clean (with evidence, no invented issues):** (1) **event-bus bounds / lag** (the homecore-api WS lag-DoS class) — both `StateMachine` and `EventBus` use **bounded** `tokio::sync::broadcast` (capacity 4,096); a slow subscriber gets a recoverable `Lagged(n)` (drop-oldest + re-sync) while `fire_*` is non-blocking and never waits on slow receivers, so a lagging subscriber **cannot block the publisher, grow the channel without bound, or kill a fast subscriber** (evidenced by `slow_subscriber_does_not_block_publisher_or_kill_the_bus` — fire 3× capacity at an idle subscriber, publisher unblocked, bus stays live, fresh fast subscriber receives, lagged one recovers); (2) **lock ordering / lock-across-await** (deadlock) — no code path holds two of `{state DashMap, registry RwLock, service RwLock}` simultaneously, so no inconsistent-ordering deadlock can exist; every `tokio::sync::RwLock` guard in `registry.rs`/`service.rs` is used in one synchronous statement and dropped before any `.await` (`call` explicitly scopes the read guard out before awaiting the handler); the only guard held across a send is the DashMap shard lock in `set`, across a **synchronous** broadcast send — safe; (3) **panic-on-input** — no reachable `unwrap`/`expect`/index in non-test code beyond the safe `send().unwrap_or(0)` and the dead-but-harmless `split_once(...).unwrap_or(...)` fallbacks on already-validated ids. `cargo test -p homecore --no-default-features`: **20 → 24 passed, 0 failed** (+4 pins). Workspace green; Python deterministic proof unchanged (`f8e76f21…46f7a`, bit-exact — `homecore` is off the signal proof path). Review notes appended to ADR-127 §9.
- **`homecore-migrate` security review (ADR-165 surfaces) — one real secret-leak fix; traversal / data-loss / panic / injection dimensions confirmed clean with evidence.** Beyond-SOTA review of the Home-Assistant `.storage`/`secrets.yaml`/`automations.yaml` migrator, the two sharp surfaces being secret handling (`secrets.rs`) and untrusted-file parsing. **Finding + fix (secret-leak, `secrets.rs`):** a malformed `secrets.yaml` whose offending scalar fails a typed-tag coercion (e.g. `port: !!int <value>`) produced a `serde_yaml` error whose message **embeds the scalar verbatim** — `invalid value: string "<the-secret-value>"`. The old code wrapped that message into `MigrateError::YamlParse { source }`; the error propagates out of `read_secrets`, is `?`-returned by the `InspectSecrets` CLI path in `main.rs`, and printed to stderr by `anyhow` — **leaking a secret value despite the CLI's deliberate `<redacted>` design** (`main.rs` only ever prints keys as `<key> = <redacted>`). Fix: `secrets.yaml` parse failures now map to a dedicated redacting variant `MigrateError::SecretsParse { path, line, column }` carrying only the file path + a coarse location (from `serde_yaml::Error::location()`), never the scalar; other (non-secret) YAML files keep `YamlParse`. **Pinned** by `secrets::tests::malformed_secrets_error_never_contains_secret_value`, which asserts the rendered error **and its full `#[source]` chain** never contain the secret value and that the error is still the structured `SecretsParse` (fail-closed) — it **fails on the old `YamlParse` path** (observed leak: `... invalid value: string "s3cr3t_TOKEN_VALUE" ...`) and passes on the fix; plus `malformed_secrets_error_reports_location` (still locatable). **Confirmed clean with evidence:** *secret leakage elsewhere* — the only secret sink is the value map; `main.rs` redacts values, and the `MissingField`/`Io` paths surface only the path, never content. *Source mutation / data-loss* — **structurally impossible**: there is no `fs::write`/`fs::remove`/`fs::create`/`File::create`/`OpenOptions` anywhere in the crate; P1 reads source and writes nothing (`import-entities` is in-memory only), so re-runs are trivially idempotent and the HA source is never touched. *Path traversal* — CLI takes a `--config-dir`/`--storage` dir and joins **fixed** filenames (`secrets.yaml`, `core.entity_registry`, …); no user-controlled path component, no `..`/absolute escape beyond the user's own privileges. *Panic-on-input* — probed duplicate-key, bad-indent, tab/control-char, multi-doc, non-mapping-root, unterminated-flow, `!input` blueprint tags, deep nesting, anchors: **every** malformed/typed/truncated input **errors, never panics** (all production code is panic-free; every `unwrap`/`expect` is `#[cfg(test)]`). *Fail-closed versioning* — unknown storage `minor_version` hard-errors (no silent fallback to an older parser). *Injection* — no SQL/shell/path interpolation; the tool emits diagnostics only and persists nothing in P1. `homecore-migrate` **19 → 21** tests (`--no-default-features`), 0 failed. Behaviour otherwise unchanged; Python deterministic proof PASS, hash unchanged (`homecore-migrate` is off the signal proof path).
- **`homecore-recorder` security review (ADR-132 surfaces) — two real bounding fixes; SQL-injection & NaN-index dimensions confirmed clean with evidence.** Beyond-SOTA review of the HA-compat state recorder (DB persistence + history + ruvector semantic search), the crux being its DB-backed SQL-injection surface. **Findings + fixes:** (1) **Memory-DoS — unbounded `get_state_history`.** The history query carried no `LIMIT`, so a wide `[since, until]` window over a high-frequency entity (a per-second sensor ≈ 86k rows/day) would load an unbounded row set into a single in-memory `Vec`. Added a hard `LIMIT MAX_HISTORY_ROWS` (1,000,000 — generous enough never to truncate a realistic history graph, bounded enough to cap the worst case); the sibling search paths were already `k`-bounded. (2) **Disk-DoS / documented-but-missing `purge`.** The README + HA-compat table advertised `Recorder::purge(older_than)` as a capability, but **no such method existed** — i.e. no retention path at all → unbounded disk growth. Implemented a **transactional** `purge` that deletes `states` + `events` strictly **older than** the cutoff (**exclusive** boundary — idempotent, no off-by-one; a row at the cutoff instant is kept) and **garbage-collects** orphaned `state_attributes` blobs (a dedup-shared blob is dropped only once its last referencing state is gone); all three deletes run in one transaction so a mid-purge failure rolls back cleanly (no states-deleted-but-events-kept corruption). **Confirmed clean with evidence:** SQL injection — **every** query in `db.rs` uses bound `?` parameters (no `format!`/string-concat of user data into SQL); the lone `format!` builds the LIKE *pattern*, which is itself bound as a parameter with `ESCAPE '\\'` and metacharacter escaping. Pinned: a state value `'; DROP TABLE states; --` is stored/queried **literally** (table survives), and a `%`/`_` in a search query matches **literally**, not as a wildcard. NaN-index poisoning (the calibration/vitals/geo class) — **structurally impossible** here: embeddings are SHA-256 → `i32` → `f32` (an `i32` cast to `f32` is always finite, never NaN/Inf), with an all-zero-digest norm guard; probed empty-index search, empty-string query, and `k=0` — all return `Ok(0)`, **no panic**. Fail-closed write path — a removal event yields `Ok(None)`, semantic-index failure is logged not propagated (best-effort, never blocks the durable SQLite write), and `EntityId` parsing failures fall back rather than panic. **6 new pinning tests** (SQL-injection literal-storage, LIKE-metacharacter literalness, history `LIMIT`, purge exclusive-boundary, purge attribute-GC-keeps-shared, purge old-events): `homecore-recorder` **19 → 25** (`--no-default-features`) / **25 → 31** (`--features ruvector`), 0 failed; the purge-boundary test is a true pin (fails deleting 2 rows under an inclusive cutoff, passes deleting 1 under the exclusive cutoff). Behaviour otherwise unchanged; Python deterministic proof unchanged (recorder is off the signal proof path).

### Added
Expand Down
21 changes: 20 additions & 1 deletion docs/adr/ADR-165-homecore-migrate-from-home-assistant.md
Original file line number Diff line number Diff line change
Expand Up @@ -78,6 +78,23 @@ converts the entity registry; full conversion of the remaining artifacts is defe

- `MigrateError` carries context (`path`, line/field) for I/O, JSON, YAML, missing-field,
unsupported-schema-version, and entity-id parse failures (`src/lib.rs`).
- **Secret-leak hardening (security review, 2026-06).** `secrets.yaml` parse failures must
NOT use the generic `MigrateError::YamlParse { source }` variant: `serde_yaml`'s message
for a typed-tag coercion error (e.g. `port: !!int <value>`) embeds the offending scalar
verbatim (`invalid value: string "<the-secret-value>"`), and that error propagates through
the `InspectSecrets` CLI path to stderr — leaking a secret value despite the CLI's
deliberate `<redacted>` design. `read_secrets` now maps such failures to a dedicated
redacting variant `MigrateError::SecretsParse { path, line, column }` that carries only the
file path and a coarse location (`serde_yaml::Error::location()`), never the scalar content.
Pinned by `secrets::tests::malformed_secrets_error_never_contains_secret_value` (asserts the
rendered error **and its full `#[source]` chain** never contain the secret value).
**Review dimensions confirmed clean with evidence:** source is never mutated (no
`fs::write`/`remove`/`create` anywhere — P1 reads source, writes nothing); paths are
user-supplied dirs joined with fixed filenames (no `..`/absolute traversal beyond the
user's own privileges); malformed/typed/truncated `.storage` JSON and YAML **error, never
panic** (every production `unwrap`/`expect` is test-only); unknown schema `minor_version`
hard-errors fail-closed; no SQL/shell/path injection surface (the tool emits diagnostics
only, persists nothing in P1).

### 2.5 Deferred to P2+ (NOT built — honestly labelled)

Expand All @@ -89,7 +106,9 @@ converts the entity registry; full conversion of the remaining artifacts is defe

### 2.6 Test evidence (as shipped)

- 19 tests (`cargo test -p homecore-migrate`), per the crate README badge.
- 21 tests (`cargo test -p homecore-migrate`) — 19 as originally shipped plus 2 added by the
2026-06 security review (`secrets::tests::malformed_secrets_error_never_contains_secret_value`,
`malformed_secrets_error_reports_location`).

## 3. Consequences

Expand Down
19 changes: 19 additions & 0 deletions v2/crates/homecore-migrate/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,25 @@ pub enum MigrateError {
source: serde_yaml::Error,
},

/// Parse failure in a SECRET-bearing file (`secrets.yaml`).
///
/// Unlike [`MigrateError::YamlParse`], this variant deliberately does NOT
/// embed the underlying `serde_yaml::Error` message — that message can quote
/// the offending scalar verbatim (e.g. a typed-tag coercion error renders
/// `invalid value: string "<the-secret-value>"`), which would leak a secret
/// into stderr/logs. We carry only the file path plus a coarse line/column
/// so the user can locate the problem without the value being printed.
/// (ADR-165 secret-handling rule: a secret value must never appear in output.)
#[error(
"secrets.yaml parse error in {path} (line {line}, column {column}): \
malformed YAML (value content redacted)"
)]
SecretsParse {
path: String,
line: usize,
column: usize,
},

/// Fired when the outer `{version, minor_version}` envelope version is
/// known but the `minor_version` is not supported by any compiled parser.
/// Per ADR-165 §6 Q5: hard error on unknown minor_version.
Expand Down
69 changes: 65 additions & 4 deletions v2/crates/homecore-migrate/src/secrets.rs
Original file line number Diff line number Diff line change
Expand Up @@ -33,11 +33,19 @@ pub fn read_secrets(path: &Path) -> Result<HashMap<String, String>, MigrateError
return Ok(HashMap::new());
}

let parsed: serde_yaml::Value =
serde_yaml::from_str(&raw).map_err(|e| MigrateError::YamlParse {
// SECURITY: do NOT use `MigrateError::YamlParse` here. serde_yaml error
// messages can quote the offending scalar verbatim (a typed-tag coercion
// error renders `invalid value: string "<the-secret-value>"`), and that
// message would be printed to stderr by the CLI — leaking a secret value.
// `MigrateError::SecretsParse` carries only the path + line/column.
let parsed: serde_yaml::Value = serde_yaml::from_str(&raw).map_err(|e| {
let loc = e.location();
MigrateError::SecretsParse {
path: path.display().to_string(),
source: e,
})?;
line: loc.as_ref().map_or(0, |l| l.line()),
column: loc.as_ref().map_or(0, |l| l.column()),
}
})?;

let map = match parsed {
serde_yaml::Value::Mapping(m) => m,
Expand Down Expand Up @@ -94,6 +102,59 @@ mod tests {
assert!(secrets.is_empty());
}

/// SECURITY regression (fails on the pre-fix `YamlParse` path): a malformed
/// `secrets.yaml` whose offending scalar is a secret value must NOT have that
/// value rendered in the returned error. serde_yaml's own error message for a
/// typed-tag coercion failure embeds the scalar verbatim
/// (`invalid value: string "<secret>"`); the old code wrapped that message
/// into `MigrateError::YamlParse { source }`, so `Display` leaked the secret.
#[test]
fn malformed_secrets_error_never_contains_secret_value() {
// `!!int` forces integer coercion of a string scalar; serde_yaml reports
// the scalar text in its message. The scalar here is a stand-in secret.
let yaml = "api_port: !!int s3cr3t_TOKEN_VALUE\n";
let mut f = NamedTempFile::new().unwrap();
f.write_all(yaml.as_bytes()).unwrap();

let err = read_secrets(f.path()).unwrap_err();
let rendered = err.to_string();

// The secret VALUE must never appear in the error output...
assert!(
!rendered.contains("s3cr3t_TOKEN_VALUE"),
"secret value leaked into error: {rendered}"
);
// ...and the full chain (with #[source]) must also be clean, since the
// CLI/anyhow prints the source chain too.
let mut source = std::error::Error::source(&err);
while let Some(s) = source {
assert!(
!s.to_string().contains("s3cr3t_TOKEN_VALUE"),
"secret value leaked into error source chain: {s}"
);
source = s.source();
}

// It should still be a structured, locatable error (fail-closed).
assert!(
matches!(err, MigrateError::SecretsParse { .. }),
"expected SecretsParse, got: {err:?}"
);
}

/// A secret KEY name is non-sensitive context and is fine to surface, but the
/// redacting error must still help the user locate the problem (line/column).
#[test]
fn malformed_secrets_error_reports_location() {
let yaml = "api_port: !!int notanumber\n";
let mut f = NamedTempFile::new().unwrap();
f.write_all(yaml.as_bytes()).unwrap();
let err = read_secrets(f.path()).unwrap_err();
let rendered = err.to_string();
assert!(rendered.contains("line"), "should report a line: {rendered}");
assert!(rendered.contains("redacted"), "should signal redaction: {rendered}");
}

#[test]
fn secret_count_is_correct() {
let yaml = "a: 1\nb: 2\nc: 3\n";
Expand Down
Loading