Skip to content

fix(sync-service): drop duplicate relation messages before routing#4560

Open
robacourt wants to merge 3 commits into
mainfrom
rob/drop-duplicate-relations
Open

fix(sync-service): drop duplicate relation messages before routing#4560
robacourt wants to merge 3 commits into
mainfrom
rob/drop-duplicate-relations

Conversation

@robacourt

@robacourt robacourt commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Relation messages can be emitted by Postgres when tables autoanalyze and autovacuum. Previously we checked all shapes to see if the Relation message affected them, which scales O(n) with the number of shapes (taking approx 25ms with 20K shapes). This PR checks to see if the Relation message has actually changed, and only checks the shapes if it has, saving that expensive operation.

Summary

  • classify relation messages as new, unchanged, changed, or identity changed
  • drop unchanged relation messages before shape routing so they do not call Filter.affected_shapes/2
  • keep inspector cache cleanup and partition semantics

@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.08%. Comparing base (1219cb4) to head (b96a3a9).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4560       +/-   ##
===========================================
- Coverage   70.96%   58.08%   -12.88%     
===========================================
  Files          83      369      +286     
  Lines        9856    40459    +30603     
  Branches     3124    11469     +8345     
===========================================
+ Hits         6994    23500    +16506     
- Misses       2844    16885    +14041     
- Partials       18       74       +56     
Flag Coverage Δ
packages/agents 71.37% <ø> (ø)
packages/agents-mcp 77.54% <ø> (?)
packages/agents-mobile 75.49% <ø> (?)
packages/agents-runtime 82.13% <ø> (?)
packages/agents-server 74.84% <ø> (ø)
packages/agents-server-ui 6.25% <ø> (?)
packages/electric-ax 46.42% <ø> (ø)
packages/experimental 87.73% <ø> (?)
packages/react-hooks 86.48% <ø> (?)
packages/start 82.83% <ø> (?)
packages/typescript-client 91.83% <ø> (?)
packages/y-electric 56.05% <ø> (?)
typescript 58.08% <ø> (-12.88%) ⬇️
unit-tests 58.08% <ø> (-12.88%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@netlify

netlify Bot commented Jun 11, 2026

Copy link
Copy Markdown

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit b96a3a9
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/6a2a9d8a0237460008173143
😎 Deploy Preview https://deploy-preview-4560--electric-next.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@robacourt robacourt requested review from alco and Copilot and removed request for Copilot June 11, 2026 08:49
@robacourt robacourt self-assigned this Jun 11, 2026
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Claude Code Review

Summary

Incremental review (iteration 2). Since the first pass, two commits — Fix relation tracker commit ordering and Add test — have landed. They fully resolve the critical error-path regression raised in iteration 1 and in @erik-the-implementer's follow-up analysis. The optimization (skip Filter.affected_shapes/2 for :unchanged relations) is intact and the recovery semantics are now safe in the failure direction. No new issues found.

What's Working Well

  • Critical issue resolved. handle_relation/2 now commits the tracker last: Partitions.handle_relation/2 runs first, then publish_relation/3, and only on success are PersistentReplicationState.set_tracked_relations/2 (disk) and the in-memory tracked_relations advanced (shape_log_collector.ex:647-663). This matches the suggested fix exactly and makes both failure windows safe:
    • A {:error, :connection_not_available} from partition inspection returns the original state, so the retried relation re-classifies as :changed/:new and is routed rather than dropped as :unchanged.
    • A crash after publish but before persist replays the relation as :changed on restart — redundant but harmless, never silently dropped.
  • Both windows are now tested, which is exactly where the regression lived:
    • "retries changed relation after partition inspection connection error" (shape_log_collector_test.exs:893) drives load_relation_info to fail once, asserts the {:error, :connection_not_available} propagates, then re-sends and asserts the change is consumed.
    • "does not persist changed relation before routing completes" (shape_log_collector_test.exs:932) patches Filter.affected_shapes/2 to raise mid-route and asserts the persisted tracker still holds the old relation — a precise guard on the commit ordering.
  • The consumer_test.exs test was tightened to use the real shape's root_table_id/table and now patches Filter.affected_shapes/2 to raise on the duplicate, directly proving the expensive path is skipped for an unchanged relation that genuinely matches a live shape (the earlier version used a non-matching synthetic relation).
  • relation_status classification (:new/:unchanged/:changed/:identity_changed) reads cleanly and the affected_columns_test.exs cases now assert the tag explicitly, including a dedicated :unchanged case.
  • Changeset present for @core/sync-service. ✅

Issues Found

Critical (Must Fix)

None. The iteration-1 critical issue is resolved.

Important (Should Fix)

None.

Suggestions (Nice to Have)

Partitions.handle_relation/2 (and its inspector reload) still runs for :unchanged relations on the hot path

File: packages/sync-service/lib/electric/replication/shape_log_collector.ex:643-663

For an :unchanged relation (the autovacuum/autoanalyze case this PR targets), rel == updated_rel is true, so Inspector.clean/2 clears the cache (line 644) and then Partitions.handle_relation/2 calls Inspector.load_relation_info/2, re-fetching from PG (partitions.ex:126). Then set_tracked_relations/2 persists a tracker_state that is byte-for-byte identical to the current one.

This is not a regression — the same clean + reload + persist happened before this PR; the PR only removed the O(n) affected_shapes loop, which is the 25ms cost it set out to eliminate, so the stated goal is achieved. But since the PR is explicitly about making the high-frequency unchanged path cheap, there may be further wins available: when relation_status == :unchanged, tracker_state is unchanged, so the persist (and the in-memory re-commit) could be skipped, and the cache-clean-then-reload round-trip on every autovacuum message is worth a second look. Out of scope for this fix; noting it for a potential follow-up.

No linked issue

Still no linked issue/incident. The PR description explains the motivation (~25ms per relation at 20K shapes) and the implementation matches it; consider linking the relevant performance ticket per project convention.

Issue Conformance

No linked issue (see above). The implementation matches the PR description, and the optimization is now backed by a direct regression guard in consumer_test.exs. The previously-noted correctness gap is closed.

Previous Review Status

  • Critical — tracked_relations advanced before Partitions.handle_relation/2: ✅ Resolved by Fix relation tracker commit ordering. Tracker (in-memory and persisted) is now committed only after partition handling and routing succeed.
  • Restart-with-early-persisted-tracker window (raised by @erik-the-implementer): ✅ Resolved by the same reorder — persist now happens after publish, so a crash before persist replays as :changed.
  • Suggestion — Partitions.handle_relation/2 runs when subscriptions == 0: Still present (the subscriptions check now lives inside publish_relation/3, after the partition call). Low priority; restated above in a broader form.

Review iteration: 2 | 2026-06-11

@erik-the-implementer

Copy link
Copy Markdown
Contributor

Note

This review was generated by Claude on request by @alco.

Dropping :unchanged relation messages is only safe under one invariant: the tracker state in tracked_relations is always consistent with the stored identity of every live shape. Routing "unchanged" relations through Filter.affected_shapes/2 was never a no-op by accident — it was the recovery mechanism for when that invariant breaks. Each shape compared its own root_table_id / name / column count against the message, so a shape that missed an earlier relation change got terminated on the next (identical) message. This PR removes that net, and there are two concrete windows where the invariant breaks:

1. Retry after connection_not_available (in-memory tracker advanced too early)

tracked_relations is committed into state (shape_log_collector.ex:653) before Partitions.handle_relation/2. When that returns {:error, :connection_not_available}, the error path returns the already-advanced state, and the replication client retries the same event (replication_client.ex:371-380). On retry the relation matches the tracker → :unchanged → dropped. A real schema change never reaches consumers.

2. Restart with the early-persisted tracker (not fixed by addressing #1 alone)

PersistentReplicationState.set_tracked_relations/2 persists the advanced tracker before partition handling and before publish/2. That ordering is pre-existing, but it used to be harmless precisely because unchanged relations were still routed. Now:

  • a :changed relation arrives → tracker persisted → Electric crashes/redeploys before consumers process the publish (or the handler errors and the stack restarts during the retry window — the :DOWN retry path in replication_client.ex shows handler crashes are expected and retried);
  • on restart, shapes reload stale, the tracker reloads already containing the new relation (shape_log_collector.ex:229);
  • PG re-sends the relation message on the new session → matches persisted tracker → :unchanged → dropped.

The stale shapes (select-all shape with old column count, shape on a renamed table, …) are never terminated and keep serving the wrong schema until the next actual schema change. Previously this exact sequence self-healed via the per-shape recheck.

Fix

Commit the tracker — both in-memory and persisted — only after Partitions.handle_relation/2 succeeds and publish/2 returns (publish is synchronous per dependency layer, so this is a straightforward reorder):

case Partitions.handle_relation(state.partitions, updated_rel) do
  {:ok, partitions} ->
    state = %{state | partitions: partitions}

    with {:ok, state} <- publish_relation(state, updated_rel, relation_status) do
      :ok =
        PersistentReplicationState.set_tracked_relations(
          tracker_state,
          state.persistent_replication_data_opts
        )

      {:ok, %{state | tracked_relations: tracker_state}}
    end

  {:error, :connection_not_available} ->
    {{:error, :connection_not_available}, state}
end

This makes the failure direction safe: a crash after publish but before persist means the relation re-classifies as :changed on replay and is re-routed — redundant but harmless — instead of silently dropped.

Tests

Both windows are currently uncovered, and they're exactly where this PR changes recovery semantics:

  • have Partitions.handle_relation/2 return {:error, :connection_not_available} for a changed relation, re-send it, assert it is routed;
  • persist a changed relation into PersistentReplicationState, restart the collector with shapes whose stored identity predates the change, re-send the same relation message, assert the affected shape is terminated.

(For the record: the steady-state optimization itself is sound. Info that's missing from relation messages entirely — nullability etc. — was never detectable by the shape recheck anyway; that's handled by Inspector.clean/2, which this PR correctly preserves for the :unchanged case.)

@robacourt robacourt force-pushed the rob/drop-duplicate-relations branch from 141952c to b96a3a9 Compare June 11, 2026 11:35
@robacourt

robacourt commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Addressed the tracker-ordering review feedback in b96a3a9.

Changes:

  • tracked_relations is no longer persisted or installed in memory before partition handling and relation publish complete.
  • On :connection_not_available, the collector returns the original tracker state so the retry reclassifies changed relations correctly.
  • Added regression coverage for retry after partition inspection failure and for not persisting a changed relation before routing completes.

Reran:

  • mix test test/electric/replication/shape_log_collector_test.exs
  • mix test test/electric/replication/shape_log_collector/affected_columns_test.exs
  • mix test test/electric/shapes/consumer_test.exs
  • git diff --check && git diff --cached --check

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants