Skip to content

✨ feat(controller): add stateful replica health classifier#3

Merged
nitinstp23 merged 1 commit into
mainfrom
feat/port-replica-health-classifier
Jun 8, 2026
Merged

✨ feat(controller): add stateful replica health classifier#3
nitinstp23 merged 1 commit into
mainfrom
feat/port-replica-health-classifier

Conversation

@nitinstp23

Copy link
Copy Markdown
Contributor

Context

main and the parallel feat/replica-health-detection branch both descended from the replica-health design spec (83964ef) and independently solved the same "silent replication outage" problem — they're siblings, not parent/child. This PR ports the only piece main genuinely lacked.

main was already equal or ahead on:

  • per-replica metric cleanup (DeleteClusterMetrics partial-match sweep + RecordReplicaSet sweep-and-rebuild),
  • negative behind handling (json.Unmarshal into int64),
  • multi-database "worst-status-wins" aggregation,
  • legacy (pre-3.x) SHOW REPLICAS format support.

So none of that was ported. The net-new piece is the stateful classifier, re-implemented against main's data model. main's richer per-replica metrics are kept intact; the new gauges are additive.

What's new

  • replica_classifier.go — a per-replica state machine that distinguishes a freshly-registered replica (Transient, 30s grace) from a genuinely broken data channel (DataChannelDown), plus Behind / BehindTooLong / Invalid / UnknownStatus. It clears the behind timer when the channel is down so behind_seconds doesn't grow without bound. Status is normalized (lowercase/trim) to match the client's summarizeDataInfo.
  • ReplicationSpec.BehindAlertThreshold (default 5m) — configurable duration-based "behind too long" detection. main previously recorded raw lag but never flagged sustained lag.
  • Two new gaugesmemgraph_replica_data_channel_up and memgraph_replica_behind_seconds, recorded in CheckReplicationHealth and swept on replica unregister / cluster deletion.
  • Specific, actionable eventsReplicaDataChannelDown / ReplicaInvalid / ReplicaBehindTooLong (with remediation text) replacing the generic ReplicaUnhealthy.

ReplicationManager now holds per-replica timers (in-memory; reset on operator restart) and an injectable clock for deterministic tests. CheckReplicationHealth keeps its existing signature so RecordReplicaSet continues populating main's snapshot metrics.

Bug fixed vs. the source branch

The source branch declared BehindAlertThreshold as a value-typed metav1.Duration. Because omitempty can't omit a struct, a typed Go client (controller-runtime, envtest) sends "0s" when the field is unset, which the strict >0 CEL rule rejects with a 422 — breaking the existing controller envtest. Fixed here by making the field a pointer: unset → omitted → CRD default 5m applies, while explicit 0s/negatives are still rejected.

Behavior change worth noting

A replica within the 30s grace window now counts toward HealthyReplicas (Transient) instead of being immediately Invalid — the intended anti-flapping behavior.

Testing

  • go build ./..., go vet ./... — clean
  • go test ./internal/... ./api/... — all green, including the envtest controller reconcile test
  • New: 17-case classifier table test (grace period, behind-too-long, invalid-wins, timer-clear, status normalization) + gauge record/delete tests
  • CRD + deepcopy regenerated via make generate manifests (diffs contain only the new field)

Ports the stateful replica health classifier from the parallel
feat/replica-health-detection effort onto main's data model. Main was
already equal or ahead on parsing, metric cleanup, negative-behind
handling and multi-database aggregation, so only the genuinely net-new
classifier is brought over; main's richer per-replica metrics are kept
and the new gauges are additive.

What's new:
- Per-replica state machine (replica_classifier.go) distinguishing a
  freshly-registered replica (Transient, 30s grace) from a broken data
  channel (DataChannelDown), plus Behind / BehindTooLong / Invalid /
  UnknownStatus. Clears the behind timer when the channel is down so the
  behind_seconds gauge does not grow without bound.
- Configurable ReplicationSpec.BehindAlertThreshold (default 5m) for
  duration-based "behind too long" detection. Declared as a pointer so
  an unset value is omitted and picks up the CRD default rather than
  serializing as "0s" (a value-typed metav1.Duration is never omitted by
  omitempty, which would trip the >0 CEL validation on typed-client
  creates).
- New gauges memgraph_replica_data_channel_up and
  memgraph_replica_behind_seconds, recorded inside CheckReplicationHealth
  and swept on replica unregister / cluster deletion.
- Specific, actionable events (ReplicaDataChannelDown / ReplicaInvalid /
  ReplicaBehindTooLong) replacing the generic ReplicaUnhealthy.

ReplicationManager now holds per-replica timers (in-memory, reset on
restart) and an injectable clock for deterministic tests.
CheckReplicationHealth keeps its existing signature so RecordReplicaSet
continues to populate main's snapshot metrics.

Note: a replica within the 30s grace window now counts toward
HealthyReplicas (Transient) instead of being immediately Invalid — the
intended anti-flapping behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@nitinstp23 nitinstp23 self-assigned this Jun 8, 2026
@nitinstp23 nitinstp23 requested a review from irfn June 8, 2026 09:40

@irfn irfn left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nitinstp23 nitinstp23 merged commit 02d9320 into main Jun 8, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants