✨ feat(controller): add stateful replica health classifier by nitinstp23 · Pull Request #3 · base-14/memgraph-operator

nitinstp23 · 2026-06-08T09:39:53Z

Context

main and the parallel feat/replica-health-detection branch both descended from the replica-health design spec (83964ef) and independently solved the same "silent replication outage" problem — they're siblings, not parent/child. This PR ports the only piece main genuinely lacked.

main was already equal or ahead on:

per-replica metric cleanup (DeleteClusterMetrics partial-match sweep + RecordReplicaSet sweep-and-rebuild),
negative behind handling (json.Unmarshal into int64),
multi-database "worst-status-wins" aggregation,
legacy (pre-3.x) SHOW REPLICAS format support.

So none of that was ported. The net-new piece is the stateful classifier, re-implemented against main's data model. main's richer per-replica metrics are kept intact; the new gauges are additive.

What's new

replica_classifier.go — a per-replica state machine that distinguishes a freshly-registered replica (Transient, 30s grace) from a genuinely broken data channel (DataChannelDown), plus Behind / BehindTooLong / Invalid / UnknownStatus. It clears the behind timer when the channel is down so behind_seconds doesn't grow without bound. Status is normalized (lowercase/trim) to match the client's summarizeDataInfo.
ReplicationSpec.BehindAlertThreshold (default 5m) — configurable duration-based "behind too long" detection. main previously recorded raw lag but never flagged sustained lag.
Two new gauges — memgraph_replica_data_channel_up and memgraph_replica_behind_seconds, recorded in CheckReplicationHealth and swept on replica unregister / cluster deletion.
Specific, actionable events — ReplicaDataChannelDown / ReplicaInvalid / ReplicaBehindTooLong (with remediation text) replacing the generic ReplicaUnhealthy.

ReplicationManager now holds per-replica timers (in-memory; reset on operator restart) and an injectable clock for deterministic tests. CheckReplicationHealth keeps its existing signature so RecordReplicaSet continues populating main's snapshot metrics.

Bug fixed vs. the source branch

The source branch declared BehindAlertThreshold as a value-typed metav1.Duration. Because omitempty can't omit a struct, a typed Go client (controller-runtime, envtest) sends "0s" when the field is unset, which the strict >0 CEL rule rejects with a 422 — breaking the existing controller envtest. Fixed here by making the field a pointer: unset → omitted → CRD default 5m applies, while explicit 0s/negatives are still rejected.

Behavior change worth noting

A replica within the 30s grace window now counts toward HealthyReplicas (Transient) instead of being immediately Invalid — the intended anti-flapping behavior.

Testing

go build ./..., go vet ./... — clean
go test ./internal/... ./api/... — all green, including the envtest controller reconcile test
New: 17-case classifier table test (grace period, behind-too-long, invalid-wins, timer-clear, status normalization) + gauge record/delete tests
CRD + deepcopy regenerated via make generate manifests (diffs contain only the new field)

Ports the stateful replica health classifier from the parallel feat/replica-health-detection effort onto main's data model. Main was already equal or ahead on parsing, metric cleanup, negative-behind handling and multi-database aggregation, so only the genuinely net-new classifier is brought over; main's richer per-replica metrics are kept and the new gauges are additive. What's new: - Per-replica state machine (replica_classifier.go) distinguishing a freshly-registered replica (Transient, 30s grace) from a broken data channel (DataChannelDown), plus Behind / BehindTooLong / Invalid / UnknownStatus. Clears the behind timer when the channel is down so the behind_seconds gauge does not grow without bound. - Configurable ReplicationSpec.BehindAlertThreshold (default 5m) for duration-based "behind too long" detection. Declared as a pointer so an unset value is omitted and picks up the CRD default rather than serializing as "0s" (a value-typed metav1.Duration is never omitted by omitempty, which would trip the >0 CEL validation on typed-client creates). - New gauges memgraph_replica_data_channel_up and memgraph_replica_behind_seconds, recorded inside CheckReplicationHealth and swept on replica unregister / cluster deletion. - Specific, actionable events (ReplicaDataChannelDown / ReplicaInvalid / ReplicaBehindTooLong) replacing the generic ReplicaUnhealthy. ReplicationManager now holds per-replica timers (in-memory, reset on restart) and an injectable clock for deterministic tests. CheckReplicationHealth keeps its existing signature so RecordReplicaSet continues to populate main's snapshot metrics. Note: a replica within the 30s grace window now counts toward HealthyReplicas (Transient) instead of being immediately Invalid — the intended anti-flapping behavior. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

irfn

LGTM

nitinstp23 self-assigned this Jun 8, 2026

nitinstp23 requested a review from irfn June 8, 2026 09:40

irfn approved these changes Jun 8, 2026

View reviewed changes

nitinstp23 merged commit 02d9320 into main Jun 8, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

✨ feat(controller): add stateful replica health classifier#3

✨ feat(controller): add stateful replica health classifier#3
nitinstp23 merged 1 commit into
mainfrom
feat/port-replica-health-classifier

nitinstp23 commented Jun 8, 2026

Uh oh!

irfn left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nitinstp23 commented Jun 8, 2026

Context

What's new

Bug fixed vs. the source branch

Behavior change worth noting

Testing

Uh oh!

irfn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants