✨ feat(controller): add stateful replica health classifier#3
Merged
Conversation
Ports the stateful replica health classifier from the parallel feat/replica-health-detection effort onto main's data model. Main was already equal or ahead on parsing, metric cleanup, negative-behind handling and multi-database aggregation, so only the genuinely net-new classifier is brought over; main's richer per-replica metrics are kept and the new gauges are additive. What's new: - Per-replica state machine (replica_classifier.go) distinguishing a freshly-registered replica (Transient, 30s grace) from a broken data channel (DataChannelDown), plus Behind / BehindTooLong / Invalid / UnknownStatus. Clears the behind timer when the channel is down so the behind_seconds gauge does not grow without bound. - Configurable ReplicationSpec.BehindAlertThreshold (default 5m) for duration-based "behind too long" detection. Declared as a pointer so an unset value is omitted and picks up the CRD default rather than serializing as "0s" (a value-typed metav1.Duration is never omitted by omitempty, which would trip the >0 CEL validation on typed-client creates). - New gauges memgraph_replica_data_channel_up and memgraph_replica_behind_seconds, recorded inside CheckReplicationHealth and swept on replica unregister / cluster deletion. - Specific, actionable events (ReplicaDataChannelDown / ReplicaInvalid / ReplicaBehindTooLong) replacing the generic ReplicaUnhealthy. ReplicationManager now holds per-replica timers (in-memory, reset on restart) and an injectable clock for deterministic tests. CheckReplicationHealth keeps its existing signature so RecordReplicaSet continues to populate main's snapshot metrics. Note: a replica within the 30s grace window now counts toward HealthyReplicas (Transient) instead of being immediately Invalid — the intended anti-flapping behavior. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
mainand the parallelfeat/replica-health-detectionbranch both descended from the replica-health design spec (83964ef) and independently solved the same "silent replication outage" problem — they're siblings, not parent/child. This PR ports the only piecemaingenuinely lacked.mainwas already equal or ahead on:DeleteClusterMetricspartial-match sweep +RecordReplicaSetsweep-and-rebuild),behindhandling (json.Unmarshalintoint64),SHOW REPLICASformat support.So none of that was ported. The net-new piece is the stateful classifier, re-implemented against
main's data model.main's richer per-replica metrics are kept intact; the new gauges are additive.What's new
replica_classifier.go— a per-replica state machine that distinguishes a freshly-registered replica (Transient, 30s grace) from a genuinely broken data channel (DataChannelDown), plusBehind/BehindTooLong/Invalid/UnknownStatus. It clears the behind timer when the channel is down sobehind_secondsdoesn't grow without bound. Status is normalized (lowercase/trim) to match the client'ssummarizeDataInfo.ReplicationSpec.BehindAlertThreshold(default5m) — configurable duration-based "behind too long" detection.mainpreviously recorded raw lag but never flagged sustained lag.memgraph_replica_data_channel_upandmemgraph_replica_behind_seconds, recorded inCheckReplicationHealthand swept on replica unregister / cluster deletion.ReplicaDataChannelDown/ReplicaInvalid/ReplicaBehindTooLong(with remediation text) replacing the genericReplicaUnhealthy.ReplicationManagernow holds per-replica timers (in-memory; reset on operator restart) and an injectable clock for deterministic tests.CheckReplicationHealthkeeps its existing signature soRecordReplicaSetcontinues populatingmain's snapshot metrics.Bug fixed vs. the source branch
The source branch declared
BehindAlertThresholdas a value-typedmetav1.Duration. Becauseomitemptycan't omit a struct, a typed Go client (controller-runtime, envtest) sends"0s"when the field is unset, which the strict>0CEL rule rejects with a 422 — breaking the existing controller envtest. Fixed here by making the field a pointer: unset → omitted → CRD default5mapplies, while explicit0s/negatives are still rejected.Behavior change worth noting
A replica within the 30s grace window now counts toward
HealthyReplicas(Transient) instead of being immediatelyInvalid— the intended anti-flapping behavior.Testing
go build ./...,go vet ./...— cleango test ./internal/... ./api/...— all green, including the envtest controller reconcile testmake generate manifests(diffs contain only the new field)