fix: parse SHOW REPLICAS data_info correctly; add replication health + drift metrics #2
Merged
Parse the v2.21+/v3.x SHOW REPLICAS format (with legacy fallback) and classify empty data_info as invalid — a registered, heartbeating replica that streams no data is now reported unhealthy. Expose per-replica health, status, behind-count, and main-vs-replica vertex/edge drift as Prometheus metrics, and require all registered replicas to be streaming for the ReplicationHealthy condition.
Add Issues 5-7 (epoch_id deletion by init container, missing replica remediation, -read service selecting broken replicas) and correct Issue 1's premise: empty data_info means streaming never engaged, not a parser artifact.
Summary
While debugging intermittent empty query results in a production cluster, we found a replica that had been registered and heartbeating for 104+ days while
replicating zero data (SHOW REPLICAS → data_info: {}). The operator never surfaced this: the parser read the wrong column, and both the ReplicationHealthy
condition and existing metrics were based on registration count, not actual data streaming.
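As a rough illustration of the failure mode described above, the sketch below classifies a replica from its decoded data_info. The `DatabaseInfo` type, its field names, the `ReplicaHealth` values, and the rule that only a `"ready"` status counts as healthy are all assumptions made for this example; they are not taken from internal/memgraph/client.go.

```go
package main

// DatabaseInfo is a hypothetical decoded form of one per-database entry in
// the data_info column. The field names are assumptions for this sketch.
type DatabaseInfo struct {
	Status string // assumed status string, e.g. "ready"
	Behind int64  // assumed lag counter relative to the main instance
}

// ReplicaHealth is the classification this sketch produces.
type ReplicaHealth string

const (
	HealthReady   ReplicaHealth = "ready"
	HealthInvalid ReplicaHealth = "invalid"
)

// ClassifyDataInfo returns HealthInvalid when data_info is empty, because a
// replica that is registered and heartbeating but reports no databases is
// streaming no data. It also returns HealthInvalid if any database entry has
// a status other than "ready" (an assumption of this sketch).
func ClassifyDataInfo(dataInfo map[string]DatabaseInfo) ReplicaHealth {
	if len(dataInfo) == 0 {
		return HealthInvalid
	}
	for _, db := range dataInfo {
		if db.Status != "ready" {
			return HealthInvalid
		}
	}
	return HealthReady
}
```

With this rule, the production replica from the summary (data_info: {}) would be classified as invalid regardless of how long it had been registered.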
Changes
Parser fix (FIXES-REQUIRED.md Issue 1) — internal/memgraph/client.go
artifact
New replication metrics — internal/controller/metrics.go
memgraph_replica_data_info_present
Condition correctness — ReplicationHealthy now requires every registered replica to be streaming (healthy >= registered), not just registered >= expected; message
reports both counts
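The condition change can be sketched as follows. This is a minimal illustration, not the operator's code: the function name `EvaluateReplicationHealthy`, its parameters, and the message format are hypothetical. Only the two comparisons (registered >= expected and healthy >= registered) come from the description above.

```go
package main

import "fmt"

// EvaluateReplicationHealthy reports whether replication should be considered
// healthy. The old check only compared registered against expected; this
// sketch additionally requires every registered replica to be streaming.
// The message format is an assumption; it reports both counts.
func EvaluateReplicationHealthy(expected, registered, healthy int) (bool, string) {
	ok := registered >= expected && healthy >= registered
	msg := fmt.Sprintf("%d/%d registered replicas streaming (expected %d)",
		healthy, registered, expected)
	return ok, msg
}
```

For example, calling it with expected=2, registered=2, healthy=1 yields false, whereas a registration-count check alone would have passed.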
Breaking / behavioral notes
intended detection of the production failure mode
Testing
Rollout
Version bumped to 0.2.0. Helm chart 0.1.4 (appVersion bump only) and infra canary (single cluster via per-cluster values override; collector scrape config added
for the one cluster missing it) go in separate PRs.