fix: parse SHOW REPLICAS data_info correctly; add replication health + drift metrics #2

Merged
nitinstp23 merged 3 commits into main from nitin/replication-health-metrics on Jun 5, 2026
@nitinstp23
Contributor

Summary

While debugging intermittent empty query results in a production cluster, we found a replica that had been registered and heartbeating for 104+ days while
replicating zero data (SHOW REPLICAS → data_info: {}). The operator never surfaced this: the parser read the wrong column, and both the ReplicationHealthy
condition and existing metrics were based on registration count, not actual data streaming.

Changes

Parser fix (FIXES-REQUIRED.md Issue 1) — internal/memgraph/client.go

  • Parse the current Memgraph format (name | socket_address | sync_mode | system_info | data_info), with per-row fallback for the legacy pre-v2.21 format
  • Parse data_info into per-database {behind, status, ts}; empty data_info ({}/Null) classifies as invalid — it means data streaming never engaged, not a parse
    artifact
  • ReplicaInfo gains SocketAddress/SyncMode/DataInfo/Behind/Status/Timestamp plus IsHealthy() / DataInfoPresent()
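
A minimal sketch of the data_info handling described above, not the code in internal/memgraph/client.go. It assumes the driver has already decoded data_info into a map[string]any keyed by database name; DatabaseReplication, parseDataInfo and isStreaming are hypothetical names. Treating every status other than "invalid" as streaming is also an assumption of the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// DatabaseReplication is a hypothetical per-database view of data_info.
type DatabaseReplication struct {
	Behind int64 // kept as reported, so negative values pass through
	Status string
	TS     int64
}

var errUnexpectedShape = errors.New("data_info: unexpected shape")

// toInt64 accepts the integer-like types a driver might hand back.
func toInt64(v any) (int64, bool) {
	switch n := v.(type) {
	case int64:
		return n, true
	case int:
		return int64(n), true
	case float64:
		return int64(n), true
	}
	return 0, false
}

// parseDataInfo turns a decoded data_info value into per-database entries.
// A nil (Null) or empty value yields an empty result and no error: that is
// a real state (streaming never engaged), not a parse failure.
func parseDataInfo(raw any) (map[string]DatabaseReplication, error) {
	if raw == nil {
		return nil, nil
	}
	m, ok := raw.(map[string]any)
	if !ok {
		return nil, fmt.Errorf("%w: got %T", errUnexpectedShape, raw)
	}
	out := make(map[string]DatabaseReplication, len(m))
	for db, v := range m {
		fields, ok := v.(map[string]any)
		if !ok {
			return nil, fmt.Errorf("%w: database %q is %T", errUnexpectedShape, db, v)
		}
		behind, okB := toInt64(fields["behind"])
		ts, okT := toInt64(fields["ts"])
		status, okS := fields["status"].(string)
		if !okB || !okT || !okS {
			return nil, fmt.Errorf("%w: database %q has missing or mistyped fields", errUnexpectedShape, db)
		}
		out[db] = DatabaseReplication{Behind: behind, Status: status, TS: ts}
	}
	return out, nil
}

// isStreaming reports false for empty data_info and for any database whose
// status is "invalid" (assumption: all other statuses count as streaming).
func isStreaming(dbs map[string]DatabaseReplication) bool {
	if len(dbs) == 0 {
		return false
	}
	for _, d := range dbs {
		if d.Status == "invalid" {
			return false
		}
	}
	return true
}
```

With this shape, a replica whose row carries data_info: {} parses without error but is never reported as streaming, which is the distinction the parser fix relies on.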

New replication metrics — internal/controller/metrics.go

  • Per-replica: memgraph_replica_healthy, memgraph_replica_status, memgraph_replica_behind_count, memgraph_replica_last_confirmed_timestamp_seconds,
    memgraph_replica_data_info_present
  • Rollups: memgraph_cluster_replicas_total, memgraph_cluster_replicas_healthy_total; memgraph_replication_healthy now means "all registered replicas streaming"
  • Drift: memgraph_replication_vertex_drift / memgraph_replication_edge_drift (main minus replica, from existing SHOW STORAGE INFO collection)
  • Stale series cleanup via DeletePartialMatch on replica removal and cluster deletion
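
The drift values in the list above can be sketched as a plain subtraction. This is an illustration under assumptions, not the code in internal/controller/metrics.go: StorageCounts, Drift and driftByReplica are hypothetical names, and the counts are assumed to have been collected already from SHOW STORAGE INFO on each instance.

```go
package main

// StorageCounts holds the vertex and edge counts read from one instance
// (hypothetical type for this sketch).
type StorageCounts struct {
	Vertices int64
	Edges    int64
}

// Drift is main minus replica. A positive value means the replica holds
// fewer objects than main; zero means the counts agree.
type Drift struct {
	Vertex int64
	Edge   int64
}

// driftByReplica computes main-minus-replica drift for every replica.
func driftByReplica(mainCounts StorageCounts, replicas map[string]StorageCounts) map[string]Drift {
	out := make(map[string]Drift, len(replicas))
	for name, r := range replicas {
		out[name] = Drift{
			Vertex: mainCounts.Vertices - r.Vertices,
			Edge:   mainCounts.Edges - r.Edges,
		}
	}
	return out
}
```

A replica that is registered but has never streamed data would show a drift equal to main's full vertex and edge counts, so the gauge catches the failure mode even if the status fields are misleading.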

Condition correctness — ReplicationHealthy now requires every registered replica to be streaming (healthy >= registered), not just registered >= expected; message
reports both counts
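
A small sketch contrasting the old and new checks. The function names, the Condition type and the message wording are hypothetical. Keeping the registered >= expected check alongside the new one is an assumption drawn from the "not just" phrasing above.

```go
package main

import "fmt"

// Condition is a hypothetical stand-in for the status condition.
type Condition struct {
	Healthy bool
	Message string
}

// registrationOnly is the old check: registration count alone.
func registrationOnly(expected, registered int) bool {
	return registered >= expected
}

// replicationHealthy is the new check: enough replicas are registered
// (assumed to be retained) and every registered replica is streaming.
func replicationHealthy(expected, registered, healthy int) Condition {
	ok := registered >= expected && healthy >= registered
	return Condition{
		Healthy: ok,
		Message: fmt.Sprintf("%d/%d registered replicas streaming (expected %d)",
			healthy, registered, expected),
	}
}
```

For a cluster expecting two replicas where both are registered but only one streams, registrationOnly(2, 2) is true while replicationHealthy(2, 2, 1).Healthy is false, which matches the production failure described in the summary.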

Breaking / behavioral notes

  • Clusters with registered-but-not-streaming replicas will now (correctly) report ReplicationHealthy=False and emit unhealthy-replica warnings — this is the
    intended detection of the production failure mode
  • MetricsRecorder.RecordReplicationHealth replaced by RecordReplicationLag (healthy signal moved to RecordReplicaSet)

Testing

  • make lint — 0 issues; make test — all packages pass (incl. envtest suite)
  • Parser table tests cover: current format healthy/empty/Null/multi-database/negative-behind, legacy format, malformed rows
  • Metrics tests assert values, label sets, and series cleanup via prometheus/testutil

Rollout

Version bumped to 0.2.0. Helm chart 0.1.4 (appVersion bump only) and infra canary (single cluster via per-cluster values override; collector scrape config added
for the one cluster missing it) go in separate PRs.

Parse the v2.21+/v3.x SHOW REPLICAS format (with legacy fallback) and
classify empty data_info as invalid — a registered, heartbeating replica
that streams no data is now reported unhealthy. Expose per-replica
health, status, behind-count, and main-vs-replica vertex/edge drift as
Prometheus metrics, and require all registered replicas to be streaming
for the ReplicationHealthy condition.

Add Issues 5-7 (epoch_id deletion by init container, missing replica
remediation, -read service selecting broken replicas) and correct
Issue 1's premise: empty data_info means streaming never engaged, not
a parser artifact.
nitinstp23 requested review from irfn and rnjn on Jun 4, 2026 17:20
nitinstp23 self-assigned this on Jun 4, 2026
nitinstp23 merged commit ac845fc into main on Jun 5, 2026
6 checks passed