fix: parse SHOW REPLICAS data_info correctly; add replication health + drift metrics by nitinstp23 · Pull Request #2 · base-14/memgraph-operator

nitinstp23 · 2026-06-04T17:20:39Z

Summary

While debugging intermittent empty query results in a production cluster, we found a replica that had been registered and heartbeating for 104+ days while
replicating zero data (SHOW REPLICAS → data_info: {}). The operator never surfaced this: the parser read the wrong column, and both the ReplicationHealthy
condition and existing metrics were based on registration count, not actual data streaming.

Changes

Parser fix (FIXES-REQUIRED.md Issue 1) — internal/memgraph/client.go

- Parse the current Memgraph format (name | socket_address | sync_mode | system_info | data_info), with per-row fallback for the legacy pre-v2.21 format
- Parse data_info into per-database {behind, status, ts}; empty data_info ({} or Null) classifies as invalid, because it means data streaming never engaged and is not a parse artifact
- ReplicaInfo gains SocketAddress/SyncMode/DataInfo/Behind/Status/Timestamp plus IsHealthy() / DataInfoPresent()
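The data_info classification described above can be sketched roughly as follows. This is an illustration only, not the code in internal/memgraph/client.go: the `DatabaseReplicaState` type and the `parseDataInfo` and `toInt64` helpers are hypothetical names, and the sketch assumes the driver hands data_info back as an already-decoded map keyed by database name.

```go
package main

import "fmt"

// DatabaseReplicaState holds the per-database fields named in the PR
// (behind, status, ts). The type itself is hypothetical.
type DatabaseReplicaState struct {
	Behind int64
	Status string
	TS     int64
}

// toInt64 accepts the integer shapes a decoded value might plausibly take.
func toInt64(v any) (int64, bool) {
	switch n := v.(type) {
	case int64:
		return n, true
	case int:
		return int64(n), true
	case float64:
		return int64(n), true
	}
	return 0, false
}

// parseDataInfo converts a decoded data_info value into per-database state.
// It returns present=false when data_info is nil, empty, or malformed; the
// caller treats that as "invalid" (streaming never engaged).
func parseDataInfo(raw any) (states map[string]DatabaseReplicaState, present bool) {
	m, ok := raw.(map[string]any)
	if !ok || len(m) == 0 {
		return nil, false
	}
	states = make(map[string]DatabaseReplicaState, len(m))
	for db, v := range m {
		fields, ok := v.(map[string]any)
		if !ok {
			return nil, false
		}
		behind, okBehind := toInt64(fields["behind"])
		status, okStatus := fields["status"].(string)
		ts, okTS := toInt64(fields["ts"])
		if !okBehind || !okStatus || !okTS {
			return nil, false
		}
		states[db] = DatabaseReplicaState{Behind: behind, Status: status, TS: ts}
	}
	return states, true
}

func main() {
	streaming := map[string]any{
		"memgraph": map[string]any{"behind": int64(0), "status": "ready", "ts": int64(1024)},
	}
	for _, raw := range []any{streaming, map[string]any{}, nil} {
		states, present := parseDataInfo(raw)
		if !present {
			fmt.Println("invalid: data_info empty or unparseable")
			continue
		}
		for db, s := range states {
			fmt.Printf("%s: behind=%d status=%s ts=%d\n", db, s.Behind, s.Status, s.TS)
		}
	}
}
```

Returning a separate `present` flag keeps "no data_info at all" distinct from "data_info present but lagging", which is the distinction the original parser missed.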

New replication metrics — internal/controller/metrics.go

- Per-replica: memgraph_replica_healthy, memgraph_replica_status, memgraph_replica_behind_count, memgraph_replica_last_confirmed_timestamp_seconds, memgraph_replica_data_info_present
- Rollups: memgraph_cluster_replicas_total, memgraph_cluster_replicas_healthy_total; memgraph_replication_healthy now means "all registered replicas streaming"
- Drift: memgraph_replication_vertex_drift / memgraph_replication_edge_drift (main minus replica, from existing SHOW STORAGE INFO collection)
- Stale series cleanup via DeletePartialMatch on replica removal and cluster deletion
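As a small illustration of the drift metrics, the "main minus replica" arithmetic might look like the sketch below. The `StorageCounts` and `Drift` types and the `computeDrift` function are hypothetical; the real values come from the operator's existing SHOW STORAGE INFO collection, which is not reproduced here.

```go
package main

import "fmt"

// StorageCounts is a hypothetical holder for the vertex and edge counts
// collected from one instance.
type StorageCounts struct {
	Vertices int64
	Edges    int64
}

// Drift is the signed difference main minus replica. A positive value
// means the replica is missing data relative to main.
type Drift struct {
	Vertex int64
	Edge   int64
}

// computeDrift returns main minus replica for both counts.
func computeDrift(mainCounts, replicaCounts StorageCounts) Drift {
	return Drift{
		Vertex: mainCounts.Vertices - replicaCounts.Vertices,
		Edge:   mainCounts.Edges - replicaCounts.Edges,
	}
}

func main() {
	mainCounts := StorageCounts{Vertices: 1000, Edges: 2500}
	emptyReplica := StorageCounts{} // a replica that never streamed any data
	d := computeDrift(mainCounts, emptyReplica)
	fmt.Printf("vertex_drift=%d edge_drift=%d\n", d.Vertex, d.Edge)
}
```

Keeping the value signed rather than absolute means a replica holding more data than main shows up as a negative number instead of being hidden.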

Condition correctness — ReplicationHealthy now requires every registered replica to be streaming (healthy >= registered), not just registered >= expected; message
reports both counts
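The condition change can be sketched as below. This is a hedged illustration, not the controller's code: the function name and the message wording are hypothetical, and the sketch assumes the old registered >= expected check is kept alongside the new healthy >= registered check.

```go
package main

import "fmt"

// replicationHealthy reports whether the ReplicationHealthy condition would
// hold, plus a message carrying both counts.
//
// Assumptions (not confirmed by the PR text): the registration check is
// retained, and the message format is invented for illustration.
func replicationHealthy(expected, registered, healthy int) (bool, string) {
	ok := registered >= expected && healthy >= registered
	msg := fmt.Sprintf("%d of %d registered replicas streaming (expected %d)",
		healthy, registered, expected)
	return ok, msg
}

func main() {
	// Two replicas registered as expected, but only one is streaming data.
	// The registration-only check would pass here; this one does not.
	ok, msg := replicationHealthy(2, 2, 1)
	fmt.Println(ok, msg)
}
```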

Breaking / behavioral notes

- Clusters with registered-but-not-streaming replicas will now (correctly) report ReplicationHealthy=False and emit unhealthy-replica warnings; this is the intended detection of the production failure mode
- MetricsRecorder.RecordReplicationHealth replaced by RecordReplicationLag (healthy signal moved to RecordReplicaSet)

Testing

- make lint: 0 issues; make test: all packages pass (incl. envtest suite)
- Parser table tests cover: current format healthy/empty/Null/multi-database/negative-behind, legacy format, malformed rows
- Metrics tests assert values, label sets, and series cleanup via prometheus/testutil

Rollout

Version bumped to 0.2.0. Helm chart 0.1.4 (appVersion bump only) and infra canary (single cluster via per-cluster values override; collector scrape config added
for the one cluster missing it) go in separate PRs.

Parse the v2.21+/v3.x SHOW REPLICAS format (with legacy fallback) and classify empty data_info as invalid — a registered, heartbeating replica that streams no data is now reported unhealthy. Expose per-replica health, status, behind-count, and main-vs-replica vertex/edge drift as Prometheus metrics, and require all registered replicas to be streaming for the ReplicationHealthy condition.

Add Issues 5-7 (epoch_id deletion by init container, missing replica remediation, -read service selecting broken replicas) and correct Issue 1's premise: empty data_info means streaming never engaged, not a parser artifact.

nitinstp23 added 3 commits June 4, 2026 22:44

🔖 chore: bump version to 0.2.0

7d549ed

nitinstp23 requested review from irfn and rnjn June 4, 2026 17:20

nitinstp23 self-assigned this Jun 4, 2026

nitinstp23 merged commit ac845fc into main Jun 5, 2026
6 checks passed

nitinstp23 merged 3 commits into main from nitin/replication-health-metrics
