
feat: harden add-node workflow for data safety, add failover test #57

Merged: mmols merged 5 commits into main from feat/PLAT-616/add-node-hardening on May 21, 2026

Conversation

@mmols commented May 20, 2026 (Member)

This PR hardens the zero-downtime add-node populate flow with PeerCatchup, ReplicationOriginAdvance, and a WaitForSyncEvent fix that aligns with recent updates in the Control Plane project: pgEdge/control-plane#385.

In addition, it adds TestUnplannedFailover and an origin_advanced_on_n3 assertion to the existing zero-downtime add-node test to improve integration test coverage.

Each commit can be reviewed independently.

mmols added 2 commits May 20, 2026 15:31
- Add PeerCatchup resource: waits for the source to actually apply peer
  commits (not just receive WAL) before the source->new COPY starts.
  Closes a data-loss window during add-node.
- Add ReplicationOriginAdvance resource: keeps the subscriber-side
  origin in lockstep with the provider-side slot, so the apply worker
  resumes from the right LSN instead of replaying WAL from 0/0.
- ReplicationSlotAdvanceFromCTS now records the LSN it advanced to
  (empty when skipped) so origin advance can no-op cleanly.
- WaitForSyncEvent treats disabled/down subscription states as
  transient with backoff polling instead of failing fast.
- Wire the new resources into addPopulateResources

coderabbitai Bot commented May 20, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f8d0531e-d760-4554-a60a-e2ccd8dca497

📥 Commits

Reviewing files that changed from the base of the PR and between 2a64c99 and 1071b29.

📒 Files selected for processing (3)
  • internal/spock/replication_slot_advance.go
  • internal/spock/spock_test.go
  • test/integration/replication_helpers_test.go

📝 Walkthrough


This PR introduces two new Spock resource types to refine replication setup gating: PeerCatchup polls peer-apply progress before COPY progresses, and ReplicationOriginAdvance advances the subscriber origin after slot advance. Slot-advance logic now outputs the target LSN for downstream use. Subscription dependencies shift to gate peer→new subscriptions on origin advance rather than slot advance. Polling backoff is added for transient states. Unit tests validate the new resources and wiring; integration tests verify unplanned failover scenarios.

Changes

Spock Replication Resources and Orchestration

| Layer | File(s) | Summary |
|---|---|---|
| Resource Type Constants and Contracts | internal/spock/desired.go | Two new resource type constants define ResourceTypeReplicationOriginAdvance and ResourceTypePeerCatchup for orchestration identification. |
| PeerCatchup Resource Implementation | internal/spock/peer_catchup.go | PeerCatchup resource type with constructor, resource metadata methods, and an ephemeral Create that polls spock.progress until peer remote_lsn reaches the target LSN from a paired SyncEvent. |
| ReplicationSlotAdvanceFromCTS Output Field and Logic | internal/spock/replication_slot_advance.go | Slot-advance resource exposes AdvancedToLSN field (empty when skipped) for downstream communication; Create now reads current restart_lsn, skips when target is at or before current position, and conditionally advances based on confirmed_flush_lsn. |
| ReplicationOriginAdvance Resource Implementation | internal/spock/replication_origin_advance.go | ReplicationOriginAdvance resource type that conditionally creates and advances pg_replication_origin to match AdvancedToLSN from a paired slot-advance resource; depends on ReplicationSlotAdvanceFromCTS and is ephemeral. |
| ComputeDesired Wiring and Polling Backoff | internal/spock/desired.go, internal/spock/wait_for_sync_event.go | Subscription dependencies shift so peer→new subscriptions wait on ReplicationOriginAdvance instead of slot-advance; populate chain creates PeerCatchup resources gating COPY on peer apply progress; sync-event polling adds backoff for transient disabled/down states instead of busy-looping. |
| Unit Tests for Resources and Dependency Wiring | internal/spock/spock_test.go | New tests validate ReplicationOriginAdvance and PeerCatchup identifiers, dependencies, and ephemeral behavior; ComputeDesired tests confirm new resources are in populate plans and subscription dependencies use the updated gates. |

Integration Testing Infrastructure and Failover Validation

| Layer | File(s) | Summary |
|---|---|---|
| Replication Test Helper Utilities | test/integration/replication_helpers_test.go | waitForReplication triggers sync events and polls all pods for event delivery; verifyMeshReplication inserts per-pod test rows and validates each source replicates to every destination (skipping self); podToNode converts pod names to node identifiers. |
| Failover Test Infrastructure and Build Target | test/integration/failover_helpers.go, test/Makefile | Failover helpers: getPrimaryPod locates CNPG primary via kubectl label, killPrimary force-deletes the primary pod, waitForStandbySlotSync polls logical slot sync on standbys, waitForPromotion awaits new primary election and readiness; test-failover Make target runs the test. |
| Unplanned Failover Integration Test | test/integration/failover_test.go | TestUnplannedFailover installs a 2-node cluster, seeds test data, verifies replication to standby, kills the primary, waits for promotion, then validates no data loss on new primary, forward replication from the old node, reverse replication from new primary, healthy subscriptions, active replication slots, and full-mesh replication. |
| Test Updates and Configuration | test/integration/nodes_test.go, test/integration/testdata/distributed-2node-2instance-values.yaml | Refactors TestNodesAddNode full-mesh verification to use the shared helper; adds origin_advanced_on_n3 subtest to validate origin progress is non-zero; removes duplicated local helpers; provides 2-node cluster testdata YAML. |

Poem

A rabbit bounds through replication streams,
Waiting on peers to catch up with dreams,
Origins nudge forward to match the line,
Tests dance through failover, all in time,
🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 46.43% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main changes: hardening the add-node workflow for data safety and adding a failover test. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the hardening changes and integration test additions referenced in the code. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@mmols changed the title from "feat: harden add-node workflow for data safety, test failover" to "feat: harden add-node workflow for data safety, add failover test" May 20, 2026

codacy-production Bot commented May 20, 2026


Up to standards ✅

🟢 Issues 2 medium

Results:
2 new issues

Category Results
Complexity 2 medium

View in Codacy

🟢 Metrics 108 complexity · 47 duplication

Metric Results
Complexity 108
Duplication 47

View in Codacy


coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/spock/replication_slot_advance.go`:
- Around line 76-78: Don't clear r.AdvancedToLSN on entry; instead ensure
r.AdvancedToLSN is set to the effective slot LSN on every successful exit
(including idempotent/no-op and retry paths) so the origin advance step sees the
moved slot. Concretely: remove or stop resetting r.AdvancedToLSN = "" at the
start of the reconciliation and, after determining the provider slot LSN (the
value you already read when checking commit_ts/current slot or after calling
pg_replication_slot_advance), assign that LSN into r.AdvancedToLSN before
returning from all success branches (including the paths where
pg_replication_slot_advance was a no-op or when ReplicationOriginAdvance failed
but the slot already moved). Apply the same change for the second similar block
(the code around the later 113-149 region) so both idempotent and retry exits
carry forward the effective LSN.
- Around line 113-126: The current code compares WAL positions using Go string
ordering (`if targetLSN <= currentLSN`) which is wrong; replace that in the
block after reading currentLSN by performing a pg_lsn-aware SQL comparison via
r.conn.QueryRow instead of lexicographic comparison: call QueryRow with a
statement like "SELECT $1::pg_lsn <= $2::pg_lsn" (bind targetLSN and
currentLSN), Scan the result into a bool (e.g., alreadyAtOrBeyond) and use that
bool to decide to log and return; keep the existing variables currentLSN and
targetLSN and the surrounding error handling.
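
To see why string ordering is wrong for LSNs, here is a small illustration. The review above proposes doing the comparison in SQL with pg_lsn; this sketch instead parses the textual `X/Y` hexadecimal form in Go, purely to show the failure mode. `parseLSN` and `lsnAtOrBefore` are hypothetical helpers, not functions from the PR.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLSN converts a PostgreSQL LSN in its textual "X/Y" form (two
// hexadecimal numbers of up to 8 digits each) into one 64-bit WAL position.
func parseLSN(s string) (uint64, error) {
	hi, lo, ok := strings.Cut(s, "/")
	if !ok {
		return 0, fmt.Errorf("invalid LSN %q: missing '/'", s)
	}
	h, err := strconv.ParseUint(hi, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	l, err := strconv.ParseUint(lo, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	return h<<32 | l, nil
}

// lsnAtOrBefore reports whether target <= current in WAL order.
func lsnAtOrBefore(target, current string) (bool, error) {
	t, err := parseLSN(target)
	if err != nil {
		return false, err
	}
	c, err := parseLSN(current)
	if err != nil {
		return false, err
	}
	return t <= c, nil
}

func main() {
	target, current := "0/A000000", "0/10000000"

	// Lexicographic order compares 'A' with '1' and gets it backwards.
	fmt.Println("string compare says target <= current:", target <= current)

	// Numeric order: 0xA000000 is smaller than 0x10000000.
	ok, err := lsnAtOrBefore(target, current)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("numeric compare says target <= current:", ok)
}
```

The two comparisons disagree whenever the hexadecimal parts differ in length, which is exactly when a slot has moved across a digit boundary.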

In `@internal/spock/spock_test.go`:
- Around line 529-535: The current assertion only checks d.Type ==
ResourceTypeReplicationOriginAdvance and risks matching the wrong edge; update
the loop in the test to assert both d.Type ==
ResourceTypeReplicationOriginAdvance and d.ID == "n2_n3" (use the existing
variable names foundAdvance and d.ID) so the test specifically verifies the
replication origin advance edge for the expected resource ID "n2_n3".
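
The difference between a type-only match and a type-plus-ID match can be shown with a small stand-alone sketch. The `dependency` struct, the constant's string value, and `hasDependency` are hypothetical stand-ins for the test's real types; only the resource ID "n2_n3" is taken from the comment above.

```go
package main

import "fmt"

// dependency is a hypothetical stand-in for a resource dependency edge.
type dependency struct {
	Type string
	ID   string
}

// The constant's value is made up for this example.
const resourceTypeReplicationOriginAdvance = "replication_origin_advance"

// hasDependencyType matches on type alone, as the original assertion did.
func hasDependencyType(deps []dependency, typ string) bool {
	for _, d := range deps {
		if d.Type == typ {
			return true
		}
	}
	return false
}

// hasDependency requires both the type and the ID to match.
func hasDependency(deps []dependency, typ, id string) bool {
	for _, d := range deps {
		if d.Type == typ && d.ID == id {
			return true
		}
	}
	return false
}

func main() {
	// The origin-advance edge present here belongs to a different pair.
	deps := []dependency{
		{Type: resourceTypeReplicationOriginAdvance, ID: "n1_n3"},
		{Type: "subscription", ID: "n2_n3"},
	}
	fmt.Println("type only:", hasDependencyType(deps, resourceTypeReplicationOriginAdvance))
	fmt.Println("type and ID:", hasDependency(deps, resourceTypeReplicationOriginAdvance, "n2_n3"))
}
```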

In `@test/integration/replication_helpers_test.go`:
- Around line 76-97: The shared ctx/cancel used by all subtests can expire
mid-run causing cascading flakiness; inside the t.Run closure create a fresh
per-subtest timeout context (e.g., subCtx, subCancel :=
context.WithTimeout(context.Background(), 60*time.Second)) and use that subCtx
when calling wait.Until and testKube.ExecSQL, then defer subCancel in the
subtest to avoid leaks; keep the existing t.Run, wait.Until, and
testKube.ExecSQL calls but replace references to the outer ctx with the new
per-subtest context.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a7489300-23ae-43bd-ad06-e5dffbf86ce7

📥 Commits

Reviewing files that changed from the base of the PR and between 1b331bb and 2a64c99.

📒 Files selected for processing (12)
  • internal/spock/desired.go
  • internal/spock/peer_catchup.go
  • internal/spock/replication_origin_advance.go
  • internal/spock/replication_slot_advance.go
  • internal/spock/spock_test.go
  • internal/spock/wait_for_sync_event.go
  • test/Makefile
  • test/integration/failover_helpers.go
  • test/integration/failover_test.go
  • test/integration/nodes_test.go
  • test/integration/replication_helpers_test.go
  • test/integration/testdata/distributed-2node-2instance-values.yaml

Comment thread internal/spock/replication_slot_advance.go
Comment thread internal/spock/replication_slot_advance.go
Comment thread internal/spock/spock_test.go Outdated
Comment thread test/integration/replication_helpers_test.go Outdated

tsivaprasad left a comment

Looks good and aligns well with the Control Plane and Spock 5.0.8 ZODAN workflow.

Verification:

make test-failover

/Library/Developer/CommandLineTools/usr/bin/make -C /Users/sivat/projects/pgedge-helm/pgedge-helm docker-build-dev
docker buildx bake dev
[+] Building 1.3s (15/15) FINISHED                                                              docker:desktop-linux
 => [internal] load local bake definitions                                                                      0.0s
 => => reading docker-bake.hcl 1.12kB / 1.12kB                                                                  0.0s
 => [internal] load build definition from Dockerfile                                                            0.0s
 => => transferring dockerfile: 514B                                                                            0.0s
 => [internal] load metadata for docker.io/library/golang:1.25                                                  1.0s
 => [internal] load .dockerignore                                                                               0.0s
 => => transferring context: 2B                                                                                 0.0s
 => [builder 1/7] FROM docker.io/library/golang:1.25@sha256:cd05a378aaf011e8056745363e5c40f4f2bef0fa4d9bf19b9c  0.0s
 => [internal] load build context                                                                               0.0s
 => => transferring context: 1.84kB                                                                             0.0s
 => CACHED [builder 2/7] WORKDIR /build                                                                         0.0s
 => CACHED [builder 3/7] COPY go.mod go.sum ./                                                                  0.0s
 => CACHED [builder 4/7] RUN go mod download                                                                    0.0s
 => CACHED [builder 5/7] COPY cmd/ cmd/                                                                         0.0s
 => CACHED [builder 6/7] COPY internal/ internal/                                                               0.0s
 => CACHED [builder 7/7] RUN CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -o /init-spock ./cmd/init-spock     0.0s
 => CACHED [stage-1 1/2] COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/                 0.0s
 => CACHED [stage-1 2/2] COPY --from=builder /init-spock /init-spock                                            0.0s
 => exporting to image                                                                                          0.0s
 => => exporting layers                                                                                         0.0s
 => => writing image sha256:719e1c984d56467df2ab7636028678b67a56faf2c606575af9dcfadce5bcf178                    0.0s
 => => naming to docker.io/library/pgedge-helm-utils:dev                                                        0.0s

View build details: docker-desktop://dashboard/build/desktop-linux/desktop-linux/lnbyyap8d6l8spmtox5h7nia6
kind load docker-image pgedge-helm-utils:dev --name pgedge-test
Image: "pgedge-helm-utils:dev" with ID "sha256:719e1c984d56467df2ab7636028678b67a56faf2c606575af9dcfadce5bcf178" found to be already present on all nodes.
cd /Users/sivat/projects/pgedge-helm/pgedge-helm && go test -tags integration -v -timeout 30m \
                -run "TestUnplannedFailover" ./test/integration/...
=== RUN   TestUnplannedFailover
    failover_test.go:61: standby pgedge-n1-2 has logical spk_* slot synced
    failover_test.go:61: standby pgedge-n2-2 has logical spk_* slot synced
    failover_test.go:64: force-deleted primary pod pgedge-n1-1
    failover_test.go:65: new primary elected: pgedge-n1-2
=== RUN   TestUnplannedFailover/no_data_loss
=== RUN   TestUnplannedFailover/forward_replication
=== RUN   TestUnplannedFailover/reverse_replication
=== RUN   TestUnplannedFailover/subscriptions_healthy
=== RUN   TestUnplannedFailover/subscriptions_healthy/n1
=== RUN   TestUnplannedFailover/subscriptions_healthy/n2
=== RUN   TestUnplannedFailover/slots_active
=== RUN   TestUnplannedFailover/slots_active/n1
=== RUN   TestUnplannedFailover/slots_active/n2
=== RUN   TestUnplannedFailover/full_mesh_reoplication
=== RUN   TestUnplannedFailover/full_mesh_reoplication/pgedge-n1-2_to_pgedge-n2-1
=== RUN   TestUnplannedFailover/full_mesh_reoplication/pgedge-n2-1_to_pgedge-n1-2
--- PASS: TestUnplannedFailover (217.13s)
    --- PASS: TestUnplannedFailover/no_data_loss (0.07s)
    --- PASS: TestUnplannedFailover/forward_replication (0.18s)
    --- PASS: TestUnplannedFailover/reverse_replication (4.18s)
    --- PASS: TestUnplannedFailover/subscriptions_healthy (0.18s)
        --- PASS: TestUnplannedFailover/subscriptions_healthy/n1 (0.09s)
        --- PASS: TestUnplannedFailover/subscriptions_healthy/n2 (0.08s)
    --- PASS: TestUnplannedFailover/slots_active (0.18s)
        --- PASS: TestUnplannedFailover/slots_active/n1 (0.09s)
        --- PASS: TestUnplannedFailover/slots_active/n2 (0.09s)
    --- PASS: TestUnplannedFailover/full_mesh_reoplication (0.35s)
        --- PASS: TestUnplannedFailover/full_mesh_reoplication/pgedge-n1-2_to_pgedge-n2-1 (0.08s)
        --- PASS: TestUnplannedFailover/full_mesh_reoplication/pgedge-n2-1_to_pgedge-n1-2 (0.09s)
PASS
ok      github.com/pgEdge/pgedge-helm/test/integration  217.632s

test git:(feat/PLAT-616/add-node-hardening) make test-nodes

/Library/Developer/CommandLineTools/usr/bin/make -C /Users/sivat/projects/pgedge-helm/pgedge-helm docker-build-dev
docker buildx bake dev
[+] Building 1.6s (15/15) FINISHED                                                              docker:desktop-linux
 => [internal] load local bake definitions                                                                      0.0s
 => => reading docker-bake.hcl 1.12kB / 1.12kB                                                                  0.0s
 => [internal] load build definition from Dockerfile                                                            0.0s
 => => transferring dockerfile: 514B                                                                            0.0s
 => [internal] load metadata for docker.io/library/golang:1.25                                                  1.3s
 => [internal] load .dockerignore                                                                               0.0s
 => => transferring context: 2B                                                                                 0.0s
 => [builder 1/7] FROM docker.io/library/golang:1.25@sha256:cd05a378aaf011e8056745363e5c40f4f2bef0fa4d9bf19b9c  0.0s
 => [internal] load build context                                                                               0.0s
 => => transferring context: 1.84kB                                                                             0.0s
 => CACHED [builder 2/7] WORKDIR /build                                                                         0.0s
 => CACHED [builder 3/7] COPY go.mod go.sum ./                                                                  0.0s
 => CACHED [builder 4/7] RUN go mod download                                                                    0.0s
 => CACHED [builder 5/7] COPY cmd/ cmd/                                                                         0.0s
 => CACHED [builder 6/7] COPY internal/ internal/                                                               0.0s
 => CACHED [builder 7/7] RUN CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -o /init-spock ./cmd/init-spock     0.0s
 => CACHED [stage-1 1/2] COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/                 0.0s
 => CACHED [stage-1 2/2] COPY --from=builder /init-spock /init-spock                                            0.0s
 => exporting to image                                                                                          0.0s
 => => exporting layers                                                                                         0.0s
 => => writing image sha256:719e1c984d56467df2ab7636028678b67a56faf2c606575af9dcfadce5bcf178                    0.0s
 => => naming to docker.io/library/pgedge-helm-utils:dev                                                        0.0s

View build details: docker-desktop://dashboard/build/desktop-linux/desktop-linux/t5g7v04tsflry15gtzjqh7q17
kind load docker-image pgedge-helm-utils:dev --name pgedge-test
Image: "pgedge-helm-utils:dev" with ID "sha256:719e1c984d56467df2ab7636028678b67a56faf2c606575af9dcfadce5bcf178" found to be already present on all nodes.
cd /Users/sivat/projects/pgedge-helm/pgedge-helm && go test -tags integration -v -timeout 30m \
                -run "TestNodes" ./test/integration/...
=== RUN   TestNodesAddNode
=== RUN   TestNodesAddNode/upgrade_rejects_new_node_without_bootstrap_mode
=== RUN   TestNodesAddNode/upgrade_rejects_rebootstrap_existing_node
=== RUN   TestNodesAddNode/n3_cluster_healthy
=== RUN   TestNodesAddNode/init_spock_succeeds_after_upgrade
=== RUN   TestNodesAddNode/n3_has_existing_data
=== RUN   TestNodesAddNode/full_mesh_replication
=== RUN   TestNodesAddNode/full_mesh_replication/pgedge-n1-1_to_pgedge-n2-1
=== RUN   TestNodesAddNode/full_mesh_replication/pgedge-n1-1_to_pgedge-n3-1
=== RUN   TestNodesAddNode/full_mesh_replication/pgedge-n2-1_to_pgedge-n1-1
=== RUN   TestNodesAddNode/full_mesh_replication/pgedge-n2-1_to_pgedge-n3-1
=== RUN   TestNodesAddNode/full_mesh_replication/pgedge-n3-1_to_pgedge-n1-1
=== RUN   TestNodesAddNode/full_mesh_replication/pgedge-n3-1_to_pgedge-n2-1
=== RUN   TestNodesAddNode/idempotent_rerun_on_3_nodes
--- PASS: TestNodesAddNode (66.96s)
    --- PASS: TestNodesAddNode/upgrade_rejects_new_node_without_bootstrap_mode (0.12s)
    --- PASS: TestNodesAddNode/upgrade_rejects_rebootstrap_existing_node (0.06s)
    --- PASS: TestNodesAddNode/n3_cluster_healthy (0.38s)
    --- PASS: TestNodesAddNode/init_spock_succeeds_after_upgrade (0.15s)
    --- PASS: TestNodesAddNode/n3_has_existing_data (0.14s)
    --- PASS: TestNodesAddNode/full_mesh_replication (0.81s)
        --- PASS: TestNodesAddNode/full_mesh_replication/pgedge-n1-1_to_pgedge-n2-1 (0.09s)
        --- PASS: TestNodesAddNode/full_mesh_replication/pgedge-n1-1_to_pgedge-n3-1 (0.09s)
        --- PASS: TestNodesAddNode/full_mesh_replication/pgedge-n2-1_to_pgedge-n1-1 (0.08s)
        --- PASS: TestNodesAddNode/full_mesh_replication/pgedge-n2-1_to_pgedge-n3-1 (0.10s)
        --- PASS: TestNodesAddNode/full_mesh_replication/pgedge-n3-1_to_pgedge-n1-1 (0.08s)
        --- PASS: TestNodesAddNode/full_mesh_replication/pgedge-n3-1_to_pgedge-n2-1 (0.09s)
    --- PASS: TestNodesAddNode/idempotent_rerun_on_3_nodes (4.22s)
=== RUN   TestNodesAddNodeZeroDowntime
=== RUN   TestNodesAddNodeZeroDowntime/n3_cluster_healthy
=== RUN   TestNodesAddNodeZeroDowntime/init_spock_succeeds
=== RUN   TestNodesAddNodeZeroDowntime/origin_advanced_on_n3
=== RUN   TestNodesAddNodeZeroDowntime/n3_has_all_data
=== RUN   TestNodesAddNodeZeroDowntime/n3_has_all_data/test_zdt_n1
=== RUN   TestNodesAddNodeZeroDowntime/n3_has_all_data/test_zdt_n2
=== RUN   TestNodesAddNodeZeroDowntime/n3_replicates_bidirectionally
=== RUN   TestNodesAddNodeZeroDowntime/n3_replicates_bidirectionally/pgedge-n1-1
=== RUN   TestNodesAddNodeZeroDowntime/n3_replicates_bidirectionally/pgedge-n2-1
=== RUN   TestNodesAddNodeZeroDowntime/full_mesh_established
=== RUN   TestNodesAddNodeZeroDowntime/full_mesh_established/pgedge-n1-1_subscriptions
=== RUN   TestNodesAddNodeZeroDowntime/full_mesh_established/pgedge-n2-1_subscriptions
=== RUN   TestNodesAddNodeZeroDowntime/full_mesh_established/pgedge-n3-1_subscriptions
--- PASS: TestNodesAddNodeZeroDowntime (68.31s)
    --- PASS: TestNodesAddNodeZeroDowntime/n3_cluster_healthy (0.14s)
    --- PASS: TestNodesAddNodeZeroDowntime/init_spock_succeeds (0.16s)
    --- PASS: TestNodesAddNodeZeroDowntime/origin_advanced_on_n3 (0.08s)
    --- PASS: TestNodesAddNodeZeroDowntime/n3_has_all_data (0.52s)
        --- PASS: TestNodesAddNodeZeroDowntime/n3_has_all_data/test_zdt_n1 (0.26s)
        --- PASS: TestNodesAddNodeZeroDowntime/n3_has_all_data/test_zdt_n2 (0.25s)
    --- PASS: TestNodesAddNodeZeroDowntime/n3_replicates_bidirectionally (1.14s)
        --- PASS: TestNodesAddNodeZeroDowntime/n3_replicates_bidirectionally/pgedge-n1-1 (0.09s)
        --- PASS: TestNodesAddNodeZeroDowntime/n3_replicates_bidirectionally/pgedge-n2-1 (0.08s)
    --- PASS: TestNodesAddNodeZeroDowntime/full_mesh_established (0.33s)
        --- PASS: TestNodesAddNodeZeroDowntime/full_mesh_established/pgedge-n1-1_subscriptions (0.10s)
        --- PASS: TestNodesAddNodeZeroDowntime/full_mesh_established/pgedge-n2-1_subscriptions (0.11s)
        --- PASS: TestNodesAddNodeZeroDowntime/full_mesh_established/pgedge-n3-1_subscriptions (0.13s)
=== RUN   TestNodesRemoveNode
=== RUN   TestNodesRemoveNode/remaining_clusters_healthy
=== RUN   TestNodesRemoveNode/init_spock_succeeds_after_removal
=== RUN   TestNodesRemoveNode/spock_node_n3_removed
=== RUN   TestNodesRemoveNode/spock_node_n3_removed/pgedge-n1-1
=== RUN   TestNodesRemoveNode/spock_node_n3_removed/pgedge-n2-1
=== RUN   TestNodesRemoveNode/subscriptions_to_n3_removed
=== RUN   TestNodesRemoveNode/subscriptions_to_n3_removed/pgedge-n1-1
=== RUN   TestNodesRemoveNode/subscriptions_to_n3_removed/pgedge-n2-1
=== RUN   TestNodesRemoveNode/replication_still_works
--- PASS: TestNodesRemoveNode (39.43s)
    --- PASS: TestNodesRemoveNode/remaining_clusters_healthy (0.31s)
    --- PASS: TestNodesRemoveNode/init_spock_succeeds_after_removal (0.15s)
    --- PASS: TestNodesRemoveNode/spock_node_n3_removed (0.20s)
        --- PASS: TestNodesRemoveNode/spock_node_n3_removed/pgedge-n1-1 (0.11s)
        --- PASS: TestNodesRemoveNode/spock_node_n3_removed/pgedge-n2-1 (0.09s)
    --- PASS: TestNodesRemoveNode/subscriptions_to_n3_removed (0.16s)
        --- PASS: TestNodesRemoveNode/subscriptions_to_n3_removed/pgedge-n1-1 (0.09s)
        --- PASS: TestNodesRemoveNode/subscriptions_to_n3_removed/pgedge-n2-1 (0.08s)
    --- PASS: TestNodesRemoveNode/replication_still_works (0.25s)
PASS
ok      github.com/pgEdge/pgedge-helm/test/integration  175.069s

test git:(feat/PLAT-616/add-node-hardening) make test-run RUN="TestUnplannedFailover|TestNodes"

/Library/Developer/CommandLineTools/usr/bin/make -C /Users/sivat/projects/pgedge-helm/pgedge-helm docker-build-dev
docker buildx bake dev
[+] Building 1.2s (15/15) FINISHED                                                              docker:desktop-linux
 => [internal] load local bake definitions                                                                      0.0s
 => => reading docker-bake.hcl 1.12kB / 1.12kB                                                                  0.0s
 => [internal] load build definition from Dockerfile                                                            0.0s
 => => transferring dockerfile: 514B                                                                            0.0s
 => [internal] load metadata for docker.io/library/golang:1.25                                                  1.0s
 => [internal] load .dockerignore                                                                               0.0s
 => => transferring context: 2B                                                                                 0.0s
 => [builder 1/7] FROM docker.io/library/golang:1.25@sha256:cd05a378aaf011e8056745363e5c40f4f2bef0fa4d9bf19b9c  0.0s
 => [internal] load build context                                                                               0.0s
 => => transferring context: 1.84kB                                                                             0.0s
 => CACHED [builder 2/7] WORKDIR /build                                                                         0.0s
 => CACHED [builder 3/7] COPY go.mod go.sum ./                                                                  0.0s
 => CACHED [builder 4/7] RUN go mod download                                                                    0.0s
 => CACHED [builder 5/7] COPY cmd/ cmd/                                                                         0.0s
 => CACHED [builder 6/7] COPY internal/ internal/                                                               0.0s
 => CACHED [builder 7/7] RUN CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -o /init-spock ./cmd/init-spock     0.0s
 => CACHED [stage-1 1/2] COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/                 0.0s
 => CACHED [stage-1 2/2] COPY --from=builder /init-spock /init-spock                                            0.0s
 => exporting to image                                                                                          0.0s
 => => exporting layers                                                                                         0.0s
 => => writing image sha256:719e1c984d56467df2ab7636028678b67a56faf2c606575af9dcfadce5bcf178                    0.0s
 => => naming to docker.io/library/pgedge-helm-utils:dev                                                        0.0s

View build details: docker-desktop://dashboard/build/desktop-linux/desktop-linux/ymrr9e909c9q02nase1ljlh2m
kind load docker-image pgedge-helm-utils:dev --name pgedge-test
Image: "pgedge-helm-utils:dev" with ID "sha256:719e1c984d56467df2ab7636028678b67a56faf2c606575af9dcfadce5bcf178" found to be already present on all nodes.
cd /Users/sivat/projects/pgedge-helm/pgedge-helm && go test -tags integration -v -timeout 30m \
                -run "TestUnplannedFailover|TestNodes" ./test/integration/...
=== RUN   TestUnplannedFailover
    failover_test.go:61: standby pgedge-n1-2 has logical spk_* slot synced
    failover_test.go:61: standby pgedge-n2-2 has logical spk_* slot synced
    failover_test.go:64: force-deleted primary pod pgedge-n1-1
    failover_test.go:65: new primary elected: pgedge-n1-2
=== RUN   TestUnplannedFailover/no_data_loss
=== RUN   TestUnplannedFailover/forward_replication
=== RUN   TestUnplannedFailover/reverse_replication
=== RUN   TestUnplannedFailover/subscriptions_healthy
=== RUN   TestUnplannedFailover/subscriptions_healthy/n1
=== RUN   TestUnplannedFailover/subscriptions_healthy/n2
=== RUN   TestUnplannedFailover/slots_active
=== RUN   TestUnplannedFailover/slots_active/n1
=== RUN   TestUnplannedFailover/slots_active/n2
=== RUN   TestUnplannedFailover/full_mesh_reoplication
=== RUN   TestUnplannedFailover/full_mesh_reoplication/pgedge-n1-2_to_pgedge-n2-1
=== RUN   TestUnplannedFailover/full_mesh_reoplication/pgedge-n2-1_to_pgedge-n1-2
--- PASS: TestUnplannedFailover (115.01s)
    --- PASS: TestUnplannedFailover/no_data_loss (0.09s)
    --- PASS: TestUnplannedFailover/forward_replication (0.17s)
    --- PASS: TestUnplannedFailover/reverse_replication (8.23s)
    --- PASS: TestUnplannedFailover/subscriptions_healthy (0.18s)
        --- PASS: TestUnplannedFailover/subscriptions_healthy/n1 (0.09s)
        --- PASS: TestUnplannedFailover/subscriptions_healthy/n2 (0.08s)
    --- PASS: TestUnplannedFailover/slots_active (0.16s)
        --- PASS: TestUnplannedFailover/slots_active/n1 (0.08s)
        --- PASS: TestUnplannedFailover/slots_active/n2 (0.08s)
    --- PASS: TestUnplannedFailover/full_mesh_reoplication (0.35s)
        --- PASS: TestUnplannedFailover/full_mesh_reoplication/pgedge-n1-2_to_pgedge-n2-1 (0.09s)
        --- PASS: TestUnplannedFailover/full_mesh_reoplication/pgedge-n2-1_to_pgedge-n1-2 (0.09s)
=== RUN   TestNodesAddNode
=== RUN   TestNodesAddNode/upgrade_rejects_new_node_without_bootstrap_mode
=== RUN   TestNodesAddNode/upgrade_rejects_rebootstrap_existing_node
=== RUN   TestNodesAddNode/n3_cluster_healthy
=== RUN   TestNodesAddNode/init_spock_succeeds_after_upgrade
=== RUN   TestNodesAddNode/n3_has_existing_data
=== RUN   TestNodesAddNode/full_mesh_replication
=== RUN   TestNodesAddNode/full_mesh_replication/pgedge-n3-1_to_pgedge-n1-1
=== RUN   TestNodesAddNode/full_mesh_replication/pgedge-n3-1_to_pgedge-n2-1
=== RUN   TestNodesAddNode/full_mesh_replication/pgedge-n1-1_to_pgedge-n2-1
=== RUN   TestNodesAddNode/full_mesh_replication/pgedge-n1-1_to_pgedge-n3-1
=== RUN   TestNodesAddNode/full_mesh_replication/pgedge-n2-1_to_pgedge-n1-1
=== RUN   TestNodesAddNode/full_mesh_replication/pgedge-n2-1_to_pgedge-n3-1
=== RUN   TestNodesAddNode/idempotent_rerun_on_3_nodes
--- PASS: TestNodesAddNode (70.09s)
    --- PASS: TestNodesAddNode/upgrade_rejects_new_node_without_bootstrap_mode (0.07s)
    --- PASS: TestNodesAddNode/upgrade_rejects_rebootstrap_existing_node (0.06s)
    --- PASS: TestNodesAddNode/n3_cluster_healthy (0.44s)
    --- PASS: TestNodesAddNode/init_spock_succeeds_after_upgrade (0.15s)
    --- PASS: TestNodesAddNode/n3_has_existing_data (0.12s)
    --- PASS: TestNodesAddNode/full_mesh_replication (0.83s)
        --- PASS: TestNodesAddNode/full_mesh_replication/pgedge-n3-1_to_pgedge-n1-1 (0.10s)
        --- PASS: TestNodesAddNode/full_mesh_replication/pgedge-n3-1_to_pgedge-n2-1 (0.09s)
        --- PASS: TestNodesAddNode/full_mesh_replication/pgedge-n1-1_to_pgedge-n2-1 (0.10s)
        --- PASS: TestNodesAddNode/full_mesh_replication/pgedge-n1-1_to_pgedge-n3-1 (0.09s)
        --- PASS: TestNodesAddNode/full_mesh_replication/pgedge-n2-1_to_pgedge-n1-1 (0.08s)
        --- PASS: TestNodesAddNode/full_mesh_replication/pgedge-n2-1_to_pgedge-n3-1 (0.09s)
    --- PASS: TestNodesAddNode/idempotent_rerun_on_3_nodes (4.14s)
=== RUN   TestNodesAddNodeZeroDowntime
=== RUN   TestNodesAddNodeZeroDowntime/n3_cluster_healthy
=== RUN   TestNodesAddNodeZeroDowntime/init_spock_succeeds
=== RUN   TestNodesAddNodeZeroDowntime/origin_advanced_on_n3
=== RUN   TestNodesAddNodeZeroDowntime/n3_has_all_data
=== RUN   TestNodesAddNodeZeroDowntime/n3_has_all_data/test_zdt_n1
=== RUN   TestNodesAddNodeZeroDowntime/n3_has_all_data/test_zdt_n2
=== RUN   TestNodesAddNodeZeroDowntime/n3_replicates_bidirectionally
=== RUN   TestNodesAddNodeZeroDowntime/n3_replicates_bidirectionally/pgedge-n1-1
=== RUN   TestNodesAddNodeZeroDowntime/n3_replicates_bidirectionally/pgedge-n2-1
=== RUN   TestNodesAddNodeZeroDowntime/full_mesh_established
=== RUN   TestNodesAddNodeZeroDowntime/full_mesh_established/pgedge-n1-1_subscriptions
=== RUN   TestNodesAddNodeZeroDowntime/full_mesh_established/pgedge-n2-1_subscriptions
=== RUN   TestNodesAddNodeZeroDowntime/full_mesh_established/pgedge-n3-1_subscriptions
--- PASS: TestNodesAddNodeZeroDowntime (71.41s)
    --- PASS: TestNodesAddNodeZeroDowntime/n3_cluster_healthy (0.16s)
    --- PASS: TestNodesAddNodeZeroDowntime/init_spock_succeeds (0.18s)
    --- PASS: TestNodesAddNodeZeroDowntime/origin_advanced_on_n3 (0.08s)
    --- PASS: TestNodesAddNodeZeroDowntime/n3_has_all_data (0.50s)
        --- PASS: TestNodesAddNodeZeroDowntime/n3_has_all_data/test_zdt_n1 (0.26s)
        --- PASS: TestNodesAddNodeZeroDowntime/n3_has_all_data/test_zdt_n2 (0.24s)
    --- PASS: TestNodesAddNodeZeroDowntime/n3_replicates_bidirectionally (1.14s)
        --- PASS: TestNodesAddNodeZeroDowntime/n3_replicates_bidirectionally/pgedge-n1-1 (0.09s)
        --- PASS: TestNodesAddNodeZeroDowntime/n3_replicates_bidirectionally/pgedge-n2-1 (0.09s)
    --- PASS: TestNodesAddNodeZeroDowntime/full_mesh_established (0.26s)
        --- PASS: TestNodesAddNodeZeroDowntime/full_mesh_established/pgedge-n1-1_subscriptions (0.08s)
        --- PASS: TestNodesAddNodeZeroDowntime/full_mesh_established/pgedge-n2-1_subscriptions (0.09s)
        --- PASS: TestNodesAddNodeZeroDowntime/full_mesh_established/pgedge-n3-1_subscriptions (0.09s)
=== RUN   TestNodesRemoveNode
=== RUN   TestNodesRemoveNode/remaining_clusters_healthy
=== RUN   TestNodesRemoveNode/init_spock_succeeds_after_removal
=== RUN   TestNodesRemoveNode/spock_node_n3_removed
=== RUN   TestNodesRemoveNode/spock_node_n3_removed/pgedge-n1-1
=== RUN   TestNodesRemoveNode/spock_node_n3_removed/pgedge-n2-1
=== RUN   TestNodesRemoveNode/subscriptions_to_n3_removed
=== RUN   TestNodesRemoveNode/subscriptions_to_n3_removed/pgedge-n1-1
=== RUN   TestNodesRemoveNode/subscriptions_to_n3_removed/pgedge-n2-1
=== RUN   TestNodesRemoveNode/replication_still_works
--- PASS: TestNodesRemoveNode (41.38s)
    --- PASS: TestNodesRemoveNode/remaining_clusters_healthy (0.33s)
    --- PASS: TestNodesRemoveNode/init_spock_succeeds_after_removal (0.14s)
    --- PASS: TestNodesRemoveNode/spock_node_n3_removed (0.19s)
        --- PASS: TestNodesRemoveNode/spock_node_n3_removed/pgedge-n1-1 (0.10s)
        --- PASS: TestNodesRemoveNode/spock_node_n3_removed/pgedge-n2-1 (0.09s)
    --- PASS: TestNodesRemoveNode/subscriptions_to_n3_removed (0.17s)
        --- PASS: TestNodesRemoveNode/subscriptions_to_n3_removed/pgedge-n1-1 (0.08s)
        --- PASS: TestNodesRemoveNode/subscriptions_to_n3_removed/pgedge-n2-1 (0.09s)
    --- PASS: TestNodesRemoveNode/replication_still_works (0.27s)
PASS
ok      github.com/pgEdge/pgedge-helm/test/integration  298.613s

@mmols mmols merged commit d414ea7 into main May 21, 2026
5 checks passed
@mmols mmols deleted the feat/PLAT-616/add-node-hardening branch May 21, 2026 17:04
