test: retry RBD pool disable on transient mirror health flap #762

Open: UtkarshBhatthere wants to merge 1 commit into main from fix/rbd-disable-mirror-health-race

@UtkarshBhatthere

Contributor

The "Disable RBD mirror" step runs remote_disable_rbd_mirroring right
after remote_failover_to_siteb. Failover promotes the secondary and
triggers a resync, during which pool mirror health transiently flaps to
WARNING. The daemon re-validates pool health on every non-forced
pool-level disable (replication_rbd.go: "pool replication status not
OK"), so a disable issued during that window exits 1 and fails the step.

PR #696 added a single up-front remote_wait_for_rbd_mirror_health call,
but that is a TOCTOU check: health can flap back after the wait, and each
disable perturbs health, so the one-shot wait cannot cover the
per-operation re-validation.
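
The gap described above can be reproduced with a small simulation. This is only a sketch: the health state lives in a temp file, and `wait_for_health`, `disable_pool`, and the pool names are hypothetical stand-ins, not the real test helpers or the daemon. It assumes, as stated above, that each disable perturbs health.

```shell
#!/usr/bin/env bash
# Simulation only: none of these helpers are the real test functions.
state=$(mktemp)
echo OK > "$state"

# Stand-in for the one-shot, up-front health wait.
wait_for_health() {
    until [[ $(cat "$state") == OK ]]; do sleep 1; done
}

# Stand-in for a non-forced pool disable.
disable_pool() {
    # Health is re-validated here, at operation time, not at wait time.
    if [[ $(cat "$state") != OK ]]; then
        echo "pool replication status not OK" >&2
        return 1
    fi
    echo "disabled $1"
    # Assumption from the description: a disable perturbs health.
    echo WARNING > "$state"
}

wait_for_health
first=$(disable_pool pool_one 2>&1) && first_rc=0 || first_rc=$?
second=$(disable_pool pool_two 2>&1) && second_rc=0 || second_rc=$?

echo "first:  rc=${first_rc} ${first}"
echo "second: rc=${second_rc} ${second}"
rm -f "$state"
```

The wait passes once, the first disable succeeds, and the second is rejected because the check it depends on is repeated per operation.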

Add rbd_disable_retry_transient_health, which retries a pool disable only
while it is rejected with "status not OK", and route the three
pool-level disables through it. It returns immediately on success or on
any other outcome, so the negative test still observes the expected "in
Image mirroring mode" guard. Image-level disables are not health-gated
and are left unchanged.
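
The retry behaviour described above could look roughly like the following. This is a minimal sketch, not the helper added in this PR: the name `retry_on_transient_health`, its argument order, and the attempt and delay parameters are assumptions for illustration. Only the matched text "status not OK" comes from the description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the real rbd_disable_retry_transient_health may differ.
# Usage: retry_on_transient_health <max_attempts> <delay_seconds> <command...>
retry_on_transient_health() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt output rc
    for ((attempt = 1; ; attempt++)); do
        # Capture stdout and stderr together so the rejection text can be matched.
        if output=$("$@" 2>&1); then rc=0; else rc=$?; fi

        # Return on success, on any failure other than the health rejection,
        # or once the attempts are used up.
        if (( rc == 0 )) || [[ $output != *"status not OK"* ]] ||
           (( attempt >= max_attempts )); then
            printf '%s\n' "$output"
            return "$rc"
        fi

        sleep "$delay"
    done
}
```

A call would wrap the existing disable command, for example `retry_on_transient_health 10 5 some_disable_command args`, where the command and the numbers are placeholders. Because any output that does not contain the health rejection is returned on the first attempt, a negative test expecting a different error still sees it unchanged.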

Assisted-by: hermes:claude-opus-4.8

UtkarshBhatthere force-pushed the fix/rbd-disable-mirror-health-race branch from ae0297d to 90ae630 on June 11, 2026 09:59
Signed-off-by: Utkarsh Bhatt <utkarsh_bhatt@outlook.com>
UtkarshBhatthere force-pushed the fix/rbd-disable-mirror-health-race branch from 90ae630 to f491622 on June 11, 2026 10:20