Fix pdb flake on aks mgmt cluster create and delete flow #5883

willie-yao · 2025-09-29T01:14:34Z

What type of PR is this?
/kind flake

What this PR does / why we need it:
This PR fixes a flake caused by the following PDB error: "Message: Cordoned nodes have used all surge nodes but there are more nodes to be upgraded. Please fix your PDBs blocking node drain and kindly retry upgrade operation." It also fixes the delete flow: previously the AKS management cluster was deleted only when the test succeeded, not when it failed.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

TODOs:

squashed commits
includes documentation
adds unit tests
cherry-pick candidate

Release note:

NONE

k8s-ci-robot · 2025-09-29T01:14:43Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign nojnhuh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov · 2025-09-29T01:23:24Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 46.93%. Comparing base (ed22c64) to head (cb473b3).
⚠️ Report is 20 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5883      +/-   ##
==========================================
- Coverage   46.94%   46.93%   -0.01%     
==========================================
  Files         279      279              
  Lines       29687    29688       +1     
==========================================
- Hits        13936    13935       -1     
- Misses      14938    14940       +2     
  Partials      813      813


jackfrancis · 2025-09-29T19:35:33Z

e2e.mk

+		-e2e.skip-resource-cleanup=$(SKIP_CLEANUP) -e2e.use-existing-cluster=$(SKIP_CREATE_MGMT_CLUSTER) $(E2E_ARGS)
+
+.PHONY: test-e2e-cleanup
+test-e2e-cleanup: ## Clean up e2e test resources.


Since we are running this in the context of set -e (exit on any error), why are we also defensively adding || true to every step?

Oops, good catch! I think the original idea was that, since we are in the deletion phase, we don't want any resources left over even if a failure is encountered here, but I think it's better to remove it.
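
For readers following this thread, here is an illustrative shell sketch of the trade-off being discussed. It is not the actual e2e.mk recipe: the step names are hypothetical placeholders, and `false` stands in for a cleanup command that fails. Each variant runs in its own child `sh -e` process so that the exit-on-error behavior is isolated.

```shell
# Strict: under `set -e` (here `sh -e`), the first failing step aborts the rest.
strict_cleanup() {
  sh -ec '
    echo "delete-a"
    false              # hypothetical failing cleanup step
    echo "delete-c"    # never reached
  '
}

# Tolerant: `|| true` swallows each failure, so every step still runs.
tolerant_cleanup() {
  sh -ec '
    echo "delete-a"  || true
    false            || true
    echo "delete-c"  || true
  '
}
```

In the strict variant a single failed deletion can leave later resources behind, while the tolerant variant keeps going but hides the failure from the caller.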

jackfrancis · 2025-09-29T21:11:03Z

e2e.mk

 	fi

+.PHONY: test-e2e-run
+test-e2e-run: ## Run e2e tests.


So the main changes here appear to be

don't exit the whole thing if one step in the actual E2E run fails

do cleanup as a separate step where we do exit if any step in the cleanup fails

Yup, exactly. This helps clean up any AKS management clusters that are left over. Without this new flow, if the test fails, the AKS management cluster is not cleaned up.

My intuition would be to keep the "don't exit on any one single error" context during cleanup as well. But maybe I'm missing something? Why do we want to be tolerant of errors during E2E but not during cleanup?

Oh I'm confused because I also agree that we should not exit on any one single error during cleanup. I just changed it because I thought your comment was suggesting otherwise: #5883 (comment)
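
The flow both sides seem to agree on (tolerate a test failure long enough to clean up, tolerate individual cleanup errors, and still report the test result) could be sketched in shell roughly as follows. The function names and the `E2E_EXIT_CODE` variable are hypothetical stand-ins for illustration, not the real make targets.

```shell
# Hypothetical stand-ins for the real test run and cleanup steps.
run_e2e() { echo "running e2e"; return "${E2E_EXIT_CODE:-0}"; }
cleanup() { echo "cleaning up"; }

run_then_cleanup() {
  e2e_rc=0
  run_e2e || e2e_rc=$?                        # record the failure instead of exiting
  cleanup || echo "cleanup failed (ignored)"  # cleanup runs whether or not tests passed
  return "$e2e_rc"                            # the test result is still reported
}
```

For example, `E2E_EXIT_CODE=3 run_then_cleanup` still prints "cleaning up" and then returns 3.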

jackfrancis · 2025-09-30T22:41:19Z

scripts/ci-e2e.sh


 capz::ci-e2e::cleanup() {
    "${REPO_ROOT}/hack/log/redact.sh" || true
+    make cleanup-workload-identity || true


Now that we've moved these out I would (1) make a new test-e2e-run-cleanup make target (2) move all of the cleanup tasks from e2e.mk test-e2e-run into it and (3) add a $(MAKE) test-e2e-run-cleanup command at the end of each of the following jobs that invoke test-e2e-run:

test-e2e-skip-push

test-e2e-skip-build-and-push

test-e2e-custom-image

I implemented this suggestion here! Let me know what you think: a701ca0

yeah, though I still think we need to update scripts/ci-e2e.sh so that the new make test-e2e-run-cleanup target is run during the capz::ci-e2e::cleanup function that we trap on exit

This resulted in the cleanup being invoked twice, so I think we should just keep it out of ci-e2e.sh?
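
A reduced sketch of how the double invocation can arise, assuming the cleanup is called explicitly at the end of the run and is also registered in an EXIT trap. The function body is a hypothetical placeholder, not the contents of scripts/ci-e2e.sh.

```shell
# Hypothetical reduction: cleanup is called explicitly and again by the EXIT trap.
double_cleanup_demo() {
  sh -c '
    cleanup() { echo "cleanup"; }
    trap cleanup EXIT
    echo "tests"
    cleanup        # explicit call, as at the end of the test run
  '                # the EXIT trap fires when the child shell exits
}
```

Running `double_cleanup_demo` prints "cleanup" twice, which matches the behavior described in the comment above.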

willie-yao · 2025-10-02T21:11:35Z

/retest

jackfrancis · 2025-10-03T19:03:22Z

/label tide/merge-method-squash

github-project-automation bot added this to CAPZ Planning Sep 29, 2025

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. labels Sep 29, 2025

github-project-automation bot moved this to Todo in CAPZ Planning Sep 29, 2025

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 29, 2025

k8s-ci-robot requested review from marosset and nojnhuh September 29, 2025 01:14

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 29, 2025

willie-yao force-pushed the test-pdb branch from 51ded6f to d631485 Compare September 29, 2025 01:15

jackfrancis reviewed Sep 29, 2025

willie-yao force-pushed the test-pdb branch 2 times, most recently from 2670b8e to 7408512 Compare September 29, 2025 20:46

jackfrancis reviewed Sep 29, 2025

willie-yao force-pushed the test-pdb branch from ce0dd77 to 7bbb05e Compare September 30, 2025 19:59

jackfrancis reviewed Sep 30, 2025

willie-yao force-pushed the test-pdb branch from 7b17e13 to 8b2c84f Compare October 1, 2025 22:23

Fix pdb flake and delete flow for aks mgmt cluster

62684b0

willie-yao force-pushed the test-pdb branch from b15445f to 62684b0 Compare October 3, 2025 16:27

fix syntax

86014f8

k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Oct 3, 2025

willie-yao force-pushed the test-pdb branch from 995b764 to 68d7673 Compare October 3, 2025 22:39

Remove cleanup step

c08a2d7

willie-yao force-pushed the test-pdb branch 2 times, most recently from 295204f to c08a2d7 Compare October 9, 2025 20:59

fix 2

cb473b3
