Skip to content

Conversation

willie-yao
Copy link
Contributor

What type of PR is this?
/kind flake

What this PR does / why we need it:
This PR fixes a bug with pdb error: Message: Cordoned nodes have used all surge nodes but there are more nodes to be upgraded. Please fix your PDBs blocking node drain and kindly retry upgrade operation. It also fixes the delete flow where the aks management cluster wasn't being deleted when the test fails, only when it succeeds.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. labels Sep 29, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 29, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign nojnhuh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 29, 2025
Copy link

codecov bot commented Sep 29, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 46.93%. Comparing base (ed22c64) to head (cb473b3).
⚠️ Report is 20 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5883      +/-   ##
==========================================
- Coverage   46.94%   46.93%   -0.01%     
==========================================
  Files         279      279              
  Lines       29687    29688       +1     
==========================================
- Hits        13936    13935       -1     
- Misses      14938    14940       +2     
  Partials      813      813              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

e2e.mk Outdated
-e2e.skip-resource-cleanup=$(SKIP_CLEANUP) -e2e.use-existing-cluster=$(SKIP_CREATE_MGMT_CLUSTER) $(E2E_ARGS)

.PHONY: test-e2e-cleanup
test-e2e-cleanup: ## Clean up e2e test resources.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are running this in the context of set -e (exit on any error) why are we also defensively doing || true for every step?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops good catch! I think that the original idea is that since we are in the deletion phase, we dont want any resources left over even if a failure is encountered here, but I think it's better to remove it.

@willie-yao willie-yao force-pushed the test-pdb branch 2 times, most recently from 2670b8e to 7408512 Compare September 29, 2025 20:46
e2e.mk Outdated
fi

.PHONY: test-e2e-run
test-e2e-run: ## Run e2e tests.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the main changes here appear to be

  • don't exit the whole thing if one step in the actual E2E run fails
  • do cleanup as a separate step where we do exit if any step in the cleanup fails

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup exactly. This helps cleanup any aks management clusters that are left over. Without this new flow, if the test fails the aks management cluster is not cleaned up

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intuition would be to keep the "don't exit on any one single error" context during cleanup as well. But maybe I'm missing something? Why do we want to be tolerant of errors during E2E but not during cleanup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I'm confused because I also agree that we should not exit on any one single error during cleanup. I just changed it because I thought your comment was suggesting otherwise: #5883 (comment)


capz::ci-e2e::cleanup() {
"${REPO_ROOT}/hack/log/redact.sh" || true
make cleanup-workload-identity || true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we've moved these out I would (1) make a new test-e2e-run-cleanup make target (2) move all of the cleanup tasks from e2e.mk test-e2e-run into it and (3) add a $(MAKE) test-e2e-run-cleanup command at the end of each of the following jobs that invoke test-e2e-run:

  • test-e2e-skip-push
  • test-e2e-skip-build-and-push
  • test-e2e-custom-image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented this suggestion here! lmk what you think a701ca0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, though I still think we need to update scripts/ci-e2e.sh so that the new make test-e2e-run-cleanup target is run during the capz::ci-e2e::cleanup function that we trap on exit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, though I still think we need to update scripts/ci-e2e.sh so that the new make test-e2e-run-cleanup target is run during the capz::ci-e2e::cleanup function that we trap on exit

This resulted in the cleanup being invoked twice, so I think we should just keep it out of ci-e2e.sh?

@willie-yao
Copy link
Contributor Author

/retest

@jackfrancis
Copy link
Contributor

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Oct 3, 2025
@willie-yao willie-yao force-pushed the test-pdb branch 2 times, most recently from 295204f to c08a2d7 Compare October 9, 2025 20:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/flake Categorizes issue or PR as related to a flaky test. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

3 participants