-
Notifications
You must be signed in to change notification settings - Fork 459
Fix pdb flake on aks mgmt cluster create and delete flow #5883
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
51ded6f
to
d631485
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #5883 +/- ##
==========================================
- Coverage 46.94% 46.93% -0.01%
==========================================
Files 279 279
Lines 29687 29688 +1
==========================================
- Hits 13936 13935 -1
- Misses 14938 14940 +2
Partials 813 813 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
e2e.mk
Outdated
-e2e.skip-resource-cleanup=$(SKIP_CLEANUP) -e2e.use-existing-cluster=$(SKIP_CREATE_MGMT_CLUSTER) $(E2E_ARGS) | ||
|
||
.PHONY: test-e2e-cleanup | ||
test-e2e-cleanup: ## Clean up e2e test resources. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we are running this in the context of set -e
(exit on any error) why are we also defensively doing || true
for every step?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops good catch! I think that the original idea is that since we are in the deletion phase, we dont want any resources left over even if a failure is encountered here, but I think it's better to remove it.
2670b8e
to
7408512
Compare
e2e.mk
Outdated
fi | ||
|
||
.PHONY: test-e2e-run | ||
test-e2e-run: ## Run e2e tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the main changes here appear to be
- don't exit the whole thing if one step in the actual E2E run fails
- do cleanup as a separate step where we do exit if any step in the cleanup fails
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup exactly. This helps cleanup any aks management clusters that are left over. Without this new flow, if the test fails the aks management cluster is not cleaned up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intuition would be to keep the "don't exit on any one single error" context during cleanup as well. But maybe I'm missing something? Why do we want to be tolerant of errors during E2E but not during cleanup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I'm confused because I also agree that we should not exit on any one single error during cleanup. I just changed it because I thought your comment was suggesting otherwise: #5883 (comment)
ce0dd77
to
7bbb05e
Compare
scripts/ci-e2e.sh
Outdated
|
||
capz::ci-e2e::cleanup() { | ||
"${REPO_ROOT}/hack/log/redact.sh" || true | ||
make cleanup-workload-identity || true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we've moved these out I would (1) make a new test-e2e-run-cleanup
make target (2) move all of the cleanup tasks from e2e.mk
test-e2e-run
into it and (3) add a $(MAKE) test-e2e-run-cleanup
command at the end of each of the following jobs that invoke test-e2e-run
:
test-e2e-skip-push
test-e2e-skip-build-and-push
test-e2e-custom-image
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented this suggestion here! lmk what you think a701ca0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, though I still think we need to update scripts/ci-e2e.sh
so that the new make test-e2e-run-cleanup
target is run during the capz::ci-e2e::cleanup
function that we trap on exit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, though I still think we need to update
scripts/ci-e2e.sh
so that the newmake test-e2e-run-cleanup
target is run during thecapz::ci-e2e::cleanup
function that we trap on exit
This resulted in the cleanup being invoked twice, so I think we should just keep it out of ci-e2e.sh?
/retest |
/label tide/merge-method-squash |
295204f
to
c08a2d7
Compare
What type of PR is this?
/kind flake
What this PR does / why we need it:
This PR fixes a bug with pdb error:
Message: Cordoned nodes have used all surge nodes but there are more nodes to be upgraded. Please fix your PDBs blocking node drain and kindly retry upgrade operation.
It also fixes the delete flow where the aks management cluster wasn't being deleted when the test fails, only when it succeeds.Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #
Special notes for your reviewer:
TODOs:
Release note: