OCPNODE-3973: Default CPU/Memory changes to Workers - AutoSizingReserved #5491
Skipping CI for Draft Pull Request. |
|
/payload-aggregate-with-prs periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-upgrade-fips 10 openshift/cluster-api#254 |
|
/test all |
|
@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/528debb0-d755-11f0-826b-d2c4b80a4f0a-0 |
|
/test ? |
|
@neisw: The following commands are available to trigger required jobs: The following commands are available to trigger optional jobs: Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
10a91b0 to
9dee7b6
Compare
|
@ngopalak-redhat: This pull request references OCPNODE-3973 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "4.22." or "openshift-4.22.", but it targets "openshift-4.21" instead. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/hold |
|
/payload-aggregate-with-prs periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-upgrade-fips 10 openshift/cluster-api#254 |
|
@ngopalak-redhat: given command is invalid: at least one of the commands given is only supported on a one-command-per-comment basis, please separate out commands as multiple comments |
|
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-upgrade-fips 10 |
|
@neisw: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5c2e6180-d77e-11f0-93f7-7e9662b42cf1-0 |
9dee7b6 to
e7f7854
Compare
|
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-upgrade-fips 10 |
|
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-upgrade-fips 10 |
|
@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/20904d90-db2b-11f0-82ea-8927f4c1e39c-0 |
|
/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2 |
|
@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4a5f7820-db2c-11f0-8429-c3e9fc589c76-0 |
e7f7854 to
a844964
Compare
|
/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2 periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-2of2 periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-1of3 periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-2of3 periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-3of3 |
|
@ngopalak-redhat: trigger 5 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e62148d0-db6b-11f0-9fb9-c944a402697a-0 |
|
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-upgrade-fips 10 |
|
@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f5633ab0-db6b-11f0-9e13-27f86c4e078c-0 |
|
/payload-job periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-2of3 periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-3of3 |
|
@ngopalak-redhat: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ae123040-dbb0-11f0-92b8-61ac30a584ff-0 |
… nodes
OCPNODE-3719: Default Enablement of Auto Sizing Reserved in OpenShift 4.21
- Enable AutoSizingReserved by default for worker nodes
- Disable AutoSizingReserved for master/control-plane nodes
- Disable AutoSizingReserved for arbiter nodes
- Disable AutoSizingReserved for Hypershift clusters
- Add corresponding tests for the new behavior
This combines changes from PR openshift#5390.
a844964 to
e97d1a0
Compare
|
/retest-required |
|
/hold cancel |
|
/jira refresh |
|
@ngopalak-redhat: This pull request references OCPNODE-3973 which is a valid jira issue. |
|
@ngopalak-redhat: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/verified by ci |
|
@ngopalak-redhat: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@haircommander Please review |
|
/lgtm until TRT acks according to the release cycle requirements |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: haircommander, ngopalak-redhat The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
- What I did
This PR re-introduces the functionality from PR #5390 with a safer scope.
The previous attempt to enable AutoSizingReserved was reverted in PR #5489 following OCPBUGS-66420. The investigation revealed that applying these reservations to control plane nodes caused memory starvation during upgrades (available memory dropped by ~2Gi), leading to cascading failures in etcd and the API server.
To address this while still delivering the feature for 4.21, this PR applies the following changes:
Re-enable AutoSizingReserved (Scoped to Workers)
The logic from PR #5390 (OCPNODE-3719: Default Enablement of Auto Sizing Reserved in OpenShift 4.21) is brought back but strictly scoped to worker nodes.
Control Plane nodes and Hypershift are excluded to prevent the instability observed in the incident.
This ensures worker nodes benefit from dynamic resource sizing without risking critical control plane components during upgrades.
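The scoping described above can be sketched as a small decision function. This is only an illustration of the rule, not the operator's code: `NodeContext`, `AUTO_SIZING_ROLES`, and `auto_sizing_default` are hypothetical names, and how custom pools are treated is not stated in this PR, so the sketch simply leaves them disabled.

```python
from dataclasses import dataclass

# Hypothetical constant: only the worker role gets the new default.
AUTO_SIZING_ROLES = frozenset({"worker"})


@dataclass(frozen=True)
class NodeContext:
    """Hypothetical description of a node for this sketch."""
    role: str                 # e.g. "worker", "master", "arbiter"
    hypershift: bool = False  # True for Hypershift clusters


def auto_sizing_default(ctx: NodeContext) -> bool:
    """Return the assumed default AutoSizingReserved value for a node."""
    # Hypershift is excluded regardless of role.
    if ctx.hypershift:
        return False
    # Control plane, arbiter, and any other role stay disabled.
    return ctx.role in AUTO_SIZING_ROLES


if __name__ == "__main__":
    for role in ("worker", "master", "arbiter"):
        print(role, auto_sizing_default(NodeContext(role)))
```

Keeping the enabled roles in an allow-list means a new or unknown role defaults to disabled, which matches the cautious intent of this PR.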
External References
GKE: Other providers already use similar dynamic node sizing logic; see Plan Node Sizes (GKE).
Existing Capability: OpenShift has previously released this feature in a non-default mode: Red Hat Solution 5843241.
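To give a feel for what dynamic sizing reserves, here is a minimal sketch of the tiered memory reservation described in the GKE node sizing documentation. It is an illustration only: the function and constant names are hypothetical, and the calculation OpenShift actually applies is not spelled out in this PR and may differ.

```python
# (tier size in GiB, fraction reserved), following the tiers in the GKE docs:
# 25% of the first 4 GiB, 20% of the next 4 GiB, 10% of the next 8 GiB,
# 6% of the next 112 GiB, and 2% of anything above 128 GiB.
GKE_MEMORY_TIERS = [
    (4, 0.25),
    (4, 0.20),
    (8, 0.10),
    (112, 0.06),
    (float("inf"), 0.02),
]


def gke_style_reserved_memory_gib(total_gib: float) -> float:
    """Return the memory (GiB) reserved under the GKE-style tiers."""
    if total_gib < 1:
        # GKE documents a flat 255 MiB for machines with under 1 GiB.
        return 255 / 1024
    remaining = total_gib
    reserved = 0.0
    for tier_size, fraction in GKE_MEMORY_TIERS:
        portion = min(remaining, tier_size)
        reserved += portion * fraction
        remaining -= portion
        if remaining <= 0:
            break
    return reserved


if __name__ == "__main__":
    for size in (4, 8, 16, 64):
        print(f"{size} GiB node -> {gke_style_reserved_memory_gib(size):.2f} GiB reserved")
```

Under these tiers a 16 GiB node reserves about 2.6 GiB, which shows how a reservation on the order of a couple of GiB can matter on memory-constrained nodes.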
Additional Testing
To ensure stability, we are specifically re-running the test suites that triggered the original revert (upgrade and conformance tests).
We will verify that excluding the control plane nodes resolves the API server and etcd starvation issues observed in the previous attempt.
Impact & User Action
Opt-Out: We do not plan to roll back this change if concerns arise. Instead, customers who need to disable this behavior should use a KubeletConfig to opt out.
Documentation: A blog post and official documentation will be published to explain these changes, allowing customers to adjust workloads or opt-out as needed.
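As a rough illustration of the opt-out, the sketch below builds a KubeletConfig manifest and prints it as JSON, which `oc apply -f -` accepts. The PR does not show the exact opt-out manifest, so this assumes that setting the existing `autoSizingReserved` field to `false` for the worker pool is the intended mechanism; the object name `disable-auto-sizing` and the helper function are hypothetical.

```python
import json


def opt_out_kubelet_config(name: str = "disable-auto-sizing", pool: str = "worker") -> dict:
    """Build a KubeletConfig that turns auto sizing off for one pool.

    Assumption: autoSizingReserved=false is the opt-out for the new default.
    """
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "KubeletConfig",
        "metadata": {"name": name},
        "spec": {
            "autoSizingReserved": False,
            # Select the machine config pool by its standard pool label.
            "machineConfigPoolSelector": {
                "matchLabels": {
                    f"pools.operator.machineconfiguration.openshift.io/{pool}": ""
                }
            },
        },
    }


if __name__ == "__main__":
    # Pipe the output to `oc apply -f -` to create the object.
    print(json.dumps(opt_out_kubelet_config(), indent=2))
```

Applying a KubeletConfig triggers a rollout on the selected pool, so the change is best made during a maintenance window.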
Reference Issues
Reverts: #5489
Original PR: #5390
Incident: OCPBUGS-66420
- How to verify it
SSH into the control plane and worker nodes and check the /etc/node-sizing-enabled.env file.
- Description for the changelog
Ensure autoSizingReserved is enabled on worker nodes only
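The manual check under "How to verify it" could be scripted along these lines. This is a sketch under assumptions: the PR does not show the contents of /etc/node-sizing-enabled.env, so the `NODE_SIZING_ENABLED` key and the sample values are assumed, and both function names are hypothetical.

```python
def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def node_sizing_enabled(text: str) -> bool:
    """Return True when the (assumed) NODE_SIZING_ENABLED key is 'true'."""
    return parse_env_file(text).get("NODE_SIZING_ENABLED", "false").lower() == "true"


if __name__ == "__main__":
    # On a node this text would come from /etc/node-sizing-enabled.env.
    sample = "NODE_SIZING_ENABLED=true\nSYSTEM_RESERVED_MEMORY=1Gi\n"
    print(node_sizing_enabled(sample))
```

With this PR, the expectation is that the check returns true on worker nodes and false on control plane nodes.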