@ngopalak-redhat
Contributor

@ngopalak-redhat ngopalak-redhat commented Dec 12, 2025

- What I did

This PR re-introduces the functionality from PR #5390 with a safer scope.

The previous attempt to enable AutoSizingReserved was reverted in PR #5489 following OCPBUGS-66420. The investigation found that applying these reservations to control plane nodes caused memory starvation during upgrades (available memory dropped by ~2Gi), leading to cascading failures in etcd and the API server.

To address this while still delivering the feature for 4.21, this PR applies the following changes:

Re-enable AutoSizingReserved (Scoped to Workers)

External References

GKE: Other providers already use similar dynamic node sizing logic; see Plan Node Sizes (GKE).

Existing Capability: OpenShift already ships this feature in a non-default mode; see Red Hat Solution 5843241.
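
For a sense of scale, the sketch below works through the tiered memory reservation that GKE documents for its nodes. The tier percentages are quoted from memory of GKE's public documentation and should be checked against the linked page. `reserved_memory_mib` is a hypothetical helper, and this is not the calculation that OpenShift's AutoSizingReserved performs.

```python
# Tiered memory reservation as described in GKE's node sizing documentation.
# ASSUMPTION: tier values are recalled from GKE's public docs; verify before relying on them.
GKE_MEMORY_TIERS = [
    (4 * 1024, 0.25),      # 25% of the first 4 GiB
    (4 * 1024, 0.20),      # 20% of the next 4 GiB (up to 8 GiB)
    (8 * 1024, 0.10),      # 10% of the next 8 GiB (up to 16 GiB)
    (112 * 1024, 0.06),    # 6% of the next 112 GiB (up to 128 GiB)
    (float("inf"), 0.02),  # 2% of anything above 128 GiB
]


def reserved_memory_mib(total_mib: float) -> float:
    """Return the memory (MiB) reserved on a node with total_mib of RAM."""
    if total_mib < 1024:
        return 255.0  # flat reservation for machines under 1 GiB
    reserved = 0.0
    remaining = total_mib
    for band, fraction in GKE_MEMORY_TIERS:
        portion = min(remaining, band)
        reserved += portion * fraction
        remaining -= portion
        if remaining <= 0:
            break
    return reserved


for gib in (8, 16):
    print(f"{gib} GiB node -> {reserved_memory_mib(gib * 1024):.1f} MiB reserved")
```

Under these tiers a 16 GiB node reserves about 2.6 GiB, which shows why a size-based reservation can noticeably reduce allocatable memory on smaller nodes.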

Additional Testing

To ensure stability, we are specifically re-running the test suites that triggered the original revert (upgrade and conformance tests).

We will verify that excluding the control plane nodes resolves the API server and etcd starvation issues observed in the previous attempt.

Impact & User Action

Opt-Out: We do not plan to roll back this change if concerns arise. Instead, customers who need to disable this behavior should use a KubeletConfig to opt out.

Documentation: A blog post and official documentation will be published to explain these changes, allowing customers to adjust workloads or opt out as needed.
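
As a rough illustration of the opt-out, the sketch below builds a KubeletConfig manifest that sets `autoSizingReserved` to false for the worker pool and prints it as JSON. It assumes that an explicit `autoSizingReserved: false` is the opt-out meant here, since this PR does not spell out the manifest. The object name `disable-auto-sizing-reserved` and the helper `opt_out_kubelet_config` are hypothetical.

```python
import json


def opt_out_kubelet_config(pool: str = "worker",
                           name: str = "disable-auto-sizing-reserved") -> dict:
    """Build a KubeletConfig manifest that turns auto-sizing off for one pool.

    ASSUMPTION: setting autoSizingReserved to false is the intended opt-out;
    the object name is made up for this example.
    """
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "KubeletConfig",
        "metadata": {"name": name},
        "spec": {
            "autoSizingReserved": False,
            # Select the MachineConfigPool by its conventional pool label.
            "machineConfigPoolSelector": {
                "matchLabels": {
                    f"pools.operator.machineconfiguration.openshift.io/{pool}": ""
                }
            },
        },
    }


if __name__ == "__main__":
    # JSON is accepted wherever a YAML manifest is, e.g. piped to `oc apply -f -`.
    print(json.dumps(opt_out_kubelet_config(), indent=2))
```

Applying a KubeletConfig normally triggers a rollout of the selected pool, so it is worth scheduling the change accordingly.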

Reference Issues

Reverts: #5489

Original PR: #5390

Incident: OCPBUGS-66420

- How to verify it

Verified by SSHing into the control plane (CP) and worker nodes and checking the /etc/node-sizing-enabled.env file on each.
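
The manual check above can be sketched as a small parser over the file's contents. This assumes the file holds KEY=VALUE lines and that the flag is named `NODE_SIZING_ENABLED`; neither is stated in this PR, and the sample strings are invented for illustration rather than captured from a node.

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def node_sizing_enabled(text: str, key: str = "NODE_SIZING_ENABLED") -> bool:
    """Return True when the env file content enables auto node sizing.

    ASSUMPTION: the key name is a guess at the file format; pass a
    different key if the file on the node uses another name.
    """
    return parse_env(text).get(key, "false").lower() == "true"


# Invented sample contents, one per node role.
sample_worker = "NODE_SIZING_ENABLED=true\n"
sample_control_plane = "NODE_SIZING_ENABLED=false\n"

print(node_sizing_enabled(sample_worker))         # True
print(node_sizing_enabled(sample_control_plane))  # False
```

The file content can be fetched over SSH as described, or with `oc debug node/<node> -- chroot /host cat /etc/node-sizing-enabled.env`.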

- Description for the changelog

Ensure autoSizingReserved is enabled on worker nodes only

@openshift-ci
Contributor

openshift-ci bot commented Dec 12, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label Dec 12, 2025
@ngopalak-redhat
Contributor Author

/payload-aggregate-with-prs periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-upgrade-fips 10 openshift/cluster-api#254

@ngopalak-redhat
Contributor Author

/test all

@openshift-ci
Contributor

openshift-ci bot commented Dec 12, 2025

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-upgrade-fips

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/528debb0-d755-11f0-826b-d2c4b80a4f0a-0

@neisw
Contributor

neisw commented Dec 12, 2025

/test ?

@openshift-ci
Contributor

openshift-ci bot commented Dec 12, 2025

@neisw: The following commands are available to trigger required jobs:

/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test e2e-gcp-op-1of2
/test e2e-gcp-op-2of2
/test e2e-gcp-op-single-node
/test e2e-hypershift
/test images
/test okd-scos-images
/test unit
/test verify
/test verify-deps

The following commands are available to trigger optional jobs:

/test bootstrap-unit
/test e2e-agent-compact-ipv4
/test e2e-aws-disruptive
/test e2e-aws-mco-disruptive
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-fips-op
/test e2e-aws-ovn-ocb-techpreview
/test e2e-aws-ovn-serial-ipsec
/test e2e-aws-ovn-upgrade-ipsec
/test e2e-aws-ovn-upgrade-ocb-techpreview
/test e2e-aws-ovn-upgrade-out-of-change
/test e2e-aws-ovn-windows
/test e2e-aws-ovn-workers-rhel8
/test e2e-aws-proxy
/test e2e-aws-serial
/test e2e-aws-single-node
/test e2e-aws-upgrade-single-node
/test e2e-aws-workers-rhel8
/test e2e-azure
/test e2e-azure-ovn-multidisk-techpreview
/test e2e-azure-ovn-upgrade
/test e2e-azure-ovn-upgrade-out-of-change
/test e2e-azure-upgrade
/test e2e-gcp-mco-disruptive
/test e2e-gcp-op
/test e2e-gcp-op-ocl
/test e2e-gcp-op-techpreview
/test e2e-gcp-ovn-rt-upgrade
/test e2e-gcp-rt
/test e2e-gcp-rt-op
/test e2e-gcp-single-node
/test e2e-gcp-upgrade
/test e2e-hypershift-techpreview
/test e2e-metal-assisted
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-ipv6
/test e2e-metal-ovn-two-node-arbiter
/test e2e-metal-ovn-two-node-fencing
/test e2e-openstack
/test e2e-openstack-dualstack
/test e2e-openstack-externallb
/test e2e-openstack-hypershift
/test e2e-openstack-parallel
/test e2e-openstack-singlestackv6
/test e2e-ovirt
/test e2e-ovirt-upgrade
/test e2e-ovn-step-registry
/test e2e-vsphere
/test e2e-vsphere-ovn-disk-setup-techpreview
/test e2e-vsphere-ovn-upi
/test e2e-vsphere-ovn-upi-zones
/test e2e-vsphere-ovn-zones
/test e2e-vsphere-upgrade
/test okd-scos-e2e-aws-ovn
/test security

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-config-operator-main-bootstrap-unit
pull-ci-openshift-machine-config-operator-main-e2e-aws-ovn
pull-ci-openshift-machine-config-operator-main-e2e-aws-ovn-upgrade
pull-ci-openshift-machine-config-operator-main-e2e-gcp-op-1of2
pull-ci-openshift-machine-config-operator-main-e2e-gcp-op-2of2
pull-ci-openshift-machine-config-operator-main-e2e-gcp-op-single-node
pull-ci-openshift-machine-config-operator-main-e2e-hypershift
pull-ci-openshift-machine-config-operator-main-images
pull-ci-openshift-machine-config-operator-main-okd-scos-images
pull-ci-openshift-machine-config-operator-main-security
pull-ci-openshift-machine-config-operator-main-unit
pull-ci-openshift-machine-config-operator-main-verify
pull-ci-openshift-machine-config-operator-main-verify-deps

@ngopalak-redhat ngopalak-redhat force-pushed the ngopalak/mco_auto_node_master_disable branch from 10a91b0 to 9dee7b6 December 12, 2025 15:14
@ngopalak-redhat ngopalak-redhat changed the title from "Keep control plane not auto-sized" to "OCPNODE-3973: Keep control plane not auto-sized" Dec 12, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label Dec 12, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Dec 12, 2025

@ngopalak-redhat: This pull request references OCPNODE-3973 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "4.22." or "openshift-4.22.", but it targets "openshift-4.21" instead.

@openshift-ci-robot
Contributor

openshift-ci-robot commented Dec 12, 2025

@ngopalak-redhat: This pull request references OCPNODE-3973 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "4.22." or "openshift-4.22.", but it targets "openshift-4.21" instead.

@openshift-ci-robot
Contributor

openshift-ci-robot commented Dec 12, 2025

@ngopalak-redhat: This pull request references OCPNODE-3973 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "4.22." or "openshift-4.22.", but it targets "openshift-4.21" instead.

@ngopalak-redhat ngopalak-redhat marked this pull request as ready for review December 12, 2025 15:18
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label Dec 12, 2025
@ngopalak-redhat
Contributor Author

/hold
Waiting for #5491 (comment) to pass

@openshift-ci openshift-ci bot added the do-not-merge/hold label Dec 12, 2025
@ngopalak-redhat
Contributor Author

/payload-aggregate-with-prs periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-upgrade-fips 10 openshift/cluster-api#254
/payload-aggregate-with-prs periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-upgrade-fips 10 openshift/cluster-api#254

@openshift-ci
Contributor

openshift-ci bot commented Dec 12, 2025

@ngopalak-redhat: given command is invalid: at least one of the commands given is only supported on a one-command-per-comment basis, please separate out commands as multiple comments

@neisw
Contributor

neisw commented Dec 12, 2025

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-upgrade-fips 10

@openshift-ci
Contributor

openshift-ci bot commented Dec 12, 2025

@neisw: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-upgrade-fips

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5c2e6180-d77e-11f0-93f7-7e9662b42cf1-0

@openshift-merge-robot openshift-merge-robot added the needs-rebase label Dec 12, 2025
@ngopalak-redhat ngopalak-redhat changed the title from "OCPNODE-3973: Keep control plane not auto-sized" to "OCPNODE-3973: Default CPU/Memory changes to Workers - AutoSizingResreved/SystemReservedCompressible" Dec 16, 2025
@ngopalak-redhat ngopalak-redhat changed the title from "OCPNODE-3973: Default CPU/Memory changes to Workers - AutoSizingResreved/SystemReservedCompressible" to "OCPNODE-3973: Default CPU/Memory changes to Workers - AutoSizingReserved/SystemReservedCompressible" Dec 16, 2025
@ngopalak-redhat ngopalak-redhat force-pushed the ngopalak/mco_auto_node_master_disable branch from 9dee7b6 to e7f7854 December 16, 2025 02:58
@openshift-merge-robot openshift-merge-robot removed the needs-rebase label Dec 16, 2025
@ngopalak-redhat
Contributor Author

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-upgrade-fips 10
/test e2e-aws-mco-disruptive

@ngopalak-redhat
Contributor Author

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-upgrade-fips 10

@openshift-ci
Contributor

openshift-ci bot commented Dec 17, 2025

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-upgrade-fips

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/20904d90-db2b-11f0-82ea-8927f4c1e39c-0

@ngopalak-redhat
Contributor Author

/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2

@openshift-ci
Contributor

openshift-ci bot commented Dec 17, 2025

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4a5f7820-db2c-11f0-8429-c3e9fc589c76-0

@openshift-ci-robot
Contributor

openshift-ci-robot commented Dec 17, 2025

@ngopalak-redhat: This pull request references OCPNODE-3973 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "4.22." or "openshift-4.22.", but it targets "openshift-4.21" instead.

@ngopalak-redhat ngopalak-redhat changed the title from "OCPNODE-3973: Default CPU/Memory changes to Workers - AutoSizingReserved/SystemReservedCompressible" to "OCPNODE-3973: Default CPU/Memory changes to Workers - AutoSizingReserved" Dec 17, 2025
@ngopalak-redhat ngopalak-redhat force-pushed the ngopalak/mco_auto_node_master_disable branch from e7f7854 to a844964 December 17, 2025 17:13
@ngopalak-redhat
Contributor Author

/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2 periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-2of2 periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-1of3 periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-2of3 periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-3of3

@openshift-ci
Contributor

openshift-ci bot commented Dec 17, 2025

@ngopalak-redhat: trigger 5 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2
  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-2of2
  • periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-1of3
  • periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-2of3
  • periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e62148d0-db6b-11f0-9fb9-c944a402697a-0

@ngopalak-redhat
Contributor Author

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-upgrade-fips 10

@openshift-ci
Contributor

openshift-ci bot commented Dec 17, 2025

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-upgrade-fips

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f5633ab0-db6b-11f0-9e13-27f86c4e078c-0

@ngopalak-redhat
Contributor Author

/payload-job periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-2of3 periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-3of3

@openshift-ci
Contributor

openshift-ci bot commented Dec 18, 2025

@ngopalak-redhat: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-2of3
  • periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ae123040-dbb0-11f0-92b8-61ac30a584ff-0

… nodes

OCPNODE-3719: Default Enablement of Auto Sizing Reserved in OpenShift 4.21

- Enable AutoSizingReserved by default for worker nodes
- Disable AutoSizingReserved for master/control-plane nodes
- Disable AutoSizingReserved for arbiter nodes
- Disable AutoSizingReserved for Hypershift clusters
- Add corresponding tests for the new behavior

This combines changes from PR openshift#5390.
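
The scoping rules in the commit message above can be sketched as a single decision function. This is an illustrative model only, not the operator's implementation; `NodeContext` and `auto_sizing_default` are hypothetical names, and treating any role other than worker as disabled is an assumption, since the commit message lists only worker, master/control-plane, arbiter, and Hypershift.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeContext:
    """Minimal description of where a node runs (hypothetical type)."""
    role: str                 # e.g. "worker", "master", "arbiter"
    hypershift: bool = False  # True for Hypershift clusters


def auto_sizing_default(ctx: NodeContext) -> bool:
    """Return the default AutoSizingReserved setting for a node.

    Mirrors the commit message: on for workers; off for
    master/control-plane nodes, arbiter nodes, and Hypershift clusters.
    ASSUMPTION: any other role falls back to off.
    """
    if ctx.hypershift:
        return False
    return ctx.role == "worker"


for ctx in (NodeContext("worker"), NodeContext("master"),
            NodeContext("arbiter"), NodeContext("worker", hypershift=True)):
    print(ctx, "->", auto_sizing_default(ctx))
```

Checking the Hypershift flag first keeps the rule order explicit: the cluster type overrides the node role.
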
@ngopalak-redhat ngopalak-redhat force-pushed the ngopalak/mco_auto_node_master_disable branch from a844964 to e97d1a0 December 18, 2025 01:32
@ngopalak-redhat
Contributor Author

/retest-required

@ngopalak-redhat
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold label Dec 18, 2025
@ngopalak-redhat
Contributor Author

/jira refresh

@openshift-ci-robot
Contributor

openshift-ci-robot commented Dec 18, 2025

@ngopalak-redhat: This pull request references OCPNODE-3973 which is a valid jira issue.

@openshift-ci
Contributor

openshift-ci bot commented Dec 18, 2025

@ngopalak-redhat: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/bootstrap-unit | e97d1a0 | link | false | /test bootstrap-unit |

Full PR test history. Your PR dashboard.

@ngopalak-redhat
Contributor Author

/verified by ci

@openshift-ci-robot openshift-ci-robot added the verified label Dec 18, 2025
@openshift-ci-robot
Contributor

@ngopalak-redhat: This PR has been marked as verified by ci.

@ngopalak-redhat
Contributor Author

@haircommander Please review

@haircommander
Member

/lgtm
/hold

until TRT acks according to the release cycle requirements

@openshift-ci openshift-ci bot added the do-not-merge/hold label Dec 18, 2025
@openshift-ci openshift-ci bot added the lgtm label Dec 18, 2025
@openshift-ci
Contributor

openshift-ci bot commented Dec 18, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: haircommander, ngopalak-redhat
Once this PR has been reviewed and has the lgtm label, please assign yuqi-zhang for approval. For more information see the Code Review Process.

Labels

- do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
- jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
- lgtm: Indicates that a PR is ready to be merged.
- verified: Signifies that the PR passed pre-merge verification criteria

5 participants