
Conversation

djoshy
Contributor

@djoshy djoshy commented Oct 3, 2025

- What I did
This PR adds support for boot image updates to ControlPlaneMachineSet for the AWS, Azure and GCP platforms. A couple of key points to know about CPMS:

  • They are singletons in the Machine API namespace, typically named cluster. The boot images are stored under spec, in a field similar to MachineSets. For example, on AWS (abbreviated to the important fields):
spec:
  template:
    machineType: machines_v1beta1_machine_openshift_io
    machines_v1beta1_machine_openshift_io:
      metadata:
        labels:
          machine.openshift.io/cluster-api-cluster: ci-op-l4pngh10-79b69-zrm8p
          machine.openshift.io/cluster-api-machine-role: master
          machine.openshift.io/cluster-api-machine-type: master
      spec:
        providerSpec:
          value:
            ami:
              id: ami-09d23adad19cdb25c
  • They have a rollout strategy defined in spec.strategy.type, which can be set to RollingUpdate, Recreate or OnDelete. In RollingUpdate mode, any deviation between the CPMS spec and the nodes would cause a complete control plane replacement, which is undesirable if the only deviation is boot images: the nodes pivot to the latest RHCOS image described by the OCP release image anyway, so the replacement would effectively be a no-op that only adds to upgrade time. To avoid this, the CPMS operator was updated to ignore boot image fields during control plane machine reconciliation.
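For reference, on AWS the boot image currently referenced by the CPMS can be read back with a jsonpath query; this is a sketch against a live cluster, with the field path mirroring the abbreviated spec above:

```shell
# Read the AMI currently referenced by the ControlPlaneMachineSet (AWS).
# The jsonpath follows the abbreviated spec shown above.
oc get controlplanemachineset cluster \
  -n openshift-machine-api \
  -o jsonpath='{.spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.ami.id}'
```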

- How to verify it

  1. Create an AWS/GCP/Azure cluster in the TechPreview feature set.
  2. Take a backup of the current CPMS object named cluster for comparison purposes.
  3. Opt in to CPMS boot image updates using the MachineConfiguration object:
apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
  namespace: openshift-machine-config-operator
spec:
  logLevel: Normal
  operatorLogLevel: Normal
  managedBootImages:
    machineManagers:
      - resource: controlplanemachinesets
        apiGroup: machine.openshift.io
        selection:
          mode: All
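The opt-in above can also be applied as a patch against the existing singleton instead of replacing the whole manifest; a sketch, equivalent to the YAML above:

```shell
# Merge-patch the MachineConfiguration singleton to opt the CPMS
# into managed boot image updates.
oc patch machineconfiguration cluster \
  -n openshift-machine-config-operator \
  --type merge -p '
spec:
  managedBootImages:
    machineManagers:
      - resource: controlplanemachinesets
        apiGroup: machine.openshift.io
        selection:
          mode: All'
```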
  4. Modify the boot image field to an older value. This will vary per platform:
  • For AWS, use an older known AMI like ami-00abe7f9c6bd85a77.
  • For GCP, modify the image field to any value that starts with projects/rhcos-cloud/global/images/, for example projects/rhcos-cloud/global/images/test.
  • For Azure, the existing boot image will be updated automatically, without any manipulation on your part. This is because Azure clusters currently use gallery images, which this feature updates to the latest marketplace images. Once Azure clusters install with marketplace images directly, the tester will need to manipulate the image to exercise the Azure path.
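For AWS, for example, the rollback in this step can be done with a JSON patch; a sketch, using the older known AMI ID mentioned above:

```shell
# Point the CPMS at an older AMI so the boot image controller has drift to reconcile.
oc patch controlplanemachineset cluster \
  -n openshift-machine-api \
  --type json -p '[
  {"op": "replace",
   "path": "/spec/template/machines_v1beta1_machine_openshift_io/spec/providerSpec/value/ami/id",
   "value": "ami-00abe7f9c6bd85a77"}
]'
```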
  5. Examine the MachineConfiguration object's status to see if the CPMS was reconciled successfully. The CPMS boot image fields should reflect the values you initially saw post-install. These are the values described in the coreos-bootimages configmap. The machine-config-controller logs should also mention that a boot image update took place.
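The checks in this step can be run roughly as follows (a sketch; the grep string is illustrative, not the controller's exact log message):

```shell
# Inspect the MachineConfiguration status for boot image reconciliation conditions.
oc get machineconfiguration cluster -o yaml

# Golden boot image values shipped with the release, for comparison.
oc get configmap coreos-bootimages -n openshift-machine-config-operator -o yaml

# Controller logs should mention the boot image update.
oc logs -n openshift-machine-config-operator deployment/machine-config-controller \
  | grep -i "boot image"
```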
  6. You can now attempt to resize the control plane by deleting one of the control plane machines. The CPMS operator should scale up a new machine to satisfy its spec.replicas value, and it should be able to do so successfully.
  7. Now, opt out the cluster from CPMS boot image updates:
apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
  namespace: openshift-machine-config-operator
spec:
  logLevel: Normal
  operatorLogLevel: Normal
  managedBootImages:
    machineManagers:
      - resource: controlplanemachinesets
        apiGroup: machine.openshift.io
        selection:
          mode: None
  8. Modify the boot image to an older value (see step 4). For Azure, you could modify the version field to an older value.
  9. Examine the MachineConfiguration object's status to see if the CPMS object was reconciled successfully. The CPMS boot image fields should reflect the values you set, and not the values described in the coreos-bootimages configmap. The machine-config-controller logs should also mention that a boot image update did not take place.
  10. All done! You have now successfully tested CPMS boot image updates!

Note: Since these are singleton objects, the Partial selection mode is not permitted when specifying boot image configuration, so that mode does not need to be tested. The API server will reject any attempt to set Partial for CPMS objects, so I suppose that is something to test as well! 😄
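For that negative test, a manifest like the following should be rejected at admission. This is a sketch: the partial selector shape follows the existing MachineSet opt-in, and the exact error text comes from the API validation.

```yaml
# Expected to be REJECTED: Partial selection is not permitted for controlplanemachinesets.
apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
  namespace: openshift-machine-config-operator
spec:
  managedBootImages:
    machineManagers:
      - resource: controlplanemachinesets
        apiGroup: machine.openshift.io
        selection:
          mode: Partial
          partial:
            machineResourceSelector:
              matchLabels: {}
```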

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 3, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 3, 2025

openshift-ci bot commented Oct 3, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot

openshift-ci-robot commented Oct 3, 2025

@djoshy: This pull request references MCO-1807 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

[DNM, testing]

Opened for initial testing. Currently, the controller looks at the standard MAPI MachineSet boot image opinion; when openshift/api#2396 lands, this PR can be updated to actually check for the CPMS type. It also does not look for the CPMS feature gate, for the same reason.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


openshift-ci bot commented Oct 3, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 3, 2025
@djoshy

djoshy commented Oct 3, 2025

/test all

@openshift-ci-robot

openshift-ci-robot commented Oct 6, 2025

@djoshy: This pull request references MCO-1807 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

[DNM, testing]

Opened for initial testing. Currently, the controller looks at the standard MAPI MachineSet boot image opinion; when openshift/api#2396 lands, this PR can be updated to actually check for the CPMS type. It also does not look for the CPMS feature gate, for the same reason.



@djoshy

djoshy commented Oct 6, 2025

/test verify

This captures updates for the ManagedBootImages API
@openshift-ci-robot

openshift-ci-robot commented Oct 9, 2025

@djoshy: This pull request references MCO-1807 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.


@djoshy djoshy marked this pull request as ready for review October 9, 2025 13:48
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 9, 2025
@djoshy

djoshy commented Oct 9, 2025

Opening this up for initial review; I've integrated the API from openshift/api#2396.

go func() { ctrl.syncMAPIMachineSets("MAPIMachinesetDeleted") }()
}

func (ctrl *Controller) addControlPlaneMachineSet(obj interface{}) {

@isabella-janssen isabella-janssen Oct 9, 2025

non-blocking nit: comments overviewing the functions throughout this file might be useful (though the function names are pretty self-explanatory)

@djoshy (Contributor Author)
thanks, will update on my next pass 😄


openshift-ci bot commented Oct 9, 2025

@djoshy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-op-ocl 502f475 link false /test e2e-gcp-op-ocl
ci/prow/e2e-azure-ovn-upgrade-out-of-change 502f475 link false /test e2e-azure-ovn-upgrade-out-of-change
ci/prow/e2e-aws-mco-disruptive 502f475 link false /test e2e-aws-mco-disruptive
ci/prow/e2e-gcp-op-single-node 42d5df2 link true /test e2e-gcp-op-single-node
ci/prow/okd-scos-e2e-aws-ovn 42d5df2 link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
