Contributor

@pablintino pablintino commented Oct 6, 2025

Closes: #OCPBUGS-62510

- What I did

This commit slightly changes the behaviour of OS updates when PIS (PinnedImageSet) is not configured.
Before this change, if PIS was enabled, we checked whether the new OS image was present locally and, if so, requested the OS rebase from the locally stored copy, regardless of whether any PinnedImageSet was available for the node's pools.
With this change, the local check is performed only if PIS is both enabled and configured.

This minor behaviour change helps during upgrades from 4.19.10 to any version that has PIS enabled (it is enabled by default from 4.19.12). The machine-config-nodes-crd-cleanup job runs from the target image before the update, which caches the image locally and can lead to pull/verify errors if the pull policy does not allow local pulls.

Clusters with PIS configured won't benefit from this change if their pull policy is restrictive, because tweaking the pull policy is outside the scope of this change.
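
The before/after behaviour described above can be summarised as a small decision function. This is only an illustrative sketch: the `nodeState` type, its fields, and both function names are hypothetical and do not come from the MCO code base.

```go
package main

import "fmt"

// nodeState is a hypothetical, simplified view of what the daemon knows
// when deciding how to perform the OS rebase.
type nodeState struct {
	pisEnabled         bool // the PinnedImageSet feature is enabled
	pisConfigured      bool // a pool of this node references a PinnedImageSet
	imageLocallyStored bool // the target OS image already exists locally
}

// useLocalImageBefore models the old behaviour: the local copy is used
// whenever PIS is enabled and the image is present.
func useLocalImageBefore(s nodeState) bool {
	return s.pisEnabled && s.imageLocallyStored
}

// useLocalImageAfter models the new behaviour: the local check is only
// considered when PIS is both enabled and configured.
func useLocalImageAfter(s nodeState) bool {
	return s.pisEnabled && s.pisConfigured && s.imageLocallyStored
}

func main() {
	// Upgrade scenario from the description: PIS is enabled by default,
	// no PinnedImageSet is defined, and a pre-update job cached the image.
	s := nodeState{pisEnabled: true, pisConfigured: false, imageLocallyStored: true}
	fmt.Println("before:", useLocalImageBefore(s)) // before: true
	fmt.Println("after:", useLocalImageAfter(s))   // after: false
}
```

In the upgrade scenario the old logic selects the local copy while the new logic falls back to a regular pull, which is the case this PR targets.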

- How to verify it

  1. Deploy a <4.19.10 OCP cluster (important)
  2. As soon as it's deployed, apply the following patch to the Image cluster resource:
  spec:
    allowedRegistriesForImport:
    - domainName: registry.ci.openshift.org
      insecure: false
    - domainName: quay.io
      insecure: false
    - domainName: registry.redhat.io
      insecure: false
    - domainName: registry.connect.redhat.com
      insecure: false
    - domainName: registry.access.redhat.com
      insecure: false
    - domainName: registry-proxy.engineering.redhat.com
      insecure: false
    - domainName: registry.stage.redhat.io
      insecure: false
    - domainName: ghcr.io
      insecure: false
    registrySources:
      allowedRegistries:
      - registry.ci.openshift.org
      - quay.io
      - registry.redhat.io
      - registry.connect.redhat.com
      - registry.access.redhat.com
      - registry-proxy.engineering.redhat.com
      - registry.stage.redhat.io
      - ghcr.io
  3. Wait for the MCO to roll out the update to both the worker and master pools.
  4. Check that each node's /etc/containers/policy.json has been updated and reflects the changes made to the Image resource.
  5. Trigger an update to a release payload patched with this change. The update should succeed.
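
For repeated test runs, the patch from step 2 could be generated rather than typed by hand. The sketch below builds an equivalent JSON merge patch; `buildImagePatch` and `registryLocation` are hypothetical names, and only the field names are taken from the YAML above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// registryLocation mirrors one entry of spec.allowedRegistriesForImport.
type registryLocation struct {
	DomainName string `json:"domainName"`
	Insecure   bool   `json:"insecure"`
}

// buildImagePatch is a hypothetical helper that lists the same registries
// under both allowedRegistriesForImport and registrySources.allowedRegistries.
func buildImagePatch(registries []string) ([]byte, error) {
	locations := make([]registryLocation, 0, len(registries))
	for _, r := range registries {
		locations = append(locations, registryLocation{DomainName: r, Insecure: false})
	}
	patch := map[string]interface{}{
		"spec": map[string]interface{}{
			"allowedRegistriesForImport": locations,
			"registrySources": map[string]interface{}{
				"allowedRegistries": registries,
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	registries := []string{
		"registry.ci.openshift.org",
		"quay.io",
		"registry.redhat.io",
		"registry.connect.redhat.com",
		"registry.access.redhat.com",
		"registry-proxy.engineering.redhat.com",
		"registry.stage.redhat.io",
		"ghcr.io",
	}
	patch, err := buildImagePatch(registries)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

Assuming a merge patch is acceptable for this resource, the printed JSON could be passed to `oc patch image.config.openshift.io/cluster --type merge -p '<json>'`.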

- Description for the changelog
This change prevents OS update failures by skipping the check for a local image if a PinnedImageSet (PIS) is enabled but not actively configured.

Co-authored-by: Isabella Janssen <[email protected]>
Co-authored-by: Jerry Zhang <[email protected]>
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 6, 2025
@openshift-ci-robot
Contributor

@pablintino: This pull request references Jira Issue OCPBUGS-62510, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from umohnani8 and yuqi-zhang October 6, 2025 09:50
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 6, 2025
Contributor

openshift-ci bot commented Oct 6, 2025

@pablintino: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-windows | 5adf083 | link | false | /test e2e-aws-ovn-windows |
| ci/prow/okd-scos-e2e-aws-ovn | 5adf083 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/bootstrap-unit | 5adf083 | link | false | /test bootstrap-unit |
| ci/prow/e2e-gcp-op-ocl | 5adf083 | link | false | /test e2e-gcp-op-ocl |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | 5adf083 | link | false | /test e2e-azure-ovn-upgrade-out-of-change |
| ci/prow/e2e-aws-mco-disruptive | 5adf083 | link | false | /test e2e-aws-mco-disruptive |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@pablintino
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 6, 2025
@openshift-ci-robot
Contributor

@pablintino: This pull request references Jira Issue OCPBUGS-62510, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

@openshift-ci openshift-ci bot requested a review from sergiordlr October 6, 2025 16:23
@openshift-ci-robot
Contributor

@pablintino: This pull request references Jira Issue OCPBUGS-62510, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

Member

@isabella-janssen isabella-janssen left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 6, 2025
Contributor

@yuqi-zhang yuqi-zhang left a comment

Generally lgtm, no need for changes, just some questions inline

}

func (dn *Daemon) isPinnedImageSetConfigured() (bool, error) {
if dn.fgHandler == nil || !dn.fgHandler.Enabled(features.FeatureGatePinnedImages) || dn.node == nil || dn.mcpLister == nil {
Contributor

Since all versions that would need this fix have already GA'ed PIS, is the additional check here intended to allow for backporting to techpreview versions? Or is it just an additional check for consistency (until such time as we remove the PIS feature gate entirely)?

Contributor Author

I was thinking the exact same thing today. In >4.19.12, does it make sense to preserve the feature gate checks of PIS?

Member

I would say it is for consistency since the feature gate is technically still defined in the API. Sure, right now we are only putting this fix into versions where PIS is GA, but what if, for some reason we can't see today, someone wants to backport this further and just cherrypicks this PR? It seems Joel plans to raise bugs when it is time to remove feature gates, so I figured until that time we would keep FG checks for consistency (ref Slack message).

}

// PIS is enabled. Check if it's configured in any of its pools
pools, _, err := helpers.GetPoolsForNode(dn.mcpLister, dn.node)
Contributor

Hmm, re-reading the PIS implementation, I found it interesting that we used GetPoolsForNode for general PIS detection (which I assume this is also doing for consistency), since it kind of implies there's some inheritance model in place (i.e. custom pools would count as workers for any PIS pulls, and master+worker nodes would also count as both) like MachineConfigs, which... might be intended? (if I understand correctly, I don't actually ever need any PIS for custom pools, unless for some reason they are using a different base image, but then they'd pull both custom and worker? Not sure if that's the intended workflow)

Not something we should worry about for this PR, just something I wanted to raise while we're here.

Contributor Author

I was surprised too, because the initial draft implementation I did used the primary pool, but after reviewing what @isabella-janssen did I realized everything was considering all pools. Unfortunately, that didn't trigger an alarm for me, and I changed the call to this one as it's what we use everywhere for PIS.
I think this conversation needs to be captured and properly handled, so I've opened a low prio bug. If the call to GetPoolsForNode everywhere in the PIS code is the right one, we could close the bug with no action.

Member

Here's a thread where this was discussed back in March: https://redhat-internal.slack.com/archives/C076EFZF40M/p1741706350105449. I think there is room to reconsider, especially given the concerns you raised in the bug @pablintino.
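
To make the two models in this thread concrete, here is a minimal sketch contrasting an "any pool" check with a "primary pool" check. The `pool` type and both functions are hypothetical simplifications, not the MCO types or the actual `GetPoolsForNode` behaviour.

```go
package main

import "fmt"

// pool is a hypothetical stand-in for a MachineConfigPool: a name plus the
// names of the PinnedImageSets it references.
type pool struct {
	name            string
	pinnedImageSets []string
}

// configuredInAnyPool reports whether any pool the node belongs to
// references a PinnedImageSet (the "all pools" model).
func configuredInAnyPool(pools []pool) bool {
	for _, p := range pools {
		if len(p.pinnedImageSets) > 0 {
			return true
		}
	}
	return false
}

// configuredInPrimaryPool only looks at a single pool. How the primary
// pool is chosen is left to the caller and is out of scope here.
func configuredInPrimaryPool(primary pool) bool {
	return len(primary.pinnedImageSets) > 0
}

func main() {
	worker := pool{name: "worker", pinnedImageSets: []string{"worker-pis"}}
	custom := pool{name: "infra"}

	// A node in a custom pool is assumed to also match the worker pool.
	fmt.Println(configuredInAnyPool([]pool{worker, custom})) // true
	fmt.Println(configuredInPrimaryPool(custom))             // false
}
```

Under these assumptions, a node in a custom pool counts as "PIS configured" in the first model only because the worker pool references a PinnedImageSet, which is the inheritance question raised above.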

Contributor

openshift-ci bot commented Oct 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: isabella-janssen, pablintino, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [isabella-janssen,pablintino,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sdodson
Member

sdodson commented Oct 7, 2025

/verified bypass

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Oct 7, 2025
@openshift-ci-robot
Contributor

@sdodson: The verified label has been added.

@sdodson
Member

sdodson commented Oct 7, 2025

/cherry-pick release-4.20

@openshift-cherrypick-robot

@sdodson: once the present PR merges, I will cherry-pick it on top of release-4.20 in a new PR and assign it to you.

@sdodson sdodson merged commit 9e48587 into openshift:main Oct 7, 2025
15 of 22 checks passed
@openshift-ci-robot
Contributor

@pablintino: Jira Issue Verification Checks: Jira Issue OCPBUGS-62510
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-62510 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

@openshift-cherrypick-robot

@sdodson: new pull request created: #5340

@ptalgulk01

Pre-merge tested:
Verified using 4.20 IPI based Azure cluster.

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.20.0-0.nightly-2025-10-07-014413   True        False         90m     Cluster version is 4.20.0-0.nightly-2025-10-07-014413
  • Patch the spec field of the Image resource with the following registries
$ oc edit image.config.openshift.io/cluster -o yaml
  allowedRegistriesForImport:
  - domainName: registry.ci.openshift.org
    insecure: false
  - domainName: quay.io
    insecure: false
  - domainName: registry.redhat.io
    insecure: false
  - domainName: registry.connect.redhat.com
    insecure: false
  - domainName: registry.access.redhat.com
    insecure: false
  - domainName: registry-proxy.engineering.redhat.com
    insecure: false
  - domainName: registry.stage.redhat.io
    insecure: false
  - domainName: ghcr.io
    insecure: false
  - domainName: registry.build10.ci.openshift.org
    insecure: false
  registrySources:
    allowedRegistries:
    - registry.ci.openshift.org
    - quay.io
    - registry.redhat.io
    - registry.connect.redhat.com
    - registry.access.redhat.com
    - registry-proxy.engineering.redhat.com
    - registry.stage.redhat.io
    - ghcr.io
    - registry.build10.ci.openshift.org
  • Wait for the MCP update to complete
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-8f42c8109984e0091d0d41a421d8b2f9   False     True       False      3              0                   0                     0                      128m
worker   rendered-worker-2000262d3ba088b4a14fb8d622c49902   False     True       False      3              0                   0                     0                      128m
  • Check the policy is updated
Logs
$ oc debug node/ppt-07-20a-m9tql-worker-eastus2-f8md4 -- chroot /host cat /etc/containers/policy.json
Starting pod/ppt-07-20a-m9tql-worker-eastus2-f8md4-debug-ntmht ...
To use host binaries, run `chroot /host`
{
  "default": [
    {
      "type": "reject"
    }
  ],
  "transports": {
    "atomic": {
      "ghcr.io": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "quay.io": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry-proxy.engineering.redhat.com": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.access.redhat.com": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.build10.ci.openshift.org": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.ci.openshift.org": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.connect.redhat.com": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.redhat.io": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.stage.redhat.io": [
        {
          "type": "insecureAcceptAnything"
        }
      ]
    },
    "docker": {
      "ghcr.io": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "quay.io": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry-proxy.engineering.redhat.com": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.access.redhat.com": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.build10.ci.openshift.org": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.ci.openshift.org": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.connect.redhat.com": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.redhat.io": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.stage.redhat.io": [
        {
          "type": "insecureAcceptAnything"
        }
      ]
    },
    "docker-daemon": {
      "": [
        {
          "type": "insecureAcceptAnything"
        }
      ]
    }
  }
}
Removing debug pod ...
  • Upgrade the cluster to this image
$ oc adm upgrade --to-image registry.build10.ci.openshift.org/ci-ln-dp0svf2/release:latest  --allow-explicit-upgrade --force 
  • Check that the upgrade succeeded
$ oc get clusterversion
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.21.0-0-2025-10-07-082602-test-ci-ln-dp0svf2-latest   True        False         16m     Cluster version is 4.21.0-0-2025-10-07-082602-test-ci-ln-dp0svf2-latest

$ oc get co machine-config
NAME             VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.21.0-0-2025-10-07-082602-test-ci-ln-dp0svf2-latest   True        False         False      4h23m   
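
The policy check in the steps above was done by reading the file manually. As a rough sketch, the same check could be scripted by parsing the JSON and listing registries that have no entry under a given transport. The `policy` type and `missingRegistries` function are hypothetical, and the sample input is a shortened version of the policy.json shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// policy models only the part of policy.json needed here:
// transport name -> registry scope -> list of requirements.
type policy struct {
	Transports map[string]map[string][]json.RawMessage `json:"transports"`
}

// missingRegistries returns the registries from want that have no entry
// under the given transport (for example "docker").
func missingRegistries(raw []byte, transport string, want []string) ([]string, error) {
	var p policy
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, err
	}
	var missing []string
	for _, r := range want {
		if _, ok := p.Transports[transport][r]; !ok {
			missing = append(missing, r)
		}
	}
	return missing, nil
}

func main() {
	// Shortened sample; a real run would read the node's policy.json.
	raw := []byte(`{
	  "default": [{"type": "reject"}],
	  "transports": {
	    "docker": {"quay.io": [{"type": "insecureAcceptAnything"}]}
	  }
	}`)
	missing, err := missingRegistries(raw, "docker", []string{"quay.io", "ghcr.io"})
	if err != nil {
		panic(err)
	}
	fmt.Println(missing) // [ghcr.io]
}
```

An empty result for the registries added to the Image resource would indicate that the policy was rolled out as expected.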
