OCPBUGS-62510: Skip rpm-ostree local rebase if no PIS #5333

pablintino · 2025-10-06T09:50:15Z

Closes: #OCPBUGS-62510

- What I did

This commit slightly changes the behaviour of OS updates when no PinnedImageSet (PIS) is configured.
Before this change, if PIS was enabled, we checked whether the new OS image was present locally and, if so, requested the OS rebase from the locally stored copy, regardless of whether any PinnedImageSet existed for the node's pools.
With this change, the local check is only performed if PIS is both enabled and configured.

This minor behaviour change helps during upgrades from 4.19.10 to any version that has PIS enabled (it is enabled by default from 4.19.12). The machine-config-nodes-crd-cleanup job runs from the target image before the update, which caches the image locally and can lead to pull/verify errors if the pull policy does not allow local pulls.

Clusters with PIS configured won't benefit from this change if their pull policy is restrictive, because tweaking the pull policy is outside the scope of this change.
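
The before/after behaviour described above can be summarised as a small decision function. This is only a sketch under assumptions: `rebaseInputs`, `useLocalImageBefore` and `useLocalImageAfter` are hypothetical names for illustration and are not taken from the MCO code.

```go
package main

import "fmt"

// rebaseInputs is a hypothetical summary of the facts the daemon
// considers before deciding how to request the OS rebase.
type rebaseInputs struct {
	PISEnabled    bool // the PinnedImages feature gate is enabled
	PISConfigured bool // at least one of the node's pools has a PinnedImageSet
	ImageIsLocal  bool // the target OS image is already in local storage
}

// useLocalImageBefore models the old behaviour: the local copy is used
// whenever PIS is enabled and the image happens to be present locally.
func useLocalImageBefore(in rebaseInputs) bool {
	return in.PISEnabled && in.ImageIsLocal
}

// useLocalImageAfter models the new behaviour: the local copy is only
// considered when PIS is enabled and actually configured.
func useLocalImageAfter(in rebaseInputs) bool {
	return in.PISEnabled && in.PISConfigured && in.ImageIsLocal
}

func main() {
	// Upgrade scenario from the description: PIS is enabled by default,
	// no PinnedImageSet exists, and the cleanup job cached the image.
	in := rebaseInputs{PISEnabled: true, PISConfigured: false, ImageIsLocal: true}
	fmt.Println("before:", useLocalImageBefore(in))
	fmt.Println("after:", useLocalImageAfter(in))
}
```

In the upgrade scenario the old logic picks the local copy while the new logic falls back to a regular pull, which is the case this PR targets.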

- How to verify it

Deploy a <4.19.10 OCP cluster (important)
As soon as it's deployed, apply the following patch to the Image cluster resource:

  spec:
    allowedRegistriesForImport:
    - domainName: registry.ci.openshift.org
      insecure: false
    - domainName: quay.io
      insecure: false
    - domainName: registry.redhat.io
      insecure: false
    - domainName: registry.connect.redhat.com
      insecure: false
    - domainName: registry.access.redhat.com
      insecure: false
    - domainName: registry-proxy.engineering.redhat.com
      insecure: false
    - domainName: registry.stage.redhat.io
      insecure: false
    - domainName: ghcr.io
      insecure: false
    registrySources:
      allowedRegistries:
      - registry.ci.openshift.org
      - quay.io
      - registry.redhat.io
      - registry.connect.redhat.com
      - registry.access.redhat.com
      - registry-proxy.engineering.redhat.com
      - registry.stage.redhat.io
      - ghcr.io

Wait for the MCO to roll out the update to both worker and master pools
Check that node's /etc/containers/policy.json has been updated and reflects the changes performed to the Image resource.
Trigger an update to a release payload patched with this change. The update should succeed.
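
The policy.json check in the steps above could be automated along these lines. This is a minimal sketch, assuming the file follows the `transports -> <transport> -> <registry>` layout of containers policy files; `missingRegistries` and the sample data are hypothetical and not part of the MCO.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// requirement is one policy requirement, e.g. {"type": "reject"}.
type requirement struct {
	Type string `json:"type"`
}

// policy models only the parts of policy.json needed for this check.
type policy struct {
	Default    []requirement                       `json:"default"`
	Transports map[string]map[string][]requirement `json:"transports"`
}

// missingRegistries returns the registries from want that have no
// entry under the given transport in the policy document.
func missingRegistries(raw []byte, transport string, want []string) ([]string, error) {
	var p policy
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, fmt.Errorf("parsing policy: %w", err)
	}
	scopes := p.Transports[transport] // nil map lookups are safe in Go
	var missing []string
	for _, r := range want {
		if _, ok := scopes[r]; !ok {
			missing = append(missing, r)
		}
	}
	return missing, nil
}

func main() {
	// Trimmed, illustrative sample; a real file would be read from the node.
	raw := []byte(`{
	  "default": [{"type": "reject"}],
	  "transports": {
	    "docker": {
	      "quay.io": [{"type": "insecureAcceptAnything"}],
	      "ghcr.io": [{"type": "insecureAcceptAnything"}]
	    }
	  }
	}`)
	missing, err := missingRegistries(raw, "docker", []string{"quay.io", "ghcr.io", "registry.redhat.io"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("missing registries:", missing)
}
```

An empty result for every allowed registry would indicate that the Image resource change has reached the node.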

- Description for the changelog
This change prevents OS update failures by skipping the check for a local image if a PinnedImageSet (PIS) is enabled but not actively configured.

Co-authored-by: Isabella Janssen
Co-authored-by: Jerry Zhang

openshift-ci-robot · 2025-10-06T09:50:22Z

@pablintino: This pull request references Jira Issue OCPBUGS-62510, which is invalid:

expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2025-10-06T14:40:01Z

@pablintino: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-ovn-windows	`5adf083`	link	false	`/test e2e-aws-ovn-windows`
ci/prow/okd-scos-e2e-aws-ovn	`5adf083`	link	false	`/test okd-scos-e2e-aws-ovn`
ci/prow/bootstrap-unit	`5adf083`	link	false	`/test bootstrap-unit`
ci/prow/e2e-gcp-op-ocl	`5adf083`	link	false	`/test e2e-gcp-op-ocl`
ci/prow/e2e-azure-ovn-upgrade-out-of-change	`5adf083`	link	false	`/test e2e-azure-ovn-upgrade-out-of-change`
ci/prow/e2e-aws-mco-disruptive	`5adf083`	link	false	`/test e2e-aws-mco-disruptive`

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

pablintino · 2025-10-06T16:23:27Z

/jira refresh

openshift-ci-robot · 2025-10-06T16:23:37Z

@pablintino: This pull request references Jira Issue OCPBUGS-62510, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.21.0) matches configured target version for branch (4.21.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

openshift-ci-robot · 2025-10-06T16:29:51Z

@pablintino: This pull request references Jira Issue OCPBUGS-62510, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.21.0) matches configured target version for branch (4.21.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

isabella-janssen

/lgtm

yuqi-zhang

Generally lgtm, no need for changes, just some questions inline

yuqi-zhang · 2025-10-06T21:06:28Z

pkg/daemon/update.go

 }

+func (dn *Daemon) isPinnedImageSetConfigured() (bool, error) {
+	if dn.fgHandler == nil || !dn.fgHandler.Enabled(features.FeatureGatePinnedImages) || dn.node == nil || dn.mcpLister == nil {


Since all versions that would need this fix have already GA'ed PIS, is the additional check here intended to allow backporting to techpreview versions? Or is it just an additional check for consistency (until such time as we remove the PIS feature gate entirely)?

pablintino · 2025-10-07

I was thinking the exact same thing today. In >4.19.12, does it make sense to preserve the PIS feature gate checks?

isabella-janssen · 2025-10-07

I would say it is for consistency, since the feature gate is technically still defined in the API. Sure, right now we are only putting this fix into versions where PIS is GA, but what if, for some reason we can't foresee today, someone wants to backport this further and just cherry-picks this PR? It seems Joel plans to raise bugs when it is time to remove feature gates, so I figured we would keep the FG checks for consistency until then (ref Slack message).
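
To make the guard under discussion concrete, here is a simplified sketch of the check. It is an assumption-laden model: `pool` and `daemonState` are hypothetical stand-ins for the real MachineConfigPool and Daemon types, and the loop over pools is a guess at the part of the function that the hunk above does not show.

```go
package main

import "fmt"

// pool is a simplified, hypothetical stand-in for a MachineConfigPool.
type pool struct {
	Name            string
	PinnedImageSets []string // names of PinnedImageSets the pool references
}

// daemonState is a hypothetical stand-in for the daemon fields the guard reads.
type daemonState struct {
	PISGateEnabled bool   // outcome of the PinnedImages feature gate lookup
	NodeKnown      bool   // the daemon holds a node object
	Pools          []pool // pools resolved for the node
}

// isPinnedImageSetConfigured reports true only when the feature gate is
// enabled, the node is known, and some pool references a PinnedImageSet.
func isPinnedImageSetConfigured(d daemonState) bool {
	// Guard: keep the feature gate check even where PIS is GA, so the
	// function stays safe if it is cherry-picked to older branches.
	if !d.PISGateEnabled || !d.NodeKnown {
		return false
	}
	for _, p := range d.Pools {
		if len(p.PinnedImageSets) > 0 {
			return true
		}
	}
	return false
}

func main() {
	d := daemonState{
		PISGateEnabled: true,
		NodeKnown:      true,
		Pools:          []pool{{Name: "worker"}},
	}
	fmt.Println("configured:", isPinnedImageSetConfigured(d))
}
```

Keeping the gate check costs one extra condition and, as argued above, avoids surprises if the change lands on a branch where the gate is still off.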

yuqi-zhang · 2025-10-06T21:15:37Z

pkg/daemon/update.go

+	}
+
+	// PIS is enabled. Check if it's configured in any of its pools
+	pools, _, err := helpers.GetPoolsForNode(dn.mcpLister, dn.node)


Hmm, re-reading the PIS implementation, I found it interesting that we use GetPoolsForNode for general PIS detection (which I assume this is also doing for consistency). It implies there is an inheritance model in place, like the one for MachineConfigs: custom pools would count as workers for any PIS pulls, and master+worker nodes would count as both. That might be intended? If I understand correctly, I never actually need a PIS for custom pools unless, for some reason, they use a different base image, but then they'd pull both the custom and the worker sets? Not sure if that's the intended workflow.

Not something we should worry about for this PR, just something I wanted to raise while we're here.

pablintino · 2025-10-07

I was surprised too, because my initial draft implementation used the primary pool. After reviewing what @isabella-janssen did, I realized everything was considering all pools. Unfortunately, that didn't raise an alarm for me, and I changed the call to this one, as it's what we use everywhere for PIS.
I think this conversation needs to be captured and properly handled, so I've opened a low-priority bug. If calling GetPoolsForNode everywhere in the PIS code turns out to be right, we can close the bug with no action.

isabella-janssen · 2025-10-07

Here's a thread where this was discussed back in March: https://redhat-internal.slack.com/archives/C076EFZF40M/p1741706350105449. I think there is room to reconsider, especially given the concerns you raised in the bug @pablintino.
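
The difference between the two detection strategies raised in this thread can be illustrated with a toy model. Everything here is hypothetical: `pool`, `configuredInAnyPool` and `configuredInPrimaryPool` are illustrative names, and the caller supplies the primary pool name rather than relying on how the MCO actually selects it.

```go
package main

import "fmt"

// pool is a simplified, hypothetical stand-in for a MachineConfigPool.
type pool struct {
	Name            string
	PinnedImageSets []string
}

// configuredInAnyPool models the "all pools" approach: a PinnedImageSet
// on any pool the node belongs to counts.
func configuredInAnyPool(pools []pool) bool {
	for _, p := range pools {
		if len(p.PinnedImageSets) > 0 {
			return true
		}
	}
	return false
}

// configuredInPrimaryPool models the alternative: only the pool named
// primary is inspected.
func configuredInPrimaryPool(pools []pool, primary string) bool {
	for _, p := range pools {
		if p.Name == primary {
			return len(p.PinnedImageSets) > 0
		}
	}
	return false
}

func main() {
	// A node in a custom "infra" pool that also matches "worker";
	// only the worker pool references a PinnedImageSet.
	pools := []pool{
		{Name: "infra"},
		{Name: "worker", PinnedImageSets: []string{"worker-pis"}},
	}
	fmt.Println("any pool:", configuredInAnyPool(pools))
	fmt.Println("primary only:", configuredInPrimaryPool(pools, "infra"))
}
```

For the custom-pool node in this example the two strategies disagree, which is the inheritance question the follow-up bug is meant to settle.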

openshift-ci · 2025-10-06T21:17:32Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: isabella-janssen, pablintino, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [isabella-janssen,pablintino,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sdodson · 2025-10-07T14:28:18Z

/verified bypass

openshift-ci-robot · 2025-10-07T14:28:32Z

@sdodson: The verified label has been added.

sdodson · 2025-10-07T14:35:23Z

/cherry-pick release-4.20

openshift-cherrypick-robot · 2025-10-07T14:35:26Z

@sdodson: once the present PR merges, I will cherry-pick it on top of release-4.20 in a new PR and assign it to you.

openshift-ci-robot · 2025-10-07T14:37:32Z

@pablintino: Jira Issue Verification Checks: Jira Issue OCPBUGS-62510
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-62510 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

openshift-cherrypick-robot · 2025-10-07T14:38:17Z

@sdodson: new pull request created: #5340

ptalgulk01 · 2025-10-07T14:45:13Z

Pre-merge tested:
Verified using a 4.20 IPI-based Azure cluster.

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.20.0-0.nightly-2025-10-07-014413   True        False         90m     Cluster version is 4.20.0-0.nightly-2025-10-07-014413

Patch the spec field of the Image resource with the following registries

$ oc edit image.config.openshift.io/cluster -o yaml
  allowedRegistriesForImport:
  - domainName: registry.ci.openshift.org
    insecure: false
  - domainName: quay.io
    insecure: false
  - domainName: registry.redhat.io
    insecure: false
  - domainName: registry.connect.redhat.com
    insecure: false
  - domainName: registry.access.redhat.com
    insecure: false
  - domainName: registry-proxy.engineering.redhat.com
    insecure: false
  - domainName: registry.stage.redhat.io
    insecure: false
  - domainName: ghcr.io
    insecure: false
  - domainName: registry.build10.ci.openshift.org
    insecure: false
  registrySources:
    allowedRegistries:
    - registry.ci.openshift.org
    - quay.io
    - registry.redhat.io
    - registry.connect.redhat.com
    - registry.access.redhat.com
    - registry-proxy.engineering.redhat.com
    - registry.stage.redhat.io
    - ghcr.io
    - registry.build10.ci.openshift.org

Wait for the MCP update to complete

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-8f42c8109984e0091d0d41a421d8b2f9   False     True       False      3              0                   0                     0                      128m
worker   rendered-worker-2000262d3ba088b4a14fb8d622c49902   False     True       False      3              0                   0                     0                      128m

Check the policy is updated

Logs

$ oc debug node/ppt-07-20a-m9tql-worker-eastus2-f8md4 -- chroot /host cat /etc/containers/policy.json
Starting pod/ppt-07-20a-m9tql-worker-eastus2-f8md4-debug-ntmht ...
To use host binaries, run `chroot /host`
{
  "default": [
    {
      "type": "reject"
    }
  ],
  "transports": {
    "atomic": {
      "ghcr.io": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "quay.io": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry-proxy.engineering.redhat.com": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.access.redhat.com": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.build10.ci.openshift.org": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.ci.openshift.org": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.connect.redhat.com": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.redhat.io": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.stage.redhat.io": [
        {
          "type": "insecureAcceptAnything"
        }
      ]
    },
    "docker": {
      "ghcr.io": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "quay.io": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry-proxy.engineering.redhat.com": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.access.redhat.com": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.build10.ci.openshift.org": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.ci.openshift.org": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.connect.redhat.com": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.redhat.io": [
        {
          "type": "insecureAcceptAnything"
        }
      ],
      "registry.stage.redhat.io": [
        {
          "type": "insecureAcceptAnything"
        }
      ]
    },
    "docker-daemon": {
      "": [
        {
          "type": "insecureAcceptAnything"
        }
      ]
    }
  }
}
Removing debug pod ...

Upgrade the cluster to this image

$ oc adm upgrade --to-image registry.build10.ci.openshift.org/ci-ln-dp0svf2/release:latest  --allow-explicit-upgrade --force

Check that the upgrade succeeded

$ oc get clusterversion
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.21.0-0-2025-10-07-082602-test-ci-ln-dp0svf2-latest   True        False         16m     Cluster version is 4.21.0-0-2025-10-07-082602-test-ci-ln-dp0svf2-latest

$ oc get co machine-config
NAME             VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.21.0-0-2025-10-07-082602-test-ci-ln-dp0svf2-latest   True        False         False      4h23m

openshift-ci bot requested review from umohnani8 and yuqi-zhang October 6, 2025 09:50

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Oct 6, 2025

openshift-ci-robot added the jira/valid-bug label (indicates that a referenced Jira bug is valid for the branch this PR is targeting) and removed the jira/invalid-bug label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) Oct 6, 2025

openshift-ci bot requested a review from sergiordlr October 6, 2025 16:23

isabella-janssen reviewed Oct 6, 2025

openshift-ci bot assigned isabella-janssen Oct 6, 2025

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) Oct 6, 2025

yuqi-zhang approved these changes Oct 6, 2025

pablintino mentioned this pull request Oct 7, 2025

OCPBUGS-62788: Skip rpm-ostree local rebase if no PIS #5337

Open

openshift-ci-robot added the verified label (signifies that the PR passed pre-merge verification criteria) Oct 7, 2025

sdodson merged commit 9e48587 into openshift:main Oct 7, 2025
15 of 22 checks passed

openshift-cherrypick-robot mentioned this pull request Oct 7, 2025

[release-4.20] OCPBUGS-62803: Skip rpm-ostree local rebase if no PIS #5340

Open
