OCPBUGS-62510: Skip rpm-ostree local rebase if no PIS #5333
Conversation
This commit slightly changes the behaviour of OS updates when PIS is not configured.

Before this change, if PIS was enabled we checked whether the new OS image was already present locally and, if so, requested the OS rebase to be performed from the locally stored copy, regardless of whether any PinnedImageSet was available for the node's pools. With this change, that local check is only performed if PIS is both enabled and configured.

This minor behaviour change helps during upgrades from 4.19.10 to any version that has PIS enabled (from 4.19.12 it is enabled by default), because the machine-config-nodes-crd-cleanup job runs with the target image before the update, leaving the image cached locally and leading to possible pull/verify errors if the pull policy does not allow local pulls. Clusters with PIS configured won't benefit from this change if their pull policy is restrictive, as the scope of this change does not cover tweaking the pull policy.

Co-authored-by: Isabella Janssen <[email protected]>
Co-authored-by: Jerry Zhang <[email protected]>
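Conceptually, the fix moves the "is the target image already on disk?" check behind an "is PIS actually configured for this node?" check. Below is a minimal sketch of that flow, assuming hypothetical helper names (`shouldUseLocalImage`, `isImagePresentLocally`) around the real `isPinnedImageSetConfigured` helper added by this PR; it illustrates the gating order, not the exact MCO code:

```go
// Sketch only: shows the order of checks introduced by this change.
func (dn *Daemon) shouldUseLocalImage(imageURL string) (bool, error) {
	configured, err := dn.isPinnedImageSetConfigured()
	if err != nil {
		return false, err
	}
	if !configured {
		// PIS disabled or not configured for the node's pools: let
		// rpm-ostree pull the image itself, honoring the cluster's
		// pull/verification policy instead of rebasing from a copy that
		// happens to be cached locally (e.g. by the
		// machine-config-nodes-crd-cleanup job).
		return false, nil
	}
	// PIS is enabled and configured: a locally pinned copy is expected,
	// so the rebase may use it when present.
	return dn.isImagePresentLocally(imageURL) // hypothetical helper
}
```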
@pablintino: This pull request references Jira Issue OCPBUGS-62510, which is invalid.
The bug has been updated to refer to the pull request using the external bug tracker.
@pablintino: The following tests failed; say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests.
/jira refresh
@pablintino: This pull request references Jira Issue OCPBUGS-62510, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug. Requesting review from QA contact.
@pablintino: This pull request references Jira Issue OCPBUGS-62510, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact.
/lgtm
Generally lgtm, no need for changes, just some questions inline
func (dn *Daemon) isPinnedImageSetConfigured() (bool, error) {
	if dn.fgHandler == nil || !dn.fgHandler.Enabled(features.FeatureGatePinnedImages) || dn.node == nil || dn.mcpLister == nil {
Since all versions that would need this fix have already GA'ed PIS, is the additional check here intended to allow for backporting to tech-preview versions? Or is it just an additional check for consistency (until such time as we remove the PIS feature gate entirely)?
I was thinking the exact same thing today. In >4.19.12, does it make sense to preserve the feature-gate checks for PIS?
I would say it is for consistency, since the feature gate is technically still defined in the API. Sure, right now we are only putting this fix into versions where PIS is GA, but what if, for some reason we can't see today, someone wants to backport this further and just cherry-picks this PR? It seems Joel plans to raise bugs when it is time to remove feature gates, so I figured until that time we would keep FG checks for consistency (ref Slack message).
	}

	// PIS is enabled. Check if it's configured in any of its pools
	pools, _, err := helpers.GetPoolsForNode(dn.mcpLister, dn.node)
Hmm, re-reading the PIS implementation, I found it interesting that we used GetPoolsForNode for general PIS detection (which I assume this is also doing for consistency), since it kind of implies there's some inheritance model in place (i.e. custom pools would count as workers for any PIS pulls, and master+worker nodes would count as both), like MachineConfigs, which... might be intended? (If I understand correctly, I don't actually ever need any PIS for custom pools, unless for some reason they use a different base image, but then they'd pull both custom and worker? Not sure if that's the intended workflow.)
Not something we should worry about for this PR, just something I wanted to raise while we're here.
I was surprised too, because the initial draft implementation I did used the primary pool, but after reviewing what @isabella-janssen did I realized everything was using/considering all pools. Unfortunately, that didn't trigger an alarm inside me and I changed the call to this one, as it's what we use everywhere for PIS.
I think this conversation needs to be captured and properly handled, so I've opened a low-priority bug. If the call to GetPoolsForNode everywhere in the PIS code is the right one, we could close the bug with no action.
Here's a thread where this was discussed back in March: https://redhat-internal.slack.com/archives/C076EFZF40M/p1741706350105449. I think there is room to reconsider, especially given the concerns you raised in the bug @pablintino.
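Putting the two quoted fragments together, here is a minimal sketch of how the helper could read end to end. The guard and the `GetPoolsForNode` call come from the diff above, while the per-pool matching step (`anyPinnedImageSetMatches`) is a hypothetical stand-in for the actual lookup:

```go
func (dn *Daemon) isPinnedImageSetConfigured() (bool, error) {
	// PIS must be enabled via feature gate and the daemon must have the
	// node and pool listers it needs; otherwise report "not configured".
	if dn.fgHandler == nil || !dn.fgHandler.Enabled(features.FeatureGatePinnedImages) || dn.node == nil || dn.mcpLister == nil {
		return false, nil
	}

	// PIS is enabled. Check if it's configured in any of the node's pools.
	// GetPoolsForNode returns every pool the node belongs to, so a
	// master+worker node, or a worker in a custom pool, matches a
	// PinnedImageSet targeting any of those pools (the inheritance
	// behaviour questioned in this thread).
	pools, _, err := helpers.GetPoolsForNode(dn.mcpLister, dn.node)
	if err != nil {
		return false, err
	}
	for _, pool := range pools {
		matched, err := dn.anyPinnedImageSetMatches(pool) // hypothetical
		if err != nil {
			return false, err
		}
		if matched {
			return true, nil
		}
	}
	return false, nil
}
```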
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: isabella-janssen, pablintino, yuqi-zhang.
/verified bypass
/cherry-pick release-4.20
@sdodson: once the present PR merges, I will cherry-pick it on top of release-4.20.
@pablintino: Jira Issue Verification Checks: Jira Issue OCPBUGS-62510 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
@sdodson: new pull request created: #5340
Pre-merge tested:
Logs:
$ oc debug node/ppt-07-20a-m9tql-worker-eastus2-f8md4 -- chroot /host cat /etc/containers/policy.json
Starting pod/ppt-07-20a-m9tql-worker-eastus2-f8md4-debug-ntmht ...
To use host binaries, run `chroot /host`
{
  "default": [{ "type": "reject" }],
  "transports": {
    "atomic": {
      "ghcr.io": [{ "type": "insecureAcceptAnything" }],
      "quay.io": [{ "type": "insecureAcceptAnything" }],
      "registry-proxy.engineering.redhat.com": [{ "type": "insecureAcceptAnything" }],
      "registry.access.redhat.com": [{ "type": "insecureAcceptAnything" }],
      "registry.build10.ci.openshift.org": [{ "type": "insecureAcceptAnything" }],
      "registry.ci.openshift.org": [{ "type": "insecureAcceptAnything" }],
      "registry.connect.redhat.com": [{ "type": "insecureAcceptAnything" }],
      "registry.redhat.io": [{ "type": "insecureAcceptAnything" }],
      "registry.stage.redhat.io": [{ "type": "insecureAcceptAnything" }]
    },
    "docker": {
      "ghcr.io": [{ "type": "insecureAcceptAnything" }],
      "quay.io": [{ "type": "insecureAcceptAnything" }],
      "registry-proxy.engineering.redhat.com": [{ "type": "insecureAcceptAnything" }],
      "registry.access.redhat.com": [{ "type": "insecureAcceptAnything" }],
      "registry.build10.ci.openshift.org": [{ "type": "insecureAcceptAnything" }],
      "registry.ci.openshift.org": [{ "type": "insecureAcceptAnything" }],
      "registry.connect.redhat.com": [{ "type": "insecureAcceptAnything" }],
      "registry.redhat.io": [{ "type": "insecureAcceptAnything" }],
      "registry.stage.redhat.io": [{ "type": "insecureAcceptAnything" }]
    },
    "docker-daemon": {
      "": [{ "type": "insecureAcceptAnything" }]
    }
  }
}
Removing debug pod ...
$ oc adm upgrade --to-image registry.build10.ci.openshift.org/ci-ln-dp0svf2/release:latest --allow-explicit-upgrade --force
$ oc get clusterversion
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.21.0-0-2025-10-07-082602-test-ci-ln-dp0svf2-latest   True        False         16m     Cluster version is 4.21.0-0-2025-10-07-082602-test-ci-ln-dp0svf2-latest

$ oc get co machine-config
NAME             VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.21.0-0-2025-10-07-082602-test-ci-ln-dp0svf2-latest   True        False         False      4h23m
Closes: #OCPBUGS-62510
- What I did
This commit slightly changes the behaviour of OS updates when PIS is not configured.
Before this change, if PIS was enabled we checked whether the new OS image was already present locally and, if so, requested the OS rebase to be performed from the locally stored copy, regardless of whether any PinnedImageSet was available for the node's pools.
With this change, that local check is only performed if PIS is both enabled and configured.
This minor behaviour change helps during upgrades from 4.19.10 to any version that has PIS enabled (from 4.19.12 it is enabled by default), because the machine-config-nodes-crd-cleanup job runs with the target image before the update, leaving the image cached locally and leading to possible pull/verify errors if the pull policy does not allow local pulls.
Clusters with PIS configured won't benefit from this change if their pull policy is restrictive, as the scope of this change does not cover tweaking the pull policy.
- How to verify it
Check that /etc/containers/policy.json on the nodes has been updated and reflects the changes performed to the Image cluster resource.
- Description for the changelog
This change prevents OS update failures by skipping the check for a local image if a PinnedImageSet (PIS) is enabled but not actively configured.