OTA-1393: status: recognize the process of migration to multi-arch #1920
Conversation
@hongkailiu: This pull request references OTA-1393 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
One tiny nit, which you can lift the hold and merge unchanged if you like. Otherwise:
/lgtm
/lgtm
/hold cancel
@hongkailiu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/hold I'll need to think a bit about our target state. This PR is an improvement, but there are still items that are confusing: estimate and node states, for example.
Testing result with 7aebefa
$ OC_ENABLE_CMD_UPGRADE_STATUS=true ./oc adm upgrade status
The cluster is not updating.
$ ./oc adm upgrade --to-multi-arch
Requested update to multi cluster architecture
$ OC_ENABLE_CMD_UPGRADE_STATUS=true ./oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health': failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Update to 4.18.0-ec.3 successfully completed at 2024-11-21T15:43:55Z (duration: 39s)
All control plane nodes successfully updated to 4.18.0-ec.3
= Worker Upgrade =
WORKER POOL ASSESSMENT COMPLETION STATUS
worker Completed 100% (3/3) 3 Available, 0 Progressing, 0 Draining
Worker Pool Nodes: worker
NAME ASSESSMENT PHASE VERSION EST MESSAGE
ip-10-0-114-196.us-east-2.compute.internal Completed Updated 4.18.0-ec.3 -
ip-10-0-16-27.us-east-2.compute.internal Completed Updated 4.18.0-ec.3 -
ip-10-0-19-218.us-east-2.compute.internal Completed Updated 4.18.0-ec.3 -
= Update Health =
SINCE LEVEL IMPACT MESSAGE
39s Info None Update is proceeding well
### One or two minutes later ...
$ OC_ENABLE_CMD_UPGRADE_STATUS=true ./oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health': failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment: Progressing
Target Version: 4.18.0-ec.3 (from 4.18.0-ec.3)
Completion: 97% (32 operators updated, 0 updating, 1 waiting)
Duration: 1m32s (Est. Time Remaining: 1h10m)
Operator Health: 33 Healthy
All control plane nodes successfully updated to 4.18.0-ec.3
= Worker Upgrade =
WORKER POOL ASSESSMENT COMPLETION STATUS
worker Completed 100% (3/3) 3 Available, 0 Progressing, 0 Draining
Worker Pool Nodes: worker
NAME ASSESSMENT PHASE VERSION EST MESSAGE
ip-10-0-114-196.us-east-2.compute.internal Completed Updated 4.18.0-ec.3 -
ip-10-0-16-27.us-east-2.compute.internal Completed Updated 4.18.0-ec.3 -
ip-10-0-19-218.us-east-2.compute.internal Completed Updated 4.18.0-ec.3 -
= Update Health =
SINCE LEVEL IMPACT MESSAGE
1m32s Info None Update is proceeding well
### After the upgrade is complete
$ OC_ENABLE_CMD_UPGRADE_STATUS=true ./oc adm upgrade status
The cluster is not updating.
$ ./oc adm upgrade
Cluster version is 4.18.0-ec.3
Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.18 (available channels: candidate-4.18)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
I guess you mean these two lines; let me think about how I can make them better:

Duration: 6m55s (Est. Time Remaining: 1h4m)
All control plane nodes successfully updated to 4.18.0-ec.3

(and the "Worker Updated" claims too). On one hand: ideally we should fix the "Updated" status for the nodes somehow. The node status depends on the MCP annotation "machineconfiguration.openshift.io/release-image-version", which exposes no image pull spec. This makes the estimated time of the upgrade inaccurate because the nodes' status is wrong. On the other hand: maybe we should not display that information (node status and estimated time) for the migration to multi-arch, and bring it back once MCO implements another annotation like "machineconfiguration.openshift.io/operator-image".
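To make the ambiguity concrete, here is a minimal sketch (not the actual oc implementation; nodeUpdated and both annotation-key constants are illustrative, and the proposed "operator-image" annotation does not exist in MCO yet): with only a version annotation available, a multi-arch migration that swaps the image pull spec without changing the version string is indistinguishable from "already updated".

```go
package main

import "fmt"

const (
	// Exists in MCO today; exposes only a version string, not a pull spec.
	releaseVersionAnnotation = "machineconfiguration.openshift.io/release-image-version"
	// Proposed in the discussion above; hypothetical, not implemented yet.
	operatorImageAnnotation = "machineconfiguration.openshift.io/operator-image"
)

// nodeUpdated reports whether a node can be considered updated. When only
// the version annotation is available, a multi-arch migration keeps the
// version string identical, so the comparison is ambiguous.
func nodeUpdated(annotations map[string]string, targetVersion, targetImage string) (updated bool, ambiguous bool) {
	if img, ok := annotations[operatorImageAnnotation]; ok {
		// Pull specs differ across architectures, so this comparison
		// would detect the migration reliably.
		return img == targetImage, false
	}
	// Fall back to the version string: correct for normal updates,
	// misleading during a multi-arch migration.
	return annotations[releaseVersionAnnotation] == targetVersion, true
}

func main() {
	updated, ambiguous := nodeUpdated(map[string]string{
		releaseVersionAnnotation: "4.18.0-ec.3",
	}, "4.18.0-ec.3", "quay.io/example/release@sha256:abc")
	fmt.Println(updated, ambiguous) // the node looks updated, but ambiguously so
}
```

This is why the comment suggests hiding node status during the migration until a pull-spec-bearing annotation exists.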
@@ -0,0 +1,28 @@
= Control Plane =
Assessment: Progressing
Target Version: 4.18.0-ec.3 (from 4.18.0-ec.3)
This line is confusing (do we have the arch somewhere in CV to show in this case?)
Target Version: 4.18.0-ec.3 (from 4.18.0-ec.3)
Updating: machine-config
Completion: 97% (32 operators updated, 1 updating, 0 waiting)
Duration: 6m55s (Est. Time Remaining: 1h4m)
Est. Time will be very incorrect - it shows the baseline duration (either 1h12m or the previous update duration) for the first 10 minutes, and then starts projecting - and the projection will be wild, because the 97% completion will be reached so fast.
This is the least important of my concerns though, but I wonder if we could find a simple way to special-case this (detect the situation and just show N/A).
machine-config 6m5s - Working towards 4.18.0-ec.3

All control plane nodes successfully updated to 4.18.0-ec.3

= Worker Upgrade =

WORKER POOL ASSESSMENT COMPLETION STATUS
worker Completed 100% (3/3) 2 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME ASSESSMENT PHASE VERSION EST MESSAGE
ip-10-0-22-179.us-east-2.compute.internal Unavailable Updated 4.18.0-ec.3 - Node is marked unschedulable
ip-10-0-17-117.us-east-2.compute.internal Completed Updated 4.18.0-ec.3 -
ip-10-0-72-40.us-east-2.compute.internal Completed Updated 4.18.0-ec.3 -
I'm concerned about the claim that everything is Completed (individual nodes, worker pool assessment, control plane nodes...).
What is our ambition for this wrt. OTA-1393? Will there be a followup to fix this or not?
I understand there may be reasons why it is hard to detect the transition for an individual node - then I think we need a card (OTA preferably) to fix this.
if operator.Name == "machine-config" {
	for _, version := range operator.Status.Versions {
		if version.Name == "operator-image" {
			if targetImagePullSpec := getMCOImagePullSpec(mcoDeployment); targetImagePullSpec != version.Version {
The targetImagePullSpec := getMCOImagePullSpec(mcoDeployment) can be computed just once; we do not need to do that inside every for{ for{ iteration. So let's do it at the start of the function - and maybe even before that, and just pass the mcoImage inside as a string - which significantly simplifies its signature.
Reaching to MCO deployment is also quite a hack, so I think we should have a comment about why we do it this way.
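The hoist being suggested could look like this sketch (the ClusterOperator/OperandVersion types are simplified stand-ins for the real openshift/api types, and mcoUpdating is an illustrative name, not the actual function in this PR): the caller computes the MCO image pull spec once and passes it in as a plain string.

```go
package main

import "fmt"

// Simplified stand-ins for the openshift/api ClusterOperator status types.
type OperandVersion struct {
	Name    string
	Version string
}

type ClusterOperator struct {
	Name     string
	Versions []OperandVersion
}

// mcoUpdating reports whether the machine-config operator is still moving
// toward the target image. mcoImage is computed once by the caller (e.g.
// from the MCO deployment) rather than inside every loop iteration, which
// also keeps the deployment-reaching hack confined to one place.
func mcoUpdating(operators []ClusterOperator, mcoImage string) bool {
	for _, operator := range operators {
		if operator.Name != "machine-config" {
			continue
		}
		for _, version := range operator.Versions {
			if version.Name == "operator-image" && version.Version != mcoImage {
				return true
			}
		}
	}
	return false
}

func main() {
	ops := []ClusterOperator{{
		Name:     "machine-config",
		Versions: []OperandVersion{{Name: "operator-image", Version: "quay.io/old"}},
	}}
	fmt.Println(mcoUpdating(ops, "quay.io/new")) // true: images differ, still migrating
}
```

Passing the string in also makes the function trivially unit-testable, since no Deployment lookup is needed inside it.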
LGTM - I'm fine with dealing with my review in a followup; I think it is better to have this in 4.18. /label acknowledge-critical-fixes-only
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: hongkailiu, petr-muller, wking. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/hold cancel The comments will be addressed with follow-up PRs.
Merged commit 9e56829 into openshift:master
[ART PR BUILD NOTIFIER] Distgit: openshift-enterprise-cli
[ART PR BUILD NOTIFIER] Distgit: ose-tools
[ART PR BUILD NOTIFIER] Distgit: openshift-enterprise-deployer
[ART PR BUILD NOTIFIER] Distgit: ose-cli-artifacts
/cc @wking @petr-muller