
OTA-1393: status: recognize the process of migration to multi-arch #1920

Merged
merged 1 commit on Nov 22, 2024

Conversation

hongkailiu
Member

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 21, 2024
@openshift-ci-robot

openshift-ci-robot commented Nov 21, 2024

@hongkailiu: This pull request references OTA-1393 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

/cc @wking @petr-muller

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from petr-muller and wking November 21, 2024 02:56
@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 21, 2024
Member

@wking left a comment


One tiny nit, which you can lift the hold and merge unchanged if you like. Otherwise:

/lgtm

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 21, 2024
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 21, 2024
Member

@wking left a comment


/lgtm
/hold cancel

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Nov 21, 2024
Contributor

openshift-ci bot commented Nov 21, 2024

@hongkailiu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn 7aebefa link false /test okd-scos-e2e-aws-ovn
ci/prow/security 7aebefa link false /test security

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@petr-muller
Member

/hold

I'll need to think a bit about our target state. This PR is an improvement but there are still items that are confusing: estimate and node states, for example.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 21, 2024
@hongkailiu
Member Author

Testing result with 7aebefa

launch 4.18.0-ec.3 aws,amd64

$ OC_ENABLE_CMD_UPGRADE_STATUS=true ./oc adm upgrade status
The cluster is not updating.

$ ./oc adm upgrade --to-multi-arch
Requested update to multi cluster architecture

$ OC_ENABLE_CMD_UPGRADE_STATUS=true ./oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Update to 4.18.0-ec.3 successfully completed at 2024-11-21T15:43:55Z (duration: 39s)

All control plane nodes successfully updated to 4.18.0-ec.3

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Completed    100% (3/3)   3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                                         ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-114-196.us-east-2.compute.internal   Completed    Updated   4.18.0-ec.3   -     
ip-10-0-16-27.us-east-2.compute.internal     Completed    Updated   4.18.0-ec.3   -     
ip-10-0-19-218.us-east-2.compute.internal    Completed    Updated   4.18.0-ec.3   -     

= Update Health =
SINCE   LEVEL   IMPACT   MESSAGE
39s     Info    None     Update is proceeding well

### ONE OR TWO MINS LATER ...

$ OC_ENABLE_CMD_UPGRADE_STATUS=true ./oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.18.0-ec.3 (from 4.18.0-ec.3)
Completion:      97% (32 operators updated, 0 updating, 1 waiting)
Duration:        1m32s (Est. Time Remaining: 1h10m)
Operator Health: 33 Healthy

All control plane nodes successfully updated to 4.18.0-ec.3

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Completed    100% (3/3)   3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                                         ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-114-196.us-east-2.compute.internal   Completed    Updated   4.18.0-ec.3   -     
ip-10-0-16-27.us-east-2.compute.internal     Completed    Updated   4.18.0-ec.3   -     
ip-10-0-19-218.us-east-2.compute.internal    Completed    Updated   4.18.0-ec.3   -     

= Update Health =
SINCE   LEVEL   IMPACT   MESSAGE
1m32s   Info    None     Update is proceeding well

### After the upgrade is complete

$ OC_ENABLE_CMD_UPGRADE_STATUS=true ./oc adm upgrade status
The cluster is not updating.

$ ./oc adm upgrade                                         
Cluster version is 4.18.0-ec.3

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.18 (available channels: candidate-4.18)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.

@hongkailiu
Member Author

hongkailiu commented Nov 21, 2024

items that are confusing: estimate and node states, for example.

I guess you mean those two lines. Let me think about how I can make them better.

Duration:        6m55s (Est. Time Remaining: 1h4m)

All control plane nodes successfully updated to 4.18.0-ec.3

and "Worker Updated" too.


On one hand, some observations from my test:

  • All nodes (masters and workers) get rebooted during migration.
  • MCO does not claim the new image pull spec in co.status.versions[name="operator-image"] until all nodes get rebooted.

Ideally we should modify "Updated" for the nodes somehow. The node status depends on MCP's annotation "machineconfiguration.openshift.io/release-image-version" which exposes no image pull spec.

This makes the estimated time of the upgrade inaccurate, as the nodes' status is wrong.

On the other hand:
@wking has also made the point that "all nodes get rebooted" has nothing to do with "multi-arch readiness".

Maybe we should not display that information (node status and estimated time) for the migration to multi-arch?
To me, not showing it because "I do not know" is better than showing wrong information.

Then we can get them back when MCO implements another annotation like "machineconfiguration.openshift.io/operator-image".
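The limitation being discussed can be sketched in Go. This is an illustration only, not code from this PR: the helper names are hypothetical, and the second annotation is the one proposed above, which MCO does not expose today.

```go
package main

import "fmt"

// nodeUpdated sketches today's check: node status can only compare the
// release-image-version annotation, which exposes no image pull spec, so a
// multi-arch migration (same version, different image) looks "Updated"
// immediately.
func nodeUpdated(annotations map[string]string, targetVersion string) bool {
	return annotations["machineconfiguration.openshift.io/release-image-version"] == targetVersion
}

// nodeUpdatedWithImage sketches what becomes possible if MCO exposed a
// hypothetical "machineconfiguration.openshift.io/operator-image"
// annotation carrying the pull spec.
func nodeUpdatedWithImage(annotations map[string]string, targetVersion, targetImagePullSpec string) bool {
	return nodeUpdated(annotations, targetVersion) &&
		annotations["machineconfiguration.openshift.io/operator-image"] == targetImagePullSpec
}

func main() {
	ann := map[string]string{
		"machineconfiguration.openshift.io/release-image-version": "4.18.0-ec.3",
		"machineconfiguration.openshift.io/operator-image":        "img:single-arch",
	}
	// Same version, different image: the version-only check says updated...
	fmt.Println(nodeUpdated(ann, "4.18.0-ec.3")) // true
	// ...while an image-aware check would correctly say not yet.
	fmt.Println(nodeUpdatedWithImage(ann, "4.18.0-ec.3", "img:multi-arch")) // false
}
```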

@@ -0,0 +1,28 @@
= Control Plane =
Assessment: Progressing
Target Version: 4.18.0-ec.3 (from 4.18.0-ec.3)
@petr-muller commented Nov 22, 2024

This line is confusing (do we have the arch somewhere in CV to show in this case?)


Target Version: 4.18.0-ec.3 (from 4.18.0-ec.3)
Updating: machine-config
Completion: 97% (32 operators updated, 1 updating, 0 waiting)
Duration: 6m55s (Est. Time Remaining: 1h4m)
@petr-muller

Est. Time. will be very incorrect - it shows the baseline duration (either 1h12m or previous update duration) for the first 10 minutes, and then starts projecting - and the projection will be wild, because the 97% completion will be reached so fast.

This is the least important of my concerns though, but I wonder if we could find a simple way to special-case this (detect the situation and just show N/A).


Comment on lines +11 to +24
machine-config 6m5s - Working towards 4.18.0-ec.3

All control plane nodes successfully updated to 4.18.0-ec.3

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Completed    100% (3/3)   2 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
ip-10-0-22-179.us-east-2.compute.internal   Unavailable   Updated   4.18.0-ec.3   -     Node is marked unschedulable
ip-10-0-17-117.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -
ip-10-0-72-40.us-east-2.compute.internal    Completed     Updated   4.18.0-ec.3   -
@petr-muller

I'm concerned about the claim that everything is Completed (individual nodes, worker pool assessment, control plane nodes...)

What is our ambition for this wrt. OTA-1393? Will there be a followup to fix this or not?

I understand there may be reasons why it is hard to detect the transition for the individual node - then I think we need a card (OTA preferably) to fix this.

if operator.Name == "machine-config" {
	for _, version := range operator.Status.Versions {
		if version.Name == "operator-image" {
			if targetImagePullSpec := getMCOImagePullSpec(mcoDeployment); targetImagePullSpec != version.Version {
@petr-muller commented Nov 22, 2024

The targetImagePullSpec := getMCOImagePullSpec(mcoDeployment) can be computed just once; we do not need to do that inside every iteration of the nested for loops. So let's do it at the start of the function, and maybe even before that, and just pass the mcoImage in as a string, which significantly simplifies its signature.

Reaching into the MCO deployment is also quite a hack, so I think we should have a comment about why we do it this way.
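The suggested refactor might look like the following sketch: the pull spec is computed once by the caller and passed down as a plain string. The types and names here are simplified stand-ins, not the PR's actual signatures.

```go
package main

import "fmt"

type operatorVersion struct{ Name, Version string }

type clusterOperator struct {
	Name     string
	Versions []operatorVersion
}

// mcoImageMismatch reports whether the machine-config operator still
// advertises an operator-image different from targetImagePullSpec. The
// target pull spec is passed in as a string instead of being re-derived
// from the MCO deployment on every iteration of the nested loops.
func mcoImageMismatch(operators []clusterOperator, targetImagePullSpec string) bool {
	for _, op := range operators {
		if op.Name != "machine-config" {
			continue
		}
		for _, v := range op.Versions {
			if v.Name == "operator-image" && v.Version != targetImagePullSpec {
				return true
			}
		}
	}
	return false
}

func main() {
	ops := []clusterOperator{{
		Name:     "machine-config",
		Versions: []operatorVersion{{Name: "operator-image", Version: "img:single-arch"}},
	}}
	// Computed once, up front (e.g. from the MCO deployment's container image).
	targetImagePullSpec := "img:multi-arch"
	fmt.Println(mcoImageMismatch(ops, targetImagePullSpec)) // true
}
```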


@petr-muller
Member

LGTM - I'm fine with dealing with my review in a followup; I think it is better to have the 4.18 oc with this than without it. The whole functionality is TechPreview and we touch no GA code, which warrants the ack label.

/label acknowledge-critical-fixes-only
/lgtm
/approve

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Nov 22, 2024
Contributor

openshift-ci bot commented Nov 22, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hongkailiu
Member Author

/hold cancel

The comments will be addressed with follow-up PRs.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 22, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 9e56829 into openshift:master Nov 22, 2024
12 of 14 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-cli
This PR has been included in build openshift-enterprise-cli-container-v4.18.0-202411230007.p0.g9e56829.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-tools
This PR has been included in build ose-tools-container-v4.18.0-202411230007.p0.g9e56829.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-deployer
This PR has been included in build openshift-enterprise-deployer-container-v4.18.0-202411230007.p0.g9e56829.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-cli-artifacts
This PR has been included in build ose-cli-artifacts-container-v4.18.0-202411230007.p0.g9e56829.assembly.stream.el9.
All builds following this will include this PR.
