
OTA-1393: status: recognize the process of migration to multi-arch #1920

Merged
merged 1 commit on Nov 22, 2024

Conversation

hongkailiu
Member

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 21, 2024
@openshift-ci-robot

openshift-ci-robot commented Nov 21, 2024

@hongkailiu: This pull request references OTA-1393 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

/cc @wking @petr-muller

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from petr-muller and wking November 21, 2024 02:56
@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 21, 2024
Member

@wking left a comment


One tiny nit, which you can lift the hold and merge unchanged if you like. Otherwise:

/lgtm

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 21, 2024
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 21, 2024
Member

@wking left a comment


/lgtm
/hold cancel

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Nov 21, 2024
Contributor

openshift-ci bot commented Nov 21, 2024

@hongkailiu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn 7aebefa link false /test okd-scos-e2e-aws-ovn
ci/prow/security 7aebefa link false /test security

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@petr-muller
Member

/hold

I'll need to think a bit about our target state. This PR is an improvement but there are still items that are confusing: estimate and node states, for example.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 21, 2024
@hongkailiu
Member Author

Testing result with 7aebefa

launch 4.18.0-ec.3 aws,amd64

$ OC_ENABLE_CMD_UPGRADE_STATUS=true ./oc adm upgrade status
The cluster is not updating.

$ ./oc adm upgrade --to-multi-arch
Requested update to multi cluster architecture

$ OC_ENABLE_CMD_UPGRADE_STATUS=true ./oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Update to 4.18.0-ec.3 successfully completed at 2024-11-21T15:43:55Z (duration: 39s)

All control plane nodes successfully updated to 4.18.0-ec.3

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Completed    100% (3/3)   3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                                         ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-114-196.us-east-2.compute.internal   Completed    Updated   4.18.0-ec.3   -     
ip-10-0-16-27.us-east-2.compute.internal     Completed    Updated   4.18.0-ec.3   -     
ip-10-0-19-218.us-east-2.compute.internal    Completed    Updated   4.18.0-ec.3   -     

= Update Health =
SINCE   LEVEL   IMPACT   MESSAGE
39s     Info    None     Update is proceeding well

### ONE OR TWO MINS LATER ...

$ OC_ENABLE_CMD_UPGRADE_STATUS=true ./oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.18.0-ec.3 (from 4.18.0-ec.3)
Completion:      97% (32 operators updated, 0 updating, 1 waiting)
Duration:        1m32s (Est. Time Remaining: 1h10m)
Operator Health: 33 Healthy

All control plane nodes successfully updated to 4.18.0-ec.3

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Completed    100% (3/3)   3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                                         ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-114-196.us-east-2.compute.internal   Completed    Updated   4.18.0-ec.3   -     
ip-10-0-16-27.us-east-2.compute.internal     Completed    Updated   4.18.0-ec.3   -     
ip-10-0-19-218.us-east-2.compute.internal    Completed    Updated   4.18.0-ec.3   -     

= Update Health =
SINCE   LEVEL   IMPACT   MESSAGE
1m32s   Info    None     Update is proceeding well

### After the upgrade is complete

$ OC_ENABLE_CMD_UPGRADE_STATUS=true ./oc adm upgrade status
The cluster is not updating.

$ ./oc adm upgrade                                         
Cluster version is 4.18.0-ec.3

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.18 (available channels: candidate-4.18)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.

@hongkailiu
Member Author

hongkailiu commented Nov 21, 2024

items that are confusing: estimate and node states, for example.

I guess you mean those two lines. Let me think about how I can make them better.

Duration:        6m55s (Est. Time Remaining: 1h4m)

All control plane nodes successfully updated to 4.18.0-ec.3

and "Worker Updated" too.


On one hand, some observations from my test:

  • All nodes (masters and workers) get rebooted during migration.
  • MCO does not claim the new image pull spec in co.status.versions[name="operator-image"] until all nodes get rebooted.

Ideally we should modify "Updated" for the nodes somehow. The node status depends on MCP's annotation "machineconfiguration.openshift.io/release-image-version" which exposes no image pull spec.

This makes the estimated time of the upgrade inaccurate, as the nodes' status is wrong.

On the other hand:
@wking has also made the point that "all nodes get rebooted" has nothing to do with "multi-arch readiness".

Maybe we should not display that information (node status and estimated time) for the migration to multi-arch?
To me, not showing it because "I do not know" is better than showing wrong information.

Then we can get them back when MCO implements another annotation like "machineconfiguration.openshift.io/operator-image".
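The limitation being discussed can be sketched in Go. This is an illustration only, not code from this PR: the helper names are hypothetical, and the second annotation is the one proposed above, which MCO does not expose today.

```go
package main

import "fmt"

// nodeUpdated sketches today's check: node status can only compare the
// release-image-version annotation, which exposes no image pull spec, so a
// multi-arch migration (same version, different image) looks "Updated"
// immediately.
func nodeUpdated(annotations map[string]string, targetVersion string) bool {
	return annotations["machineconfiguration.openshift.io/release-image-version"] == targetVersion
}

// nodeUpdatedWithImage sketches what becomes possible if MCO exposed a
// hypothetical "machineconfiguration.openshift.io/operator-image"
// annotation carrying the pull spec.
func nodeUpdatedWithImage(annotations map[string]string, targetVersion, targetImagePullSpec string) bool {
	return nodeUpdated(annotations, targetVersion) &&
		annotations["machineconfiguration.openshift.io/operator-image"] == targetImagePullSpec
}

func main() {
	ann := map[string]string{
		"machineconfiguration.openshift.io/release-image-version": "4.18.0-ec.3",
		"machineconfiguration.openshift.io/operator-image":        "img:single-arch",
	}
	// Same version, different image: the version-only check says updated...
	fmt.Println(nodeUpdated(ann, "4.18.0-ec.3")) // true
	// ...while an image-aware check would correctly say not yet.
	fmt.Println(nodeUpdatedWithImage(ann, "4.18.0-ec.3", "img:multi-arch")) // false
}
```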

@@ -0,0 +1,28 @@
= Control Plane =
Assessment: Progressing
Target Version: 4.18.0-ec.3 (from 4.18.0-ec.3)
@petr-muller commented Nov 22, 2024

This line is confusing (do we have the arch somewhere in CV to show in this case?)


Target Version: 4.18.0-ec.3 (from 4.18.0-ec.3)
Updating: machine-config
Completion: 97% (32 operators updated, 1 updating, 0 waiting)
Duration: 6m55s (Est. Time Remaining: 1h4m)
@petr-muller

Est. Time. will be very incorrect - it shows the baseline duration (either 1h12m or previous update duration) for the first 10 minutes, and then starts projecting - and the projection will be wild, because the 97% completion will be reached so fast.

This is the least important of my concerns though, but I wonder if we could find a simple way to special-case this (detect the situation and just show N/A).


Comment on lines +11 to +24
machine-config 6m5s - Working towards 4.18.0-ec.3

All control plane nodes successfully updated to 4.18.0-ec.3

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Completed    100% (3/3)   2 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
ip-10-0-22-179.us-east-2.compute.internal   Unavailable   Updated   4.18.0-ec.3   -     Node is marked unschedulable
ip-10-0-17-117.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -
ip-10-0-72-40.us-east-2.compute.internal    Completed     Updated   4.18.0-ec.3   -
@petr-muller

I'm concerned about the claim that everything is Completed (individual nodes, worker pool assessment, control plane nodes...)

What is our ambition for this wrt. OTA-1393? Will there be a followup to fix this or not?

I understand there may be reasons why it is hard to detect the transition for the individual node - then I think we need a card (OTA preferably) to fix this.

if operator.Name == "machine-config" {
	for _, version := range operator.Status.Versions {
		if version.Name == "operator-image" {
			if targetImagePullSpec := getMCOImagePullSpec(mcoDeployment); targetImagePullSpec != version.Version {
@petr-muller commented Nov 22, 2024

The targetImagePullSpec := getMCOImagePullSpec(mcoDeployment) can be computed just once; we do not need to do that inside every iteration of the nested for loops. So let's do it at the start of the function, and maybe even before that, and just pass the mcoImage in as a string, which significantly simplifies its signature.

Reaching into the MCO deployment is also quite a hack, so I think we should have a comment about why we do it this way.
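The suggested refactor might look like the following sketch: the pull spec is computed once by the caller and passed down as a plain string. The types and names here are simplified stand-ins, not the PR's actual signatures.

```go
package main

import "fmt"

type operatorVersion struct{ Name, Version string }

type clusterOperator struct {
	Name     string
	Versions []operatorVersion
}

// mcoImageMismatch reports whether the machine-config operator still
// advertises an operator-image different from targetImagePullSpec. The
// target pull spec is passed in as a string instead of being re-derived
// from the MCO deployment on every iteration of the nested loops.
func mcoImageMismatch(operators []clusterOperator, targetImagePullSpec string) bool {
	for _, op := range operators {
		if op.Name != "machine-config" {
			continue
		}
		for _, v := range op.Versions {
			if v.Name == "operator-image" && v.Version != targetImagePullSpec {
				return true
			}
		}
	}
	return false
}

func main() {
	ops := []clusterOperator{{
		Name:     "machine-config",
		Versions: []operatorVersion{{Name: "operator-image", Version: "img:single-arch"}},
	}}
	// Computed once, up front (e.g. from the MCO deployment's container image).
	targetImagePullSpec := "img:multi-arch"
	fmt.Println(mcoImageMismatch(ops, targetImagePullSpec)) // true
}
```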


@petr-muller
Member

LGTM - I'm fine with dealing with my review in a followup; I think it is better to have the 4.18 oc with this than without it. The whole functionality is TechPreview and we touch no GA code, which warrants the ack label.

/label acknowledge-critical-fixes-only
/lgtm
/approve

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Nov 22, 2024
Contributor

openshift-ci bot commented Nov 22, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hongkailiu
Member Author

/hold cancel

The comments will be addressed with follow-up PRs.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 22, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 9e56829 into openshift:master Nov 22, 2024
12 of 14 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-cli
This PR has been included in build openshift-enterprise-cli-container-v4.18.0-202411230007.p0.g9e56829.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-tools
This PR has been included in build ose-tools-container-v4.18.0-202411230007.p0.g9e56829.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-deployer
This PR has been included in build openshift-enterprise-deployer-container-v4.18.0-202411230007.p0.g9e56829.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-cli-artifacts
This PR has been included in build ose-cli-artifacts-container-v4.18.0-202411230007.p0.g9e56829.assembly.stream.el9.
All builds following this will include this PR.
