
Improve conditions reporting #1605

Draft · wants to merge 10 commits into main from RHOAIENG-18216-available-condition

Conversation

lburgazzoli
Contributor

Description

Important

This PR is mainly intended to gather feedback on some work and decide how to move forward.

This work was intended to implement RHOAIENG-18216, i.e. to make reconciliation errors visible also in the status sub-resource of the various CRs, but it ended up being a larger chunk of work. If we decide to move forward with this approach, the PR must be split into smaller, incremental PRs.

I did a little bit of research on how conditions are reported by other operators and ended up taking a lot of inspiration from how the Knative operators handle conditions, so:

  • have a top-level, aggregating condition
  • define a set of dependent conditions that contribute to the state of the top-level one
  • have an option to define the severity of a condition (see the condition sketch below)
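
For reference, the conditions shown in the examples below carry roughly the following shape. The Go type is only a sketch derived from the YAML output further down (the PR's actual type is common.Condition, whose exact definition may differ):

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type ConditionSeverity string

const (
    // the default severity is Error and is represented by an empty value
    ConditionSeverityError ConditionSeverity = ""
    ConditionSeverityInfo  ConditionSeverity = "Info"
)

type Condition struct {
    Type               string                 `json:"type"`
    Status             metav1.ConditionStatus `json:"status"`
    Severity           ConditionSeverity      `json:"severity,omitempty"`
    Reason             string                 `json:"reason,omitempty"`
    Message            string                 `json:"message,omitempty"`
    LastTransitionTime metav1.Time            `json:"lastTransitionTime,omitempty"`
    ObservedGeneration int64                  `json:"observedGeneration,omitempty"`
}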

For every API that is implemented with the reconciler framework, we now have at least two conditions:

  • Ready, which reports the overall status of the resource
  • ProvisioningSucceeded, which reports any error that happens during reconciliation

If a component has additional conditions, it can declare them (if not, all the conditions will be taken into account).
As an example, the ModelRegistry component now has:

_, err := reconciler.ReconcilerFor(mgr, &componentApi.ModelRegistry{}).
    // ...
    WithConditions(
        status.ConditionDeploymentsAvailable,
        status.ConditionServerlessAvailable).
    Build(ctx)

This results in its status being populated with the following conditions:

apiVersion: components.platform.opendatahub.io/v1alpha1
kind: ModelRegistry
metadata:
  name: default-modelregistry
spec:
  registriesNamespace: odh-model-registries
status:
  conditions:
  - lastTransitionTime: "2025-02-03T13:10:32Z"
    message: 0/1 deployments ready
    reason: DeploymentsNotReady
    status: "False"
    type: Ready
  - lastTransitionTime: "2025-02-03T13:10:32Z"
    message: 0/1 deployments ready
    observedGeneration: 1
    reason: DeploymentsNotReady
    status: "False"
    type: DeploymentsAvailable
  - lastTransitionTime: "2025-02-03T12:55:45Z"
    observedGeneration: 1
    status: "True"
    type: ProvisioningSucceeded
  - lastTransitionTime: "2025-02-03T12:55:32Z"
    status: "True"
    type: ServerlessAvailable

And the Ready condition is reported as failing, because the DeploymentsAvailable condition is not satisfied.
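
To make the aggregation rule concrete, here is a simplified sketch of how the top-level Ready condition can be derived from its dependent conditions, using the Condition shape sketched above (this is not the PR's actual implementation):

// aggregateReady computes the top-level Ready condition: a dependent
// condition with the default (Error) severity that is not True marks the
// resource as not ready, while Info-severity conditions are reported but
// do not block readiness.
func aggregateReady(deps []Condition) Condition {
    ready := Condition{Type: "Ready", Status: metav1.ConditionTrue}

    for _, c := range deps {
        if c.Severity == ConditionSeverityInfo {
            continue // informational only, e.g. an explicitly disabled capability
        }
        if c.Status != metav1.ConditionTrue {
            ready.Status = metav1.ConditionFalse
            ready.Reason = c.Reason
            ready.Message = c.Message
            break
        }
    }

    return ready
}

In the ModelRegistry status above, DeploymentsAvailable is a declared dependent condition and is False with the default severity, so Ready becomes False with the same reason and message.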

The DataScienceCluster has also been refactored to use the reconciler framework and, as a consequence, behaves the same way: it has a top-level Ready condition, a ProvisioningSucceeded one, and a dedicated ComponentsReady condition that reports the overall status of the provisioned components (the individual components' conditions are still reported).

apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  # ...   
status:
  conditions:
  - lastTransitionTime: "2025-02-03T13:12:36Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2025-02-03T13:10:42Z"
    message: NoManagedComponents
    reason: NoManagedComponents
    severity: Info
    status: "True"
    type: ComponentsReady
  - lastTransitionTime: "2025-02-03T10:07:50Z"
    status: "True"
    type: ProvisioningSucceeded
  - lastTransitionTime: "2025-02-03T10:07:25Z"
    message: Component ManagementState is set to Removed
    reason: Removed
    status: "False"
    type: CodeFlareReady
  - ...

Note

In this case the Ready condition is reported as satisfied even though ComponentsReady is not. This is because the severity of ComponentsReady is marked as Info (the default severity is Error and is represented by an empty value).

The severity field can be useful to report specific states. As an example, the Kserve component reports the Ready condition as true even though ServerlessAvailable and ServiceMeshAvailable are not (in this case because the component is explicitly configured not to use serverless):

apiVersion: components.platform.opendatahub.io/v1alpha1
kind: Kserve
metadata:
  name: default-kserve
spec:
  # ...
status:
  conditions:
  - lastTransitionTime: "2025-02-03T13:20:06Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2025-02-03T13:20:06Z"
    status: "True"
    type: DeploymentsAvailable
  - lastTransitionTime: "2025-02-03T13:19:39Z"
    observedGeneration: 1
    status: "True"
    type: ProvisioningSucceeded
  - lastTransitionTime: "2025-02-03T13:19:13Z"
    message: 'Serving management state is set to: Removed'
    reason: Removed
    severity: Info
    status: "False"
    type: ServerlessAvailable
  - lastTransitionTime: "2025-02-03T13:19:13Z"
    message: 'Serving management state is set to: Removed'
    reason: Removed
    severity: Info
    status: "False"
    type: ServiceMeshAvailable

Important

As part of this work, some other changes have been made:

  • make the DataScienceCluster reconciler use the reconciler framework
  • improve the gc action to offer more configurable options and make it usable also for removing non-managed components
  • remove lastHeartBeatTime from the DSC's conditions

How Has This Been Tested?

Screenshot or short clip

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.


openshift-ci bot commented Feb 3, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci bot commented Feb 3, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from lburgazzoli. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 27.55418% with 702 lines in your changes missing coverage. Please review.

Project coverage is 22.07%. Comparing base (28c36c2) to head (0d5bf02).

Files with missing lines Patch % Lines
pkg/services/gc/gc.go 0.00% 74 Missing ⚠️
...cecluster/datasciencecluster_controller_actions.go 0.00% 73 Missing ⚠️
...cecluster/datasciencecluster_controller_support.go 0.00% 49 Missing ⚠️
pkg/controller/conditions/conditions.go 72.57% 40 Missing and 8 partials ⚠️
...ers/components/kserve/kserve_controller_actions.go 0.00% 44 Missing ⚠️
pkg/controller/reconciler/reconciler.go 56.38% 37 Missing and 4 partials ⚠️
pkg/cluster/resources.go 0.00% 33 Missing ⚠️
pkg/controller/conditions/conditions_support.go 63.63% 20 Missing and 8 partials ⚠️
.../modelregistry/modelregistry_controller_actions.go 0.00% 26 Missing ⚠️
...pelines/datasciencepipelines_controller_actions.go 0.00% 24 Missing ⚠️
... and 36 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1605      +/-   ##
==========================================
+ Coverage   20.28%   22.07%   +1.78%     
==========================================
  Files         163      167       +4     
  Lines       11137    11522     +385     
==========================================
+ Hits         2259     2543     +284     
- Misses       8638     8712      +74     
- Partials      240      267      +27     

☔ View full report in Codecov by Sentry.

@lburgazzoli
Contributor Author

/test all

return nil
}

slices.SortFunc(conditions, func(a, b common.Condition) int {
Contributor

No sorting in SetCondition?

Contributor Author

the sort here is probably a leftover of various iterations; I don't think it makes any sense here (and the sort done in SetCondition is probably not needed either)


openshift-ci bot commented Feb 6, 2025

@lburgazzoli: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/opendatahub-operator-e2e 1bc86a0 link true /test opendatahub-operator-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

instance.Status.InstalledComponents = make(map[string]bool)
}

err := computeComponentsStatus(ctx, rr.Client, instance, cr.DefaultRegistry())
Contributor

It can probably take the whole rr with .Conditions to access them via the Manager.

Contributor Author

indeed

notReadyComponents = append(notReadyComponents, component.GetName())
return err
}

Contributor

It is probably possible to create a new Manager for ci's conditions and use it for access.

Contributor Author

Yeah, I was also thinking of extending the manager to be able to extract conditions from other managers, so you can have a sort of hierarchy. But I didn't want to make it too complicated for a POC.

I'll take this into account while splitting this PR into smaller chunks.

if !ok {
return fmt.Errorf("resource instance %v is not a componentApi.DataSciencePipelines", rr.Instance)
}
rr.Conditions.MarkTrue(status.ConditionArgoWorkflowAvailable)
Contributor

Every such call modifies (and then re-modifies) the condition in the CR's status. Maybe some modification API for the Condition object could be created, and then rr.Conditions.SetCondition() called once?
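
A rough sketch of what such an option-based API could look like; every name and signature below is hypothetical and only meant to illustrate the idea of building the condition once and storing it with a single SetCondition call (Manager stands in for the PR's conditions manager, Condition for the sketch in the description above):

// ConditionOption mutates a Condition before it is stored.
type ConditionOption func(*Condition)

func WithReason(reason string) ConditionOption {
    return func(c *Condition) { c.Reason = reason }
}

func WithMessage(message string) ConditionOption {
    return func(c *Condition) { c.Message = message }
}

// MarkTrueWith builds the whole condition up front and writes it to the
// status once, instead of re-touching the stored condition on every call.
func (m *Manager) MarkTrueWith(conditionType string, opts ...ConditionOption) {
    c := Condition{Type: conditionType, Status: metav1.ConditionTrue}
    for _, o := range opts {
        o(&c)
    }
    m.SetCondition(c)
}

Usage would then be a single call per condition, e.g. rr.Conditions.MarkTrueWith(status.ConditionArgoWorkflowAvailable), with options added only where a reason or message is needed; again, purely illustrative.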

if newCondition.LastTransitionTime.IsZero() {
newCondition.LastTransitionTime = metav1.NewTime(time.Now())
}
*conditions = append(*conditions, newCondition)
Contributor

If we are modifying in place, maybe it makes sense to change ConditionsAccessor to return a pointer as well? Then fewer SetConditions calls would be needed. Otherwise it should probably take the slice (not a pointer) and return the modified one.

Contributor Author

At one point I did want to replace the explicit collection passing with a ConditionsAccessor instance, to make things less verbose and confusing.

// eventually carrying it from an old implementation
newCondition.LastHeartbeatTime = nil

existingCondition := FindStatusCondition(*conditions, newCondition.Type)
Contributor

It gives us a pointer to the condition itself, and then all the different fields are replaced one by one. Why not replace the whole structure?

I would probably even avoid the optimization and replace unconditionally. That would require splitting the Manager.SetCondition method into variants with and without happiness, to avoid infinite recursion.

Contributor Author

yeah, that code is a verbatim copy of the code shipped in the k8s.io/apimachinery/pkg/api/meta package; it must be refactored a little bit
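
For reference, a minimal sketch of the suggested simplification: replace the stored condition wholesale and preserve LastTransitionTime when the status has not changed, similar to what meta.SetStatusCondition in k8s.io/apimachinery does (the Condition type is the sketch from the description above, not the PR's exact code):

import (
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func setStatusCondition(conditions *[]Condition, newCondition Condition) {
    for i := range *conditions {
        existing := &(*conditions)[i]
        if existing.Type != newCondition.Type {
            continue
        }
        if existing.Status == newCondition.Status {
            // the status did not change, keep the original transition time
            newCondition.LastTransitionTime = existing.LastTransitionTime
        } else if newCondition.LastTransitionTime.IsZero() {
            newCondition.LastTransitionTime = metav1.NewTime(time.Now())
        }
        // replace the whole structure instead of copying field by field
        *existing = newCondition
        return
    }

    if newCondition.LastTransitionTime.IsZero() {
        newCondition.LastTransitionTime = metav1.NewTime(time.Now())
    }
    *conditions = append(*conditions, newCondition)
}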

lburgazzoli force-pushed the RHOAIENG-18216-available-condition branch from 4c0236a to 0d5bf02 on February 7, 2025 at 09:17
lburgazzoli force-pushed the RHOAIENG-18216-available-condition branch from 7657d4e to 1089d36 on February 7, 2025 at 17:10
@lburgazzoli
Contributor Author

@ykaliuta made some changes
