Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Essential checks introduction for hooks and groups in recipe #1795

Merged
merged 1 commit into from
Mar 3, 2025

Conversation

asn1809
Copy link
Member

@asn1809 asn1809 commented Feb 3, 2025

When using recipe, hooks and group can be either essential or not. It is defined either as group.Essential or hook.Essential. It is used in combination with workflow.FailOn.

The supported values for FailOn are:

  1. "any-error" -- Meaning for any of the error seen in the workflow steps mentioned in sequence, fail the workflow.
  2. "essential-error" -- Meaning if hook or group is marked as essential and any of essential steps fail, then the workflow fails.
  3. "full-error" -- Meaning if all the essential hooks or groups in the sequence fail, then only workflow fails. If any one of the essential passes, then the overall workflow should be marked as successful.

Fixes #1762
Fixes #1760

@asn1809 asn1809 force-pushed the asn-hook-essential-fix branch 2 times, most recently from 0f931ff to c91198f Compare February 3, 2025 20:24
@asn1809 asn1809 force-pushed the asn-hook-essential-fix branch from c91198f to 7f4602b Compare February 4, 2025 05:42
@asn1809 asn1809 marked this pull request as ready for review February 4, 2025 05:47
@asn1809 asn1809 force-pushed the asn-hook-essential-fix branch 2 times, most recently from d94908e to 8ef861a Compare February 4, 2025 19:02
@asn1809 asn1809 force-pushed the asn-hook-essential-fix branch from 8ef861a to 29fadac Compare February 7, 2025 18:33
@raghavendra-talur raghavendra-talur force-pushed the asn-hook-essential-fix branch 4 times, most recently from cf4362b to ec06119 Compare February 17, 2025 18:57
// to determine if the cluster data is protected. If it is nil, then it is
// considered as success. So we should set it as false here and set it as
// true if protection is not required or if protection is successful.
v.kubeObjectsCaptureFailed("KubeObjectsCaptureNotStarted", "Kube objects capture has not started")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is poorly named for this purpose. you should probably call directly kubeObjectsCaptureStatusFalse instead.


if isEssentialStep {
// shows that at least one essential step has succeeded
allEssentialStepsFailed = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is disabled for all captureSteps. In other words, if isEssentialSteps is always true, line 295 seems redundant. I'm not sure what the code is trying to determine.

@@ -1198,7 +1204,7 @@ func (v *VRGInstance) processAsPrimary() ctrl.Result {
v.log.Info("Kube objects restore failed", "error", err)
v.errorConditionLogAndSet(err, "Failed to restore kube objects", setVRGKubeObjectsErrorCondition)

return v.updateVRGStatus(v.result)
return v.updateVRGConditionsAndStatus(v.result)
}

v.log.Info("Kube objects restored")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow the shouldRestoreClusterData pattern and update status here, before progressing with further processing (which includes a new backup and VRG upload conditionally).

As e2e reports an error where restore of k8s resources is triggered twice, at this point returning to update status to ensure no further processing is done would be better as suggested by @raghavendra-talur

Copy link
Member

@ShyamsundarR ShyamsundarR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conditional LGTM based on addressing the early return as mentioned.

@raghavendra-talur raghavendra-talur force-pushed the asn-hook-essential-fix branch 2 times, most recently from 9f32517 to 924cf98 Compare February 19, 2025 12:37
@nirs
Copy link
Member

nirs commented Feb 19, 2025

  1. "full-error" -- Meaning if all the essential hooks or groups in the sequence fail, then only workflow fails. If any one of the essential passes, then the overall workflow should be marked as successful.

I don't understand this

Copy link
Member

@nirs nirs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first commit is big and complicated and I don't fully understand it.

The second commit is smaller but I could not find the change described by the commit message. It looks like bunch of unrelated changes.

@@ -232,7 +232,7 @@ func backupDummyStatusProcessAndRestore(
restoreName string,
labels map[string]string,
) (*velero.Restore, error) {
backupStatusLog(backup, w.log)
backupStatusLog("restore", backup, w.log)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this related to "initialize kubeObjectProtected condition to false"?

@@ -47,6 +49,7 @@ func captureWorkflowDefault(vrg ramen.VolumeReplicationGroup, ramenConfig ramen.

captureSpecs := []kubeobjects.CaptureSpec{
{
Name: DefaultCaptureRecoverGroupName,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this related to "initialize kubeObjectProtected condition to false"?

@@ -210,52 +210,106 @@ func (v *VRGInstance) kubeObjectsCaptureStartOrResume(
captureNumber int64,
pathName, capturePathName, namePrefix, veleroNamespaceName string,
interval time.Duration,
labels map[string]string,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this removed?

We seems to create new labels here, but it is not clear way this change is needed, and how it is related to supporting the new essential checks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not related to essential checks but during the changes found that this refactoring would help reduce the arguments required as labels can be easily retrieved using util package using vrg instance.

allEssentialStepsFailed, err := v.executeCaptureSteps(result, pathName, capturePathName, namePrefix,
veleroNamespaceName, captureInProgressStatusUpdate, annotations, requests, log)
if err != nil {
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return silently on errors? is the error already logged?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of error, vrg status is updated. If you suggest additional log, will add.

annotations := map[string]string{vrgGenerationKey: strconv.FormatInt(generation, vrgGenerationNumberBase)}
labels := util.OwnerLabels(v.instance)

allEssentialStepsFailed, err := v.executeCaptureSteps(result, pathName, capturePathName, namePrefix,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do care about all essential steps failure and not about any essential failure? If something is essential than a failure should not be ignored.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentiality of steps (group/hook) depends on the workflow.FailOn and is expected work in combination.
If FailOn is "any-error", we don't care whether step is essential or not and fail immediately.

If FailOn is "essential-error", we look for essential step and if it fails, workflow is marked as failure. If all essential steps pass, workflow is treated as successful.

If FailOn is "full-error", we look for essential steps. Here, only if all the essential steps fail, workflow is treated as failure. In case any one essential passes, workflow is treated as successful.

result, cg, pathName, capturePathName, namePrefix, veleroNamespaceName,
captureInProgressStatusUpdate,
labels, annotations, requests, log,
)
requestsCompletedCount += loopCount
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loopCount is the number of requests completed by the call to v.kubeObjectsGroupCapture?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes


return
return false, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why return false here in all case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned here, in case failOn is "any-error", we should immediately stop the execution of the workflow.

@raghavendra-talur
Copy link
Member

@nirs Yes, I was trying to do too many things in this PR. Moved the unrelated commits to a new PR #1848 .

@asn1809 asn1809 force-pushed the asn-hook-essential-fix branch 2 times, most recently from a842042 to 9ee584b Compare March 3, 2025 18:22
@asn1809 asn1809 force-pushed the asn-hook-essential-fix branch from 9ee584b to a0658f1 Compare March 3, 2025 18:32
@raghavendra-talur raghavendra-talur merged commit 484d5eb into RamenDR:main Mar 3, 2025
23 checks passed
@asn1809 asn1809 deleted the asn-hook-essential-fix branch March 4, 2025 09:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Verify the behaviour of workflows/failOn Essential parameter in the hook needs to be handled
5 participants