Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

external: donot stop rbd and cephfs creation if rgw end. #2150

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

parth-gr
Copy link
Member

@parth-gr parth-gr commented Aug 17, 2023

… is not reachable

the rbd cephfs and rgw creation is tightly coupled so if the rgw endpoint is not reachable it
doesn't create other to sc as well
So instead of returning the error just log it

BZ: Bug 2213757

@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Aug 17, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 17, 2023

@parth-gr: This pull request references Bugzilla bug 2213757, which is valid.

No validations were run on this bug

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.

In response to this:

Bug 2213757: external: donot stop rbd and cephfs creation if rgw end.…

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@parth-gr
Copy link
Member Author

parth-gr commented Aug 17, 2023

/review @iamniting @travisn

@parth-gr
Copy link
Member Author

parth-gr commented Aug 17, 2023

/backport 4.13

Copy link
Member

@iamniting iamniting left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please change the commit to fit in the 73 char, Also please remove the bug id from the commit-msg, instead add it in the PR description. Having the bug id in the commit-msg creates confusion while backporting a PR.

We will need 2 bugs one for 4.14 and another one for 4.13. We can keep this one for 4.14 and clone it for 4.13.

Comment on lines +364 to +367
// rgw-endpoint is no longer needed in the 'd.Data' dictionary,
// and can be deleted
// created an issue in rook to add `CephObjectStore` type directly in the JSON output
// https://github.com/rook/rook/issues/6165
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// rgw-endpoint is no longer needed in the 'd.Data' dictionary,
// and can be deleted
// created an issue in rook to add `CephObjectStore` type directly in the JSON output
// https://github.com/rook/rook/issues/6165
// rgw-endpoint is no longer needed in the 'd.Data' dictionary,
// and can be deleted
// created an issue in rook to add `CephObjectStore` type directly in the JSON output
// https://github.com/rook/rook/issues/6165

@parth-gr could look at this comment if it still valid?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes look kinda an improvement for the code

It would need changes in both rook and oc-op

@@ -354,20 +354,21 @@ func (r *StorageClusterReconciler) createExternalStorageClusterResources(instanc
} else if d.Name == cephRgwStorageClassName {
rgwEndpoint := d.Data[externalCephRgwEndpointKey]
if err := checkEndpointReachable(rgwEndpoint, 5*time.Second); err != nil {
r.Log.Error(err, "RGW endpoint is not reachable.", "RGWEndpoint", rgwEndpoint)
return err
r.Log.Error(err, "RGW endpoint is not reachable. Will not create objectstore and RGW Storage Class", "RGWEndpoint", rgwEndpoint)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
r.Log.Error(err, "RGW endpoint is not reachable. Will not create objectstore and RGW Storage Class", "RGWEndpoint", rgwEndpoint)
r.Log.Error(err, "RGW endpoint is not reachable. Will skip creating objectstore and RGW Storage Class", "RGWEndpoint", rgwEndpoint)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 17, 2023

@subhamkrai: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 17, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: parth-gr
Once this PR has been reviewed and has the lgtm label, please ask for approval from iamniting. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@parth-gr parth-gr requested a review from iamniting August 17, 2023 16:49
@parth-gr parth-gr changed the title Bug 2213757: external: donot stop rbd and cephfs creation if rgw end.… external: donot stop rbd and cephfs creation if rgw end. Aug 17, 2023
@openshift-ci openshift-ci bot removed bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Aug 17, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 17, 2023

@parth-gr: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

external: donot stop rbd and cephfs creation if rgw end.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@parth-gr parth-gr force-pushed the sc-fix-rgw branch 2 times, most recently from 64f3af5 to 70da3fc Compare August 17, 2023 16:51
@parth-gr
Copy link
Member Author

Can you please change the commit to fit in the 73 char, Also please remove the bug id from the commit-msg, instead add it in the PR description. Having the bug id in the commit-msg creates confusion while backporting a PR.

We will need 2 bugs one for 4.14 and another one for 4.13. We can keep this one for 4.14 and clone it for 4.13.

Above one is valid for 4.14 once we have a backport PR I will clone the bug to 4.13
Thanks

@parth-gr parth-gr requested a review from subhamkrai August 17, 2023 16:53
@@ -354,20 +354,21 @@ func (r *StorageClusterReconciler) createExternalStorageClusterResources(instanc
} else if d.Name == cephRgwStorageClassName {
rgwEndpoint := d.Data[externalCephRgwEndpointKey]
if err := checkEndpointReachable(rgwEndpoint, 5*time.Second); err != nil {
r.Log.Error(err, "RGW endpoint is not reachable.", "RGWEndpoint", rgwEndpoint)
return err
r.Log.Error(err, "RGW endpoint is not reachable. Will skip creating objectstore and RGW Storage Class", "RGWEndpoint", rgwEndpoint)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the RGW required to be online to create the rgw storage class? If the endpoint is just down temporarily, this means the storage class won't be created as expected. Why not still continue creating the storage class?

Copy link
Member Author

@parth-gr parth-gr Aug 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice question!
Storage class needs the cephobjectstorename as a parameter,

Parameters: map[string]string{
				"objectStoreNamespace": initData.Namespace,
				"region":               "us-east-1",
				"objectStoreName":      generateNameForCephObjectStore(initData),
			},

As we are using the objectstorename to create the SC so I think that would be also needed later for the obc creation internallly

Copy link
Member Author

@parth-gr parth-gr Aug 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And as we don't send the error to the reconcile I am not sure if it will reconcile

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if it's not reachable, the instance and rgwEndpoint parameters passed on line 360 are valid, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we proceed with this change, let's finalize the discussion on the BZ

@parth-gr parth-gr requested a review from travisn August 21, 2023 14:47
@@ -354,20 +354,21 @@ func (r *StorageClusterReconciler) createExternalStorageClusterResources(instanc
} else if d.Name == cephRgwStorageClassName {
rgwEndpoint := d.Data[externalCephRgwEndpointKey]
if err := checkEndpointReachable(rgwEndpoint, 5*time.Second); err != nil {
r.Log.Error(err, "RGW endpoint is not reachable.", "RGWEndpoint", rgwEndpoint)
return err
r.Log.Error(err, "RGW endpoint is not reachable. Will skip creating objectstore and RGW Storage Class", "RGWEndpoint", rgwEndpoint)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if it's not reachable, the instance and rgwEndpoint parameters passed on line 360 are valid, right?

extCephObjectStores, err = r.newExternalCephObjectStoreInstances(instance, rgwEndpoint)
if err != nil {
return err
if err == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

err is actually defined up on line 278, so this is a problem. We want to use the err defined on line 356, but that is a different scope. I would suggest using an else on line 358 instead of checking err == nil on this line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this for now

@parth-gr
Copy link
Member Author

parth-gr commented Aug 22, 2023

@travisn I think the approach we discussed yesterday for returning the err while creation of rgw Storage class,
Would be blocking this for loop which also creates other resources.

objs = []resourceManager{
&ocsExternalResources{},
&ocsStorageQuota{},
&ocsCephCluster{},
&ocsSnapshotClass{},
&ocsNoobaaSystem{},
&ocsClusterClaim{},
}
}
for _, obj := range objs {
returnRes, returnErr := obj.ensureCreated(r, instance)
if r.phase == statusutil.PhaseClusterExpanding {
message := "StorageCluster is expanding"

I am not sure why this is not implemented with a go routine for different resources, which would make this as a parallel execution, @malayparida2000 any thoughts

return err
}
extCephObjectStores, err = r.newExternalCephObjectStoreInstances(instance, rgwEndpoint)
err := checkEndpointReachable(rgwEndpoint, 5*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we discussed to store the rgw error, and then fail the reconcile later around line 363?

Suggested change
err := checkEndpointReachable(rgwEndpoint, 5*time.Second)
rgwErr = checkEndpointReachable(rgwEndpoint, 5*time.Second)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes take a look #2150 (comment)

… is not reachable

the rbd cephfs and rgw creation is tightly coupled
so if the rgw endpoint is not reachable it
doesn't create other teo sc as well

Signed-off-by: parth-gr <[email protected]>
@malayparida2000
Copy link
Contributor

malayparida2000 commented Aug 23, 2023

@travisn I think the approach we discussed yesterday for returning the err while creation of rgw Storage class, Would be blocking this for loop which also creates other resources.

objs = []resourceManager{
&ocsExternalResources{},
&ocsStorageQuota{},
&ocsCephCluster{},
&ocsSnapshotClass{},
&ocsNoobaaSystem{},
&ocsClusterClaim{},
}
}
for _, obj := range objs {
returnRes, returnErr := obj.ensureCreated(r, instance)
if r.phase == statusutil.PhaseClusterExpanding {
message := "StorageCluster is expanding"

I am not sure why this is not implemented with a go routine for different resources, which would make this as a parallel execution, @malayparida2000 any thoughts

The goroutine idea sounds promising. But we have to investigate if the resource creation can actually happen parallelly or they are done serially for a reason.
For now it seems like we have 3 options

  1. Return err/requeue and try on reconciling till rgw end point is reachable.
    (This will block the creation of other resources like the Cephcluster)
  2. Even if endpoint is not reachable create the storageclass anyway
    (This will create issues as even though the storage class is present it won't be able to provision)
  3. Don't return err or requeue just skip out the rgw storageclass/cephobjectstore if the endpoint is unreachable
    (Whenever there is another reconciliation {which we don't know when will happen}, it will check the status of the endpoint again. if it's available then create the objectstore & the rgw storageclass)

To me option 3 seems to be the right way.
@iamniting can you please also take a look here?

@parth-gr
Copy link
Member Author

parth-gr commented Aug 23, 2023

The best way the operator should work if returning the error,

Here the more pointers from Travis https://bugzilla.redhat.com/show_bug.cgi?id=2213757#c9
Which says we should re-work the creation of resources,
Either make them concurrent or at least not return the error till all the resources are called from the for loop,

So basically we would be holding error messages in the map and return the errors only after all the resources called ensure created

@malayparida2000
Copy link
Contributor

The best way the operator should work if returning the error,

Here the more pointers from Travis https://bugzilla.redhat.com/show_bug.cgi?id=2213757#c9 Which says we should re-work the creation of resources, Either make them concurrent or at least not return the error till all the resources are called from the for loop,

So basically we would be holding error messages in the map and return the errors only after all the resources called ensure created

To correctly address the root cause we have to design a correct resource flow map identifying correctly which resources are independent of each other and can be created simultaneously and which need to happen in order. Looking at the implications of such a big change I think it's right to move this out of 4.14 which Travis has done. We have to bring this up in some discussions where we can have a consensus over this. Probably the odf bi-weekly call will be a good opportunity for this. I will request @parth-gr to please bring this topic up there.

@travisn
Copy link
Contributor

travisn commented Aug 23, 2023

Agreed, let's discuss on the approach. A couple thoughts before then...

  • Go routines are best to avoid during reconcile for normal reconcile operations. The logging is harder to troubleshoot if multiple things are happening, and it also may require higher pod resource limits since the operator would burst more.
  • Being able to reconcile independent actions will be good, to allow collecting the errors and only failing after all actions have been attempted. Still, it is important to fail the reconcile and retry for the failures where resources could not be created.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 16, 2024
@openshift-merge-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants