external: donot stop rbd and cephfs creation if rgw end. #2150

parth-gr · 2023-08-17T15:00:38Z

… is not reachable

The RBD, CephFS, and RGW creation is tightly coupled, so if the RGW endpoint is not reachable,
the other two storage classes are not created either.
So instead of returning the error, just log it.

BZ: Bug 2213757

openshift-ci · 2023-08-17T15:00:53Z

@parth-gr: This pull request references Bugzilla bug 2213757, which is valid.

No validations were run on this bug

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.

In response to this:

Bug 2213757: external: donot stop rbd and cephfs creation if rgw end.…

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

parth-gr · 2023-08-17T15:02:08Z

/review @iamniting @travisn

parth-gr · 2023-08-17T15:02:44Z

/backport 4.13

iamniting

Can you please shorten the commit message to fit within 73 characters? Also, please remove the bug ID from the commit message and add it to the PR description instead. Having the bug ID in the commit message creates confusion while backporting a PR.

We will need two bugs: one for 4.14 and another for 4.13. We can keep this one for 4.14 and clone it for 4.13.

subhamkrai · 2023-08-17T16:18:53Z

controllers/storagecluster/external_resources.go

+					// rgw-endpoint is no longer needed in the 'd.Data' dictionary,
+					// and can be deleted
+					// created an issue in rook to add `CephObjectStore` type directly in the JSON output
+					// https://github.com/rook/rook/issues/6165


Suggested change

-// rgw-endpoint is no longer needed in the 'd.Data' dictionary,
-// and can be deleted
-// created an issue in rook to add `CephObjectStore` type directly in the JSON output
-// https://github.com/rook/rook/issues/6165
+// rgw-endpoint is no longer needed in the 'd.Data' dictionary,
+// and can be deleted
+// created an issue in rook to add `CephObjectStore` type directly in the JSON output
+// https://github.com/rook/rook/issues/6165

@parth-gr could you check whether this comment is still valid?

Yes, it looks like a worthwhile improvement to the code.

It would need changes in both rook and oc-op

subhamkrai · 2023-08-17T16:19:51Z

controllers/storagecluster/external_resources.go

@@ -354,20 +354,21 @@ func (r *StorageClusterReconciler) createExternalStorageClusterResources(instanc
 			} else if d.Name == cephRgwStorageClassName {
 				rgwEndpoint := d.Data[externalCephRgwEndpointKey]
 				if err := checkEndpointReachable(rgwEndpoint, 5*time.Second); err != nil {
-					r.Log.Error(err, "RGW endpoint is not reachable.", "RGWEndpoint", rgwEndpoint)
-					return err
+					r.Log.Error(err, "RGW endpoint is not reachable. Will not create objectstore and RGW Storage Class", "RGWEndpoint", rgwEndpoint)


Suggested change

-r.Log.Error(err, "RGW endpoint is not reachable. Will not create objectstore and RGW Storage Class", "RGWEndpoint", rgwEndpoint)
+r.Log.Error(err, "RGW endpoint is not reachable. Will skip creating objectstore and RGW Storage Class", "RGWEndpoint", rgwEndpoint)

openshift-ci · 2023-08-17T16:19:59Z

@subhamkrai: changing LGTM is restricted to collaborators


openshift-ci · 2023-08-17T16:20:02Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: parth-gr
Once this PR has been reviewed and has the lgtm label, please ask for approval from iamniting. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2023-08-17T16:50:41Z

@parth-gr: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

external: donot stop rbd and cephfs creation if rgw end.


parth-gr · 2023-08-17T16:52:58Z

Can you please shorten the commit message to fit within 73 characters? Also, please remove the bug ID from the commit message and add it to the PR description instead. Having the bug ID in the commit message creates confusion while backporting a PR.

We will need two bugs: one for 4.14 and another for 4.13. We can keep this one for 4.14 and clone it for 4.13.

The bug above is valid for 4.14. Once we have a backport PR, I will clone it for 4.13.
Thanks

travisn · 2023-08-17T20:13:48Z

controllers/storagecluster/external_resources.go

@@ -354,20 +354,21 @@ func (r *StorageClusterReconciler) createExternalStorageClusterResources(instanc
 			} else if d.Name == cephRgwStorageClassName {
 				rgwEndpoint := d.Data[externalCephRgwEndpointKey]
 				if err := checkEndpointReachable(rgwEndpoint, 5*time.Second); err != nil {
-					r.Log.Error(err, "RGW endpoint is not reachable.", "RGWEndpoint", rgwEndpoint)
-					return err
+					r.Log.Error(err, "RGW endpoint is not reachable. Will skip creating objectstore and RGW Storage Class", "RGWEndpoint", rgwEndpoint)


Is the RGW required to be online to create the rgw storage class? If the endpoint is just down temporarily, this means the storage class won't be created as expected. Why not still continue creating the storage class?

Nice question!
The storage class needs the CephObjectStore name as a parameter:

Parameters: map[string]string{
	"objectStoreNamespace": initData.Namespace,
	"region":               "us-east-1",
	"objectStoreName":      generateNameForCephObjectStore(initData),
},

Since we use the object store name to create the SC, I think it would also be needed later for the OBC creation internally.

And as we don't return the error to the reconcile, I am not sure whether it will reconcile again.

Even if it's not reachable, the instance and rgwEndpoint parameters passed on line 360 are valid, right?

Before we proceed with this change, let's finalize the discussion on the BZ

travisn · 2023-08-21T14:59:19Z

controllers/storagecluster/external_resources.go

@@ -354,20 +354,21 @@ func (r *StorageClusterReconciler) createExternalStorageClusterResources(instanc
 			} else if d.Name == cephRgwStorageClassName {
 				rgwEndpoint := d.Data[externalCephRgwEndpointKey]
 				if err := checkEndpointReachable(rgwEndpoint, 5*time.Second); err != nil {
-					r.Log.Error(err, "RGW endpoint is not reachable.", "RGWEndpoint", rgwEndpoint)
-					return err
+					r.Log.Error(err, "RGW endpoint is not reachable. Will skip creating objectstore and RGW Storage Class", "RGWEndpoint", rgwEndpoint)


Even if it's not reachable, the instance and rgwEndpoint parameters passed on line 360 are valid, right?

travisn · 2023-08-21T15:01:19Z

controllers/storagecluster/external_resources.go

-				extCephObjectStores, err = r.newExternalCephObjectStoreInstances(instance, rgwEndpoint)
-				if err != nil {
-					return err
+				if err == nil {


err is actually defined up on line 278, so this is a problem. We want to use the err defined on line 356, but that is a different scope. I would suggest using an else on line 358 instead of checking err == nil on this line.

Updated this for now
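
The scoping point above can be illustrated in isolation. This is a minimal sketch, not the operator's code: `checkReachable` and `createResources` are hypothetical names, and the string slice stands in for the real resources. It shows that an `err` declared in an `if` statement's initializer is visible only inside that `if`/`else` chain, so the `else` branch is the natural place for work that depends on the check having succeeded.

```go
package main

import (
	"errors"
	"fmt"
)

// checkReachable is a hypothetical stand-in for checkEndpointReachable.
func checkReachable(endpoint string) error {
	if endpoint == "" {
		return errors.New("endpoint is empty")
	}
	return nil
}

// createResources mimics the shape of the function under review: an outer
// err is declared early, and the reachability check declares its own err.
func createResources(endpoint string) ([]string, error) {
	var err error // outer err, analogous to the one declared much earlier
	created := []string{"rbd", "cephfs"}

	if err := checkReachable(endpoint); err != nil {
		// This err shadows the outer one and exists only in this if/else chain.
		fmt.Println("RGW endpoint is not reachable, skipping:", err)
	} else {
		// Reached only when the check above returned nil.
		created = append(created, "rgw")
	}

	// Here err refers to the outer variable again. Testing it for nil at this
	// point would say nothing about the reachability check.
	return created, err
}

func main() {
	created, err := createResources("")
	fmt.Println(created, err)
}
```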

parth-gr · 2023-08-22T12:35:00Z

@travisn I think the approach we discussed yesterday, returning the err when creating the RGW storage class,
would block this for loop, which also creates the other resources.

ocs-operator/controllers/storagecluster/reconcile.go

Lines 416 to 429 in 867fd8f

    
	objs = []resourceManager{
		&ocsExternalResources{},
		&ocsStorageQuota{},
		&ocsCephCluster{},
		&ocsSnapshotClass{},
		&ocsNoobaaSystem{},
		&ocsClusterClaim{},
	}
}
for _, obj := range objs {
	returnRes, returnErr := obj.ensureCreated(r, instance)
	if r.phase == statusutil.PhaseClusterExpanding {
		message := "StorageCluster is expanding"

I am not sure why this is not implemented with a goroutine per resource, which would make the creation run in parallel. @malayparida2000 any thoughts?

travisn · 2023-08-22T13:48:58Z

controllers/storagecluster/external_resources.go

-					return err
-				}
-				extCephObjectStores, err = r.newExternalCephObjectStoreInstances(instance, rgwEndpoint)
+				err := checkEndpointReachable(rgwEndpoint, 5*time.Second)


I thought we discussed to store the rgw error, and then fail the reconcile later around line 363?

Suggested change

err := checkEndpointReachable(rgwEndpoint, 5*time.Second)

rgwErr = checkEndpointReachable(rgwEndpoint, 5*time.Second)

Yes take a look #2150 (comment)

… is not reachable the rbd cephfs and rgw creation is tightly coupled so if the rgw endpoint is not reachable it doesn't create other teo sc as well Signed-off-by: parth-gr <[email protected]>

malayparida2000 · 2023-08-23T09:06:31Z

@travisn I think the approach we discussed yesterday, returning the err when creating the RGW storage class, would block this for loop, which also creates the other resources.

ocs-operator/controllers/storagecluster/reconcile.go

Lines 416 to 429 in 867fd8f

	objs = []resourceManager{
		&ocsExternalResources{},
		&ocsStorageQuota{},
		&ocsCephCluster{},
		&ocsSnapshotClass{},
		&ocsNoobaaSystem{},
		&ocsClusterClaim{},
	}
}
for _, obj := range objs {
	returnRes, returnErr := obj.ensureCreated(r, instance)
	if r.phase == statusutil.PhaseClusterExpanding {
		message := "StorageCluster is expanding"

I am not sure why this is not implemented with a goroutine per resource, which would make the creation run in parallel. @malayparida2000 any thoughts?

The goroutine idea sounds promising, but we have to investigate whether the resources can actually be created in parallel or whether they are created serially for a reason.
For now it seems we have 3 options:

1. Return err/requeue and keep reconciling until the RGW endpoint is reachable.
   (This will block the creation of other resources like the CephCluster.)
2. Create the storage class anyway, even if the endpoint is not reachable.
   (This will create issues: even though the storage class is present, it won't be able to provision.)
3. Don't return err or requeue; just skip the RGW storage class/CephObjectStore if the endpoint is unreachable.
   (On the next reconciliation, whose timing we don't know, it will check the endpoint again. If it is available, it will create the object store and the RGW storage class.)

To me option 3 seems to be the right way.
@iamniting can you please also take a look here?

parth-gr · 2023-08-23T09:36:04Z

The best way for the operator to work is to return the error.

Here are more pointers from Travis: https://bugzilla.redhat.com/show_bug.cgi?id=2213757#c9
They say we should rework the creation of resources:
either make them concurrent, or at least not return the error until ensureCreated has been called for all the resources in the for loop.

So basically we would hold the error messages in a map and return the errors only after ensureCreated has been called for all the resources.

malayparida2000 · 2023-08-23T11:45:25Z

The best way for the operator to work is to return the error.

Here are more pointers from Travis: https://bugzilla.redhat.com/show_bug.cgi?id=2213757#c9 They say we should rework the creation of resources: either make them concurrent, or at least not return the error until ensureCreated has been called for all the resources in the for loop.

So basically we would hold the error messages in a map and return the errors only after ensureCreated has been called for all the resources.

To correctly address the root cause we have to design a correct resource flow map identifying correctly which resources are independent of each other and can be created simultaneously and which need to happen in order. Looking at the implications of such a big change I think it's right to move this out of 4.14 which Travis has done. We have to bring this up in some discussions where we can have a consensus over this. Probably the odf bi-weekly call will be a good opportunity for this. I will request @parth-gr to please bring this topic up there.

travisn · 2023-08-23T18:24:08Z

Agreed, let's discuss on the approach. A couple thoughts before then...

Goroutines are best avoided during normal reconcile operations. The logging is harder to troubleshoot if multiple things are happening, and it also may require higher pod resource limits since the operator would burst more.
Being able to reconcile independent actions will be good, to allow collecting the errors and only failing after all actions have been attempted. Still, it is important to fail the reconcile and retry for the failures where resources could not be created.

openshift-merge-robot · 2024-05-16T00:17:04Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Aug 17, 2023

iamniting requested changes Aug 17, 2023


openshift-ci bot assigned iamniting Aug 17, 2023

subhamkrai suggested changes Aug 17, 2023


parth-gr force-pushed the sc-fix-rgw branch from 474f9e2 to ba4e3ad Compare August 17, 2023 16:49

parth-gr requested a review from iamniting August 17, 2023 16:49

parth-gr changed the title ~~Bug 2213757: external: donot stop rbd and cephfs creation if rgw end.…~~ external: donot stop rbd and cephfs creation if rgw end. Aug 17, 2023

openshift-ci bot removed bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Aug 17, 2023

parth-gr force-pushed the sc-fix-rgw branch 2 times, most recently from 64f3af5 to 70da3fc Compare August 17, 2023 16:51

parth-gr requested a review from subhamkrai August 17, 2023 16:53

travisn reviewed Aug 17, 2023


parth-gr requested a review from travisn August 21, 2023 14:47

travisn reviewed Aug 21, 2023


parth-gr force-pushed the sc-fix-rgw branch from 70da3fc to c0e4760 Compare August 22, 2023 12:39

parth-gr requested a review from travisn August 22, 2023 12:39

travisn reviewed Aug 22, 2023


Bug 2213757: external: donot stop rbd and cephfs creation if rgw end.…

25e0893

… is not reachable the rbd cephfs and rgw creation is tightly coupled so if the rgw endpoint is not reachable it doesn't create other teo sc as well Signed-off-by: parth-gr <[email protected]>

parth-gr force-pushed the sc-fix-rgw branch from c0e4760 to 25e0893 Compare August 22, 2023 14:34

parth-gr requested a review from travisn August 22, 2023 14:34

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 16, 2024
