-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
external: donot stop rbd and cephfs creation if rgw end. #2150
base: main
Are you sure you want to change the base?
Conversation
@parth-gr: This pull request references Bugzilla bug 2213757, which is valid. No validations were run on this bugNo GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/review @iamniting @travisn |
/backport 4.13 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please change the commit to fit in the 73 char, Also please remove the bug id from the commit-msg, instead add it in the PR description. Having the bug id in the commit-msg creates confusion while backporting a PR.
We will need 2 bugs one for 4.14 and another one for 4.13. We can keep this one for 4.14 and clone it for 4.13.
// rgw-endpoint is no longer needed in the 'd.Data' dictionary, | ||
// and can be deleted | ||
// created an issue in rook to add `CephObjectStore` type directly in the JSON output | ||
// https://github.com/rook/rook/issues/6165 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// rgw-endpoint is no longer needed in the 'd.Data' dictionary, | |
// and can be deleted | |
// created an issue in rook to add `CephObjectStore` type directly in the JSON output | |
// https://github.com/rook/rook/issues/6165 | |
// rgw-endpoint is no longer needed in the 'd.Data' dictionary, | |
// and can be deleted | |
// created an issue in rook to add `CephObjectStore` type directly in the JSON output | |
// https://github.com/rook/rook/issues/6165 |
@parth-gr could look at this comment if it still valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes look kinda an improvement for the code
It would need changes in both rook and oc-op
@@ -354,20 +354,21 @@ func (r *StorageClusterReconciler) createExternalStorageClusterResources(instanc | |||
} else if d.Name == cephRgwStorageClassName { | |||
rgwEndpoint := d.Data[externalCephRgwEndpointKey] | |||
if err := checkEndpointReachable(rgwEndpoint, 5*time.Second); err != nil { | |||
r.Log.Error(err, "RGW endpoint is not reachable.", "RGWEndpoint", rgwEndpoint) | |||
return err | |||
r.Log.Error(err, "RGW endpoint is not reachable. Will not create objectstore and RGW Storage Class", "RGWEndpoint", rgwEndpoint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r.Log.Error(err, "RGW endpoint is not reachable. Will not create objectstore and RGW Storage Class", "RGWEndpoint", rgwEndpoint) | |
r.Log.Error(err, "RGW endpoint is not reachable. Will skip creating objectstore and RGW Storage Class", "RGWEndpoint", rgwEndpoint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
@subhamkrai: changing LGTM is restricted to collaborators In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: parth-gr The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@parth-gr: No Bugzilla bug is referenced in the title of this pull request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
64f3af5
to
70da3fc
Compare
Above one is valid for 4.14 once we have a backport PR I will clone the bug to 4.13 |
@@ -354,20 +354,21 @@ func (r *StorageClusterReconciler) createExternalStorageClusterResources(instanc | |||
} else if d.Name == cephRgwStorageClassName { | |||
rgwEndpoint := d.Data[externalCephRgwEndpointKey] | |||
if err := checkEndpointReachable(rgwEndpoint, 5*time.Second); err != nil { | |||
r.Log.Error(err, "RGW endpoint is not reachable.", "RGWEndpoint", rgwEndpoint) | |||
return err | |||
r.Log.Error(err, "RGW endpoint is not reachable. Will skip creating objectstore and RGW Storage Class", "RGWEndpoint", rgwEndpoint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the RGW required to be online to create the rgw storage class? If the endpoint is just down temporarily, this means the storage class won't be created as expected. Why not still continue creating the storage class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice question!
Storage class needs the cephobjectstorename
as a parameter,
Parameters: map[string]string{
"objectStoreNamespace": initData.Namespace,
"region": "us-east-1",
"objectStoreName": generateNameForCephObjectStore(initData),
},
As we are using the objectstorename to create the SC so I think that would be also needed later for the obc creation internallly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And as we don't send the error to the reconcile I am not sure if it will reconcile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if it's not reachable, the instance
and rgwEndpoint
parameters passed on line 360 are valid, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before we proceed with this change, let's finalize the discussion on the BZ
@@ -354,20 +354,21 @@ func (r *StorageClusterReconciler) createExternalStorageClusterResources(instanc | |||
} else if d.Name == cephRgwStorageClassName { | |||
rgwEndpoint := d.Data[externalCephRgwEndpointKey] | |||
if err := checkEndpointReachable(rgwEndpoint, 5*time.Second); err != nil { | |||
r.Log.Error(err, "RGW endpoint is not reachable.", "RGWEndpoint", rgwEndpoint) | |||
return err | |||
r.Log.Error(err, "RGW endpoint is not reachable. Will skip creating objectstore and RGW Storage Class", "RGWEndpoint", rgwEndpoint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if it's not reachable, the instance
and rgwEndpoint
parameters passed on line 360 are valid, right?
extCephObjectStores, err = r.newExternalCephObjectStoreInstances(instance, rgwEndpoint) | ||
if err != nil { | ||
return err | ||
if err == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
err
is actually defined up on line 278, so this is a problem. We want to use the err
defined on line 356, but that is a different scope. I would suggest using an else
on line 358 instead of checking err == nil
on this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this for now
@travisn I think the approach we discussed yesterday for returning the err while creation of rgw Storage class, ocs-operator/controllers/storagecluster/reconcile.go Lines 416 to 429 in 867fd8f
I am not sure why this is not implemented with a go routine for different resources, which would make this as a parallel execution, @malayparida2000 any thoughts |
return err | ||
} | ||
extCephObjectStores, err = r.newExternalCephObjectStoreInstances(instance, rgwEndpoint) | ||
err := checkEndpointReachable(rgwEndpoint, 5*time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we discussed to store the rgw error, and then fail the reconcile later around line 363?
err := checkEndpointReachable(rgwEndpoint, 5*time.Second) | |
rgwErr = checkEndpointReachable(rgwEndpoint, 5*time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes take a look #2150 (comment)
… is not reachable the rbd cephfs and rgw creation is tightly coupled so if the rgw endpoint is not reachable it doesn't create other teo sc as well Signed-off-by: parth-gr <[email protected]>
The goroutine idea sounds promising. But we have to investigate if the resource creation can actually happen parallelly or they are done serially for a reason.
To me option 3 seems to be the right way. |
The best way the operator should work if returning the error, Here the more pointers from Travis https://bugzilla.redhat.com/show_bug.cgi?id=2213757#c9 So basically we would be holding error messages in the |
To correctly address the root cause we have to design a correct resource flow map identifying correctly which resources are independent of each other and can be created simultaneously and which need to happen in order. Looking at the implications of such a big change I think it's right to move this out of 4.14 which Travis has done. We have to bring this up in some discussions where we can have a consensus over this. Probably the odf bi-weekly call will be a good opportunity for this. I will request @parth-gr to please bring this topic up there. |
Agreed, let's discuss on the approach. A couple thoughts before then...
|
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
… is not reachable
the rbd cephfs and rgw creation is tightly coupled so if the rgw endpoint is not reachable it
doesn't create other to sc as well
So instead of returning the error just log it
BZ: Bug 2213757