
Operator takes down druid cluster upon re-creation #170

Closed
mariuskimmina opened this issue Jun 14, 2024 · 7 comments

@mariuskimmina
Contributor

mariuskimmina commented Jun 14, 2024

We currently have the druid-operator and the druid cluster in the same namespace.
We'll soon add a second druid cluster, and to keep things cleaner we would like to move the operator to its own namespace, but while we were testing this we ran into a couple of issues.

First, here is how we imagined this should work:

  1. Take down the current operator
  2. All CRDs remain and the druid cluster stays alive
  3. Bring up the operator again in the new namespace
  4. The operator picks up the existing CRDs and ClusterRoles and things keep working
  5. No clusters get destroyed

Steps 1 and 2 worked as expected: we could bring down the operator without affecting our existing druid cluster.
On step 3 we faced a couple of issues. First, the operator's chart did not support Helm's --skip-crds flag, which prevented the new operator from coming up while the CRD already existed; a fix for this already got merged here

A similar issue then occurs for ClusterRoles; we fixed this in our local chart by adding an option to skip the ClusterRoles as well, and of course we can also open a PR for this.
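For reference, here is roughly how we wire both options up from Terraform. skip_crds is the Terraform Helm provider's equivalent of --skip-crds; rbac.clusterRoleEnabled is just a made-up name for the option we added to our local chart, so it may differ from whatever the upstream chart ends up exposing:

resource "helm_release" "druid_operator" {
  name       = "druid-operator"
  repository = "https://charts.datainfra.io"
  chart      = "druid-operator"
  namespace  = var.namespace

  # Provider-side equivalent of `helm install --skip-crds`:
  # leave the already-existing Druid CRD untouched.
  skip_crds = true

  # Hypothetical chart value for our local change that skips rendering
  # the ClusterRole / ClusterRoleBinding templates.
  set {
    name  = "rbac.clusterRoleEnabled"
    value = "false"
  }
}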

Now, with both of the above in place, we were able to bring up the druid operator in its own namespace, but once the operator was up it somehow removed the existing CRD, and because of the owner dependency the whole druid cluster was gone with it.
We have yet to figure out why exactly the CRD got removed. It also did not create a new one; we were left with a running operator but no druid cluster.

We saw the following events in our Kubernetes cluster which seem to be related:

druid             11m         Normal    DruidOperatorUpdateSuccess        druid/sb2                                                                        Updated [sb2:*v1alpha1.Druid].
druid             11m         Normal    DruidOperatorDeleteSuccess        druid/sb2                                                                        Successfully deleted object [data-volume-druid-sb2-middlemanagers-2:*v1.PersistentVolumeClaim] in namespace [druid]

That said, we haven't yet found the root cause of the operator removing the CRD.

@AdheipSingh
Contributor

AdheipSingh commented Jun 14, 2024

The operator will not mess up any current state unless something changes in the desired state, i.e. the CR.

  • How do you apply configurations? Using Helm?
  • In any of your steps, did you uninstall the operator Helm chart?
  • If CRDs are getting deleted, the CRs will get removed.
  • The operator won't delete an object until the CR is marked for deletion.

I faced a similar issue long back, as I had uninstalled the operator chart. But then we did add --keep-crds, which worked fine.

Did you mark the CR for deletion, or did Helm do it? I'm curious. Also, is your operator running cluster-scoped?

@mariuskimmina
Contributor Author

We are using Helm via Terraform:

resource "helm_release" "druid_operator" {
  name       = "druid-operator"
  repository = "https://charts.datainfra.io"
  chart      = "druid-operator"
  namespace  = var.namespace
  version    = var.operator_chart_version

  set {
    name  = "resources.requests.cpu"
    value = var.operator_cpu_request
  }

  set {
    name  = "resources.requests.memory"
    value = var.operator_memory_request
  }

  set {
    name  = "resources.limits.memory"
    value = var.operator_memory_limit
  }
  set {
    name  = "env.WATCH_NAMESPACE"
    value = var.watch_namespace
  }
}

We are using watch_namespace to limit the operator to only the 2 namespaces we will have druid clusters in.
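For illustration, with two clusters the WATCH_NAMESPACE value ends up as a comma-separated list, roughly like the block below. The namespace names are made up, and the escaped comma ("\\,") is our understanding of what the Helm provider needs, since it parses set values the same way helm --set does:

set {
  name  = "env.WATCH_NAMESPACE"
  # Two hypothetical namespaces; "\\," escapes the comma so the value is
  # not split into two separate --set entries.
  value = "druid\\,druid-analytics"
}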

We first took down the current operator in the druid namespace by applying our druid module again, which doesn't contain the operator anymore:

terraform apply --target module.druid

This removed the whole operator Helm chart.
We then re-created the operator by applying its new module:

terraform apply --target module.druid_operator

This brought up the new operator successfully as described above.
Only once the new operator was running did the existing CRD get removed (which resulted in the CR being removed as well).

@AdheipSingh
Contributor

I am pretty sure tf is re-creating the CRDs. Try to use the --keep-crds flag; not sure where to add it in tf.

@mariuskimmina
Contributor Author

I don't think that's the case. See, we ran into #169 before, so we had a case where the druid operator was applied successfully and Terraform was done, but the operator was unable to start because of the exec format error.
While the operator was unable to start, the CRD and cluster were still there. Only once the operator started running did the CRD and cluster get removed.

@AdheipSingh
Contributor

For such issues, I'll suggest getting connected here: https://calendly.com/adheip-singh/30-min-meeting?month=2024-06.

I am confused by the terminology being used. You mentioned:

While the operator was unable to start, the CRD and cluster were still there. Only once the operator started running did the CRD and cluster get removed.

  • The operator does not initiate any CRD or CR. It's a controller, which watches certain CRDs and CRs and reconciles them.
  • The operator won't remove any CRD; it can only remove a CR.
  • The operator can only remove a CR if its deletion timestamp is set.

Please look into:

  • who is responsible for applying configurations
  • who is responsible for reconciling configurations

In your case it's tf > helm for applying and the operator for reconciling. There is an abstraction between the two points mentioned above.
If you send in a bad config, the operator will reconcile it. So I'll suggest looking into tf and what config it is applying.

Please note the operator performs lookups for the CRD; it does not perform any lookups for the CR. Applying config for a CR is an entirely event-driven mechanism. The operator won't delete any CR until a deletion timestamp is set, and the operator will never delete a CRD. The way you are applying configurations to the CRD and CR is something to be looked into.

Regarding issue #169, once I get time I will push an amd64 image.

@mariuskimmina
Contributor Author

Small update on this before our meeting later:

We found that both the CRD and the CR do have a deletionTimestamp set after removing the operator's Helm chart. That said, the actual deletion only happens once the new (re-created) druid-operator starts running. We do have keep-crds set to true (I don't think this does anything, though; the flag isn't used anywhere, and Helm 3 doesn't automatically delete CRDs the way Helm 2 did).
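For completeness, one way to stop Helm from deleting the CRD on uninstall would be the helm.sh/resource-policy: keep annotation; a rough, untested sketch via the Terraform Kubernetes provider's kubernetes_annotations resource (the CRD name is assumed):

# Keep the Druid CRD even if the Helm release that created it is
# uninstalled; Helm skips deleting resources carrying this annotation.
resource "kubernetes_annotations" "keep_druid_crd" {
  api_version = "apiextensions.k8s.io/v1"
  kind        = "CustomResourceDefinition"
  metadata {
    name = "druids.druid.apache.org" # assumed CRD name
  }
  annotations = {
    "helm.sh/resource-policy" = "keep"
  }
  # Take ownership of the annotation field even though Helm manages the object.
  force = true
}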

@TessaIO
Collaborator

TessaIO commented Jun 24, 2024

@AdheipSingh we just tested the scenario we discussed in the meeting, and Helm was the one responsible for deleting the CRD.
In fact, according to helm/helm#7279 (comment), if a Helm release was deployed with the CRD under the templates folder, then when you uninstall the release, Helm will try to delete the CRD. It should be noted that we installed the Helm chart before #162 was merged, so what happened makes sense.
Thanks for your time again.
