BUG: dapr-control-plane OOMKilled when DaprInstance provisioned #135

Closed
ryorke1 opened this issue Apr 18, 2024 · 11 comments
Comments

@ryorke1

ryorke1 commented Apr 18, 2024

Expected Behavior

The dapr-control-plane pod should remain stable and have configurable resource limits and requests.

Current Behavior

The dapr-control-plane pod is continuously being OOMKilled as long as there is a DaprInstance created. If we remove the DaprInstance, the pod stabilizes. The dapr-control-plane pod does seem to survive long enough to deploy the DaprInstance pods and CRDs, but it takes a few OOMKills to complete. The pod continues to crash after that, but this doesn't seem to affect the Dapr components.

Possible Solution

  1. Increase the resource limits to 512Mi (Memory) and 1000m (CPU)
  2. Make the resource limits and request configurable

Steps to Reproduce

  1. Uninstall any previous version of Dapr Operator (including cleaning up all CRDs and CRs)
  2. Install Dapr Operator 0.0.8 (at this point the dapr-control-plane will start and is stable)
  3. Create a new DaprInstance with the following configuration (see below)
  4. Monitor the pods and watch the dapr-control-plane pod get OOMKilled
# DaprInstance 
apiVersion: operator.dapr.io/v1alpha1
kind: DaprInstance
metadata:
  name: dapr-instance
  namespace: openshift-operators
spec:
  values:
    dapr_operator:
      livenessProbe:
        initialDelaySeconds: 10
      readinessProbe:
        initialDelaySeconds: 10
    dapr_placement:
      cluster:
        forceInMemoryLog: true
    global:
      imagePullSecrets: dapr-pull-secret
      registry: internal-repo/daprio
  chart:
    version: 1.13.2

Environment

OpenShift: Red Hat OpenShift Container Platform 4.12
Dapr Operator: 0.0.8 with 1.13.2 Dapr components

@lburgazzoli
Collaborator

To change the resource requests and limits, the only option is to tweak the subscription: #77 (comment)

Unfortunately the memory cannot be made configurable, but I will dig into the memory consumption.

Do you have a way to reproduce it? I have never experienced such behavior.

@ryorke1
Author

ryorke1 commented Apr 18, 2024

All we did was execute the steps above and that reproduced it. I don't think the dapr-control-plane would be affected by any existing pods that had dapr annotations for sidecar injection but maybe you can correct me if I am wrong. We did have a number of pods running that had the annotations during the initialization of the DaprInstance.

Do you have an example of how we could use the subscription to tweak the requests and limits in the context of the dapr-control-plane? Or am I mistaken about what you mean?

@lburgazzoli
Collaborator

> All we did was execute the steps above and that reproduced it. I don't think the dapr-control-plane would be affected by any existing pods that had dapr annotations for sidecar injection but maybe you can correct me if I am wrong. We did have a number of pods running that had the annotations during the initialization of the DaprInstance.

It should not, as what is affected is the dapr-operator and the other resources; the dapr-control-plane only generates the manifests. Maybe the watcher watches too many objects. I'll have a look.

> Do you have an example of how we could use the subscription to tweak the requests and limits in the context of the dapr-control-plane? Or am I mistaken about what you mean?

No, I don't, but there are a number of examples in the documentation mentioned in the linked comment.
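
For what it's worth, a minimal sketch of that kind of tweak, assuming the operator was installed through OLM with a Subscription named dapr-kubernetes-operator in the openshift-operators namespace (adjust both to your installation). OLM propagates spec.config.resources to the operator Deployment:

# Hypothetical example: patch the Subscription so OLM applies custom
# requests/limits to the operator Deployment managed by the CSV.
kubectl patch subscriptions.operators.coreos.com dapr-kubernetes-operator \
  -n openshift-operators \
  --type merge \
  -p '{"spec":{"config":{"resources":{"requests":{"cpu":"250m","memory":"256Mi"},"limits":{"cpu":"1","memory":"512Mi"}}}}}'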

@lburgazzoli
Collaborator

lburgazzoli commented Apr 19, 2024

I've tried to reproduce the issue, but I failed. What I did:

  • delete any trace of the dapr-kubernetes-operator
  • reinstall the operator
  • deploy a DaprInstance resource similar to the one you provided (except for the registry)

But the operator works as expected and does not get OOMKilled:

➜ k get pods -l control-plane=dapr-control-plane -w
NAME                                  READY   STATUS    RESTARTS   AGE
dapr-control-plane-7796c9ff85-htk4g   1/1     Running   0          2m49s
➜ k top pod dapr-control-plane-7796c9ff85-htk4g    
NAME                                  CPU(cores)   MEMORY(bytes)   
dapr-control-plane-7796c9ff85-htk4g   7m           68Mi           

I don't have any Dapr application running, so it is not 100% the same test, but as far as the dapr-kubernetes-operator is concerned, it should not matter.

@ryorke1
Author

ryorke1 commented Apr 19, 2024

OK, we are going to look into OLM and see if we can adjust the resources of the dapr-control-plane. While we are doing that, I am curious to know whether the dapr-control-plane being killed will cause any issues. In our case, so far we do see the components in place and the CRDs were deployed (permission issues still exist, #136), and we are using the Dapr components without issues so far. What are your thoughts on this?

@ryorke1
Author

ryorke1 commented Apr 19, 2024

Also, I was finally able to capture a screenshot of this crash (it goes to OOMKilled and then immediately into CrashLoopBackOff, so it is hard to capture).

(screenshot: dapr-control-plane pod status showing OOMKilled / CrashLoopBackOff)
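
A sketch of how the same information can be pulled from the pod status without catching the screen at the right moment (the pod name is a placeholder, take it from kubectl get pods):

# Last termination state of the dapr-control-plane container; the reason
# field should read "OOMKilled". <pod-name> is a placeholder.
kubectl -n openshift-operators get pod <pod-name> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'

# Restart count and the events around the kill
kubectl -n openshift-operators describe pod <pod-name>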

@ryorke1
Author

ryorke1 commented Apr 19, 2024

Some logs from OpenShift as well

(screenshot: OpenShift pod logs)
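
In case a text version is easier to work with than the screenshot, a sketch assuming the Deployment is named dapr-control-plane (as the pod names suggest):

# Logs of the previous (OOMKilled) container instance as plain text
kubectl -n openshift-operators logs deployment/dapr-control-plane --previous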

@lburgazzoli
Collaborator

> OK, we are going to look into OLM and see if we can adjust the resources of the dapr-control-plane. While we are doing that, I am curious to know whether the dapr-control-plane being killed will cause any issues. In our case, so far we do see the components in place and the CRDs were deployed (permission issues still exist, #136), and we are using the Dapr components without issues so far. What are your thoughts on this?

It should not cause any issue, as the role of the operator is just to set up Dapr and make sure the setup stays in sync with the DaprInstance spec.

@lburgazzoli
Collaborator

> Some logs from OpenShift as well

Are you able to provide a reproducer? Deploying a DaprInstance similar to yours does not trigger the OOM killer in my environment, so I need something closer to your setup to dig into it further.

@ryorke1
Author

ryorke1 commented Apr 22, 2024

Hi @lburgazzoli. Using Subscriptions in OLM, we were able to stabilize the dapr-control-plane pod. Here is the Subscription we used, for future reference in case others run into this issue.

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/dapr-kubernetes-operator.openshift-operators: ""
  name: dapr-kubernetes-operator
  namespace: openshift-operators
spec:
  channel: alpha
  config:
    resources:
      limits:
        cpu: "1"
        memory: 512Mi
      requests:
        cpu: 250m
        memory: 256Mi
  installPlanApproval: Manual
  name: dapr-kubernetes-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: dapr-kubernetes-operator.v0.0.8
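
A quick way to confirm the override actually reached the operator (a sketch; the Deployment name is inferred from the pod names shown earlier and may differ):

# Verify that OLM propagated the Subscription's resources to the Deployment
kubectl -n openshift-operators get deployment dapr-control-plane \
  -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'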

As a side note, this did not resolve the propagation to the roles. We still need an admin to manually create roles for us to be able to use these CRDs.

@ryorke1 ryorke1 closed this as completed Apr 22, 2024
@lburgazzoli
Collaborator

@ryorke1 I would really love to be able to reproduce this so I can fix the real problem (which may just be a matter of increasing the memory), so if at any point you have some sort of reproducer, please let me know.
