[bitnami/kuberay] Missing Cluster Role rules causes Ray Service to be in WaitForServeDeploymentReady #30648

Open
frivas-at-navteca opened this issue Nov 27, 2024 · 4 comments · May be fixed by #30665
Labels: kuberay, tech-issues (The user has a technical issue about an application), triage (Triage is needed)

Comments

@frivas-at-navteca

frivas-at-navteca commented Nov 27, 2024

Name and Version

bitnami/kuberay 1.2.19

What architecture are you using?

amd64

What steps will reproduce the bug?

This is my first issue, so I hope I can provide all the information required for a good understanding and troubleshooting. I might even be wrong about this, so please bear with me.

Context: The infrastructure is deployed from scratch using Terraform. All apps/services are up and running except one KubeRay worker (more details below). Using the Helm provider I deployed kuberay-operator with a few custom values (shown below), and I created a sample RayService using Terraform's kubectl provider to deploy the manifest (also shown below).

Kubernetes Cluster: AWS EKS
Helm:

  • Version: 3.16.2
  • Kuberay Chart: 1.2.19

Images:

  • Ray: bitnami/ray:2.38.0-debian-12-r2
  • Operator: bitnami/kuberay-operator:1.2.2-debian-12-r3

In this cluster I have deployed other apps/services using Bitnami's charts.

Deploy your Kubernetes cluster as usual. Use Helm to install Bitnami's kuberay chart. Deploy a RayService and check the kuberay-operator logs as well as the RayService status.
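
For reference, this is roughly the install step (a sketch assuming the chart is pulled from Bitnami's OCI registry; the release name, namespace and values file are placeholders):

$ helm install kuberay oci://registry-1.docker.io/bitnamicharts/kuberay \
    --namespace kuberay --create-namespace \
    --values values.yaml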

Are you using any custom parameters or values?

The reason I am adding the RBAC rules and the service account token is related to the apparent issue I am seeing. The reason I am adding RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV is that the kuberay-operator logs show a message saying that, because the variable was not set, a default value was being used. No big deal with that one; I am just explaining why I added it.

---
apiserver:
     enabled: false
cluster:
     enabled: false
operator:
     extraEnvVars:
          - name: RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV
            value: "300"
     rbac:
          rules:
               - apiGroups:
                    - ""
                 resources:
                    - endpoints
                 verbs:
                    - list
                    - watch
     serviceAccount:
          automountServiceAccountToken: true
livenessProbe:
     initialDelaySeconds: 300
     periodSeconds: 30
readinessProbe:
     initialDelaySeconds: 300
     periodSeconds: 30

The Ray Service I am using as an example is this one:

Note: I tried version 2.39.0 as well, just in case, but the results are the same. Since the Ray image used by Bitnami's kuberay-operator is 2.38 and it is advised to use the same version in custom images, I built my app image on 2.38.

---
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-fake-emails
  namespace: kuberay
spec:
  serviceUnhealthySecondThreshold: 900
  deploymentUnhealthySecondThreshold: 300
  serveConfigV2: |
    applications:
      - name: fake
        import_path: fake:app
        route_prefix: /
  rayClusterConfig:
    rayVersion: '2.38.0' # Should match Ray version in the containers
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          containers:
            - name: ray-head
              image: fjrivas/custom_ray:latest
              resources:
                limits:
                  cpu: 2
                  memory: 2Gi
                requests:
                  cpu: 2
                  memory: 2Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - replicas: 1
        minReplicas: 1
        maxReplicas: 2
        groupName: small-group
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: fjrivas/custom_ray:latest
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: "1"
                    memory: "2Gi"
                  requests:
                    cpu: "500m"
                    memory: "2Gi"

What is the expected behavior?

The RayService should be in Running state, with no error messages in the kuberay-operator logs:

$ kg rayservice -n kuberay
NAME                     SERVICE STATUS   NUM SERVE ENDPOINTS
rayservice-fake-emails   Running          1

What do you see instead?

When I check the RayService, it is stuck in WaitForServeDeploymentReady:

$ kg rayservice -n kuberay
NAME                     SERVICE STATUS                NUM SERVE ENDPOINTS
rayservice-fake-emails   WaitForServeDeploymentReady
$ kd rayservice -n kuberay rayservice-fake-emails
Name:         rayservice-fake-emails
Namespace:    kuberay
Labels:       <none>
Annotations:  <none>
API Version:  ray.io/v1
Kind:         RayService
Metadata:
  Creation Timestamp:  2024-11-27T12:40:06Z
  Generation:          1
  Resource Version:    5499
  UID:                 a9221370-e409-4943-b0e0-77e3ff693c49
Spec:
  Deployment Unhealthy Second Threshold:  300
  Ray Cluster Config:
    Head Group Spec:
      Ray Start Params:
        Dashboard - Host:  0.0.0.0
      Template:
        Spec:
          Containers:
            Image:  fjrivas/custom_ray:latest
            Name:   ray-head
            Ports:
              Container Port:  6379
              Name:            gcs-server
              Protocol:        TCP
              Container Port:  8265
              Name:            dashboard
              Protocol:        TCP
              Container Port:  10001
              Name:            client
              Protocol:        TCP
              Container Port:  8000
              Name:            serve
              Protocol:        TCP
            Resources:
              Limits:
                Cpu:     2
                Memory:  2Gi
              Requests:
                Cpu:     2
                Memory:  2Gi
    Ray Version:         2.38.0
    Worker Group Specs:
      Group Name:    small-group
      Max Replicas:  2
      Min Replicas:  1
      Num Of Hosts:  1
      Ray Start Params:
      Replicas:  1
      Template:
        Spec:
          Containers:
            Image:  fjrivas/custom_ray:latest
            Lifecycle:
              Pre Stop:
                Exec:
                  Command:
                    /bin/sh
                    -c
                    ray stop
            Name:  ray-worker
            Resources:
              Limits:
                Cpu:     1
                Memory:  2Gi
              Requests:
                Cpu:     500m
                Memory:  2Gi
  serveConfigV2:         applications:
  - name: fake
    import_path: fake:app
    route_prefix: /

  Service Unhealthy Second Threshold:  900
Status:
  Active Service Status:
    Ray Cluster Status:
      Desired CPU:              2500m
      Desired GPU:              0
      Desired Memory:           4Gi
      Desired TPU:              0
      Desired Worker Replicas:  1
      Endpoints:
        Client:        10001
        Dashboard:     8265
        Gcs - Server:  6379
        Metrics:       8080
        Serve:         8000
      Head:
        Pod IP:             10.1.36.167
        Pod Name:           rayservice-fake-emails-raycluster-dh9h2-head-6jq9d
        Service IP:         10.1.36.167
        Service Name:       rayservice-fake-emails-raycluster-dh9h2-head-svc
      Last Update Time:     2024-11-27T12:40:59Z
      Max Worker Replicas:  2
      Min Worker Replicas:  1
      Observed Generation:  1
  Observed Generation:      1
  Pending Service Status:
    Application Statuses:
      Fake:
        Health Last Update Time:  2024-11-27T12:41:27Z
        Serve Deployment Statuses:
          create_fake_email:
            Health Last Update Time:  2024-11-27T12:41:27Z
            Status:                   UPDATING
        Status:                       DEPLOYING
    Ray Cluster Name:                 rayservice-fake-emails-raycluster-dh9h2
    Ray Cluster Status:
      Desired CPU:     0
      Desired GPU:     0
      Desired Memory:  0
      Desired TPU:     0
      Head:
  Service Status:  WaitForServeDeploymentReady
Events:
  Type    Reason           Age                    From                   Message
  ----    ------           ----                   ----                   -------
  Normal  ServiceNotReady  7m6s (x25 over 7m54s)  rayservice-controller  The service is not ready yet. Controller will perform a round of actions in 2s.

I have read that it is normal for the worker pod to show 0/1; in fact, even under these conditions the app works.

$ kgpo -n kuberay
NAME                                                              READY   STATUS    RESTARTS   AGE
kuberay-operator-b5c75fd87-blwj6                                  1/1     Running   0          9m19s
rayservice-fake-emails-raycluster-dh9h2-head-6jq9d                1/1     Running   0          8m54s
rayservice-fake-emails-raycluster-dh9h2-small-grou-worker-l9s5w   0/1     Running   0          8m54s

I also see these in the kuberay-operator logs:

W1127 12:42:08.725162       1 reflector.go:539] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kuberay:kuberay-operator" cannot list resource "endpoints" in API group "" at the cluster scope
E1127 12:42:08.725465       1 reflector.go:147] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kuberay:kuberay-operator" cannot list resource "endpoints" in API group "" at the cluster scope
W1127 12:42:57.122692       1 reflector.go:539] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kuberay:kuberay-operator" cannot list resource "endpoints" in API group "" at the cluster scope
E1127 12:42:57.122732       1 reflector.go:147] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kuberay:kuberay-operator" cannot list resource "endpoints" in API group "" at the cluster scope
W1127 12:43:42.058024       1 reflector.go:539] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kuberay:kuberay-operator" cannot list resource "endpoints" in API group "" at the cluster scope
E1127 12:43:42.058075       1 reflector.go:147] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kuberay:kuberay-operator" cannot list resource "endpoints" in API group "" at the cluster scope
W1127 12:44:29.551260       1 reflector.go:539] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kuberay:kuberay-operator" cannot list resource "endpoints" in API group "" at the cluster scope
E1127 12:44:29.551308       1 reflector.go:147] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kuberay:kuberay-operator" cannot list resource "endpoints" in API group "" at the cluster scope
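
The denial can be confirmed directly with kubectl (a quick check against the service account shown in the log messages):

$ kubectl auth can-i list endpoints --as=system:serviceaccount:kuberay:kuberay-operator
no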

Additional information

Stuff that I have tried:

  • Adding the rbac rules to the custom values.yaml

When I edited the ClusterRole and added the endpoints resource, the RayService status changed to Running and the messages in the kuberay-operator log disappeared.

What I did was:

$ kubectl edit clusterrole kuberay-kuberay-operator -n kuberay
...
  - apiGroups:
      - ""
    resources:
      - endpoints
    verbs:
      - list
      - watch
 ...
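
After the edit, the same permission check passes (same service account as above):

$ kubectl auth can-i list endpoints --as=system:serviceaccount:kuberay:kuberay-operator
yes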

If this is in fact an issue, and not my mistake of adding these rules in the wrong place, the change would go in clusterrole.yaml, adding the missing resources and verbs.

I have forked the project and, if this is in fact something to be fixed, I am ready to create a PR with the solution described above.

I hope I am not missing anything.

Update 11/27/2024: I can see the required rules are in the Ray project chart helper

frivas-at-navteca added the tech-issues label Nov 27, 2024
github-actions bot added the triage label Nov 27, 2024
frivas-at-navteca added a commit to frivas-at-navteca/charts that referenced this issue Nov 27, 2024
@javsalgar
Contributor

javsalgar commented Nov 28, 2024

Hi!

Thank you so much for reporting. If I understood correctly, it seems that the RBAC rules may not be in sync with some changes in upstream. Would you like to submit a PR adding the missing rules?

@frivas-at-navteca
Author

Hello @javsalgar

Thank you very much! Sure, I will create the PR. I hope I do it the correct way.

frivas-at-navteca added a commit to frivas-at-navteca/charts that referenced this issue Nov 29, 2024
frivas-at-navteca added a commit to frivas-at-navteca/charts that referenced this issue Nov 29, 2024

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions bot added the stale label Dec 14, 2024
@frivas-at-navteca
Author

Hello dear Bitnami team, this issue is still open and there is a PR to solve it. #30665

github-actions bot removed the stale label Dec 17, 2024