node tuning: failed to list *v1.Job: Unauthorized #2287

adnankobir · 2024-12-19T16:35:22Z

What happened?

It appears that the service account tokens used by the cluster-node-setup daemonset are not refreshed after a certain period of time (106 days in my case). The logs show:

I1219 16:21:32.848791       1 cache/reflector.go:325] Listing and watching *v1.Pod from k8s.io/[email protected]/tools/cache/reflector.go:229
W1219 16:21:32.907213       1 cache/reflector.go:539] k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Pod: Unauthorized
E1219 16:21:32.907247       1 cache/reflector.go:147] k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Pod: failed to list *v1.Pod: Unauthorized
I1219 16:21:37.125898       1 cache/reflector.go:325] Listing and watching *v1.DaemonSet from k8s.io/[email protected]/tools/cache/reflector.go:229
W1219 16:21:37.134328       1 cache/reflector.go:539] k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.DaemonSet: Unauthorized
E1219 16:21:37.134376       1 cache/reflector.go:147] k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.DaemonSet: failed to list *v1.DaemonSet: Unauthorized
I1219 16:21:37.198647       1 cache/reflector.go:325] Listing and watching *v1.Job from k8s.io/[email protected]/tools/cache/reflector.go:229
W1219 16:21:37.207259       1 cache/reflector.go:539] k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Job: Unauthorized
E1219 16:21:37.207289       1 cache/reflector.go:147] k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Job: failed to list *v1.Job: Unauthorized

I verified that the RBAC is set up correctly. A simple restart of the daemonset resolves the issue.
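Since a restart fixes the problem, one way to check whether the process is holding on to an expired token is to read the `exp` claim of the token mounted in the pod. The following is a minimal diagnostic sketch, not part of the operator: `tokenExpiry` is a hypothetical helper, the path is the default projected service account token location, and the signature is deliberately not verified.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"strings"
	"time"
)

// tokenExpiry decodes the payload of a JWT without verifying its signature
// and returns the time stored in its "exp" claim. Diagnostics only.
func tokenExpiry(token string) (time.Time, error) {
	parts := strings.Split(strings.TrimSpace(token), ".")
	if len(parts) != 3 {
		return time.Time{}, errors.New("token is not a three-part JWT")
	}
	// JWT segments are base64url-encoded without padding.
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return time.Time{}, fmt.Errorf("decoding payload: %w", err)
	}
	var claims struct {
		Exp float64 `json:"exp"`
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return time.Time{}, fmt.Errorf("parsing claims: %w", err)
	}
	if claims.Exp == 0 {
		return time.Time{}, errors.New("token has no exp claim")
	}
	return time.Unix(int64(claims.Exp), 0).UTC(), nil
}

func main() {
	// Assumption: the default mount path of the service account token in a pod.
	const tokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"

	raw, err := os.ReadFile(tokenPath)
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading token:", err)
		os.Exit(1)
	}
	exp, err := tokenExpiry(string(raw))
	if err != nil {
		fmt.Fprintln(os.Stderr, "inspecting token:", err)
		os.Exit(1)
	}
	fmt.Printf("token on disk expires at %s (expired: %t)\n",
		exp.Format(time.RFC3339), time.Now().After(exp))
}
```

If the token on disk is still valid while the running process keeps getting `Unauthorized`, that would suggest the process read the token once at startup and never reloaded it.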

This is problematic because Scylla nodes will fail to start up, as the associated nodeconfig ConfigMap will remain blocked:

❯ k get cm -n scylla-aud-events  nodeconfig-podinfo-5e0c3810-ec68-46fb-9ed2-6ef7a8c5daa4 -o yaml
apiVersion: v1
data:
  ScyllaRuntimeConfig: '{"containerID":"containerd://645a7b7cfcb354927eff327cce12296bd32f2720412b8365e3ca69ab8e17fec7","matchingNodeConfigs":["cluster"],"blockingNodeConfigs":["cluster"]}'

What did you expect to happen?

The cluster-node-setup daemonset pods should refresh their tokens correctly so that they can keep querying the Kubernetes API.

How can we reproduce it (as minimally and precisely as possible)?

Deploy a NodeConfig CR, e.g.:

apiVersion: scylla.scylladb.com/v1alpha1
kind: NodeConfig
metadata:
  name: cluster
spec:
  placement:
    nodeSelector:
      scylla.scylladb.com/node-type: scylla
    tolerations:
    - effect: NoSchedule
      key: role
      operator: Equal
      value: scylla

Leave it running for 100+ days.

Scylla Operator version

scylla-operator:1.13

Kubernetes platform name and version

❯ kubectl version
Client Version: v1.31.0
Kustomize Version: v5.4.2
Server Version: v1.29.10-eks-7f9249a

Kubernetes platform info:
EKS

Please attach the must-gather archive.

NA

Anything else we need to know?

No response


tnozicka · 2024-12-20T08:38:40Z

Please attach the must-gather archive.

NA

The must-gather archive is a **mandatory** part of every bug report.
      See https://operator.docs.scylladb.com/stable/support/must-gather.html to learn how you can collect it.
      Do not edit the collected must-gather.

https://github.com/scylladb/scylla-operator/blob/b6e2ed7/.github/ISSUE_TEMPLATE/bug-report.yaml?plain=1#L57-L59

tnozicka · 2024-12-20T08:41:19Z

That said, we should likely double-check how we wire the token in case it gets rotated.
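To illustrate what rotation-safe wiring could look like, here is a minimal standard-library sketch that re-reads the token file on every request instead of caching it at startup. It is not the operator's code, and this thread does not show how the daemonset actually wires its token; `tokenSource`, `fileTokenSource`, `authHeader`, and `rotatingTransport` are hypothetical names for this example. In client-go itself, the `BearerTokenFile` field of `rest.Config` serves a similar purpose.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strings"
)

// tokenSource returns the bearer token to use right now.
type tokenSource func() (string, error)

// fileTokenSource re-reads the token from path on every call, so a token
// rotated on disk is picked up without restarting the process.
func fileTokenSource(path string) tokenSource {
	return func() (string, error) {
		raw, err := os.ReadFile(path)
		if err != nil {
			return "", fmt.Errorf("reading token file %q: %w", path, err)
		}
		return strings.TrimSpace(string(raw)), nil
	}
}

// authHeader builds the Authorization header value from the current token.
func authHeader(src tokenSource) (string, error) {
	tok, err := src()
	if err != nil {
		return "", err
	}
	return "Bearer " + tok, nil
}

// rotatingTransport sets a freshly read Authorization header on each request.
type rotatingTransport struct {
	source tokenSource
	base   http.RoundTripper
}

func (t *rotatingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	header, err := authHeader(t.source)
	if err != nil {
		return nil, err
	}
	// A RoundTripper must not modify the caller's request, so use a clone.
	clone := req.Clone(req.Context())
	clone.Header.Set("Authorization", header)
	return t.base.RoundTrip(clone)
}

func main() {
	// Assumption: the default mount path of the service account token in a pod.
	const tokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"

	client := &http.Client{Transport: &rotatingTransport{
		source: fileTokenSource(tokenPath),
		base:   http.DefaultTransport,
	}}
	fmt.Printf("client ready: %T\n", client.Transport)
}
```

Reading the file on every request keeps the sketch short; a real client would more likely cache the token briefly and reload it periodically to avoid the extra file I/O.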

adnankobir added the kind/bug Categorizes issue or PR as related to a bug. label Dec 19, 2024

scylla-operator-bot bot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Dec 19, 2024

tnozicka added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Dec 20, 2024

scylla-operator-bot bot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Dec 20, 2024
