
OKD went into a failed update status overnight without any input, resulting in a master node being NotReady #1949

Closed
schmts opened this issue Jun 13, 2024 · 2 comments

Comments


schmts commented Jun 13, 2024

I've been having trouble with OKD since this morning. We're on 4.15.0-0.okd-2024-03-10-010116, and overnight OKD began to think it's updating. No one on our team did anything since yesterday afternoon (when everything seemed fine), so I suspect something else went wrong.

In the Cluster Settings tab it's reporting a failing update with the message:
"Multiple errors are preventing progress: * Cluster operator machine-config is not available * Cluster operators authentication, etcd, kube-apiserver, kube-controller-manager, kube-scheduler, openshift-apiserver are degraded."

The "oc get mcp" is reporting the master machineconfigpool updating, but none degraded with 2 ready out of 3 machines. And when I look at the nodes themselves, one of the master nodes is with an "NotReady" status. I've also not managed to open a debug pod, to open a terminal or ssh into the node yet.

The config-policy-controller pod is in a CrashLoopBackOff state with logs reporting the following errors:

In the config-policy-controller container:
2024-06-13T10:40:46.225Z error controller-runtime.source source/source.go:143 if kind is a CRD, it should be installed before calling Start {"kind": "OperatorPolicy.policy.open-cluster-management.io", "error": "no matches for kind "OperatorPolicy" in version "policy.open-cluster-management.io/v1beta1""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1

2024-06-13T10:42:46.226Z error setup app/main.go:511 Problem running manager {"error": "failed to wait for operator-policy-controller caches to sync: timed out waiting for cache to be synced"}
main.main.func5

And in the kube-rbac-proxy container:
I0613 10:22:07.400608 1 round_trippers.go:443] POST https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews 201 Created in 9 milliseconds
I0613 10:22:07.404735 1 round_trippers.go:443] POST https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews 201 Created in 3 milliseconds
2024/06/13 10:22:07 http: proxy error: dial tcp 127.0.0.1:8383: connect: connection refused
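
For completeness, roughly how those logs were pulled; the pod name is a placeholder, and the namespace is an assumption (whatever namespace the policy addon runs in, e.g. open-cluster-management-agent-addon):

oc get pods -n open-cluster-management-agent-addon | grep config-policy-controller
oc logs -n open-cluster-management-agent-addon config-policy-controller-xxxxx -c config-policy-controller
oc logs -n open-cluster-management-agent-addon config-policy-controller-xxxxx -c kube-rbac-proxy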

I've also noticed the machine-config-daemon pod on the affected node is unreachable via terminal. I deleted the pod in the hope that it would come back up, but it's stuck in Pending right now.
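
Roughly the steps used for that, with master-1 and the pod suffix as placeholders (the machine-config-daemon pods run in openshift-machine-config-operator):

oc get pods -n openshift-machine-config-operator -o wide | grep master-1
oc delete pod -n openshift-machine-config-operator machine-config-daemon-xxxxx
oc get pods -n openshift-machine-config-operator -o wide | grep master-1   # replacement stays Pending while the node is NotReady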

@titou10titou10

It seems your node is in bad shape... did you try simply rebooting the node?
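
A minimal sketch of a controlled reboot, assuming the node name is master-1; since the kubelet is down, a debug pod will likely not schedule on that node, so the chroot step may have to be replaced by a reboot from the BMC/hypervisor console:

oc adm cordon master-1
oc adm drain master-1 --ignore-daemonsets --delete-emptydir-data --force
oc debug node/master-1 -- chroot /host systemctl reboot   # or reboot out of band if this hangs
oc adm uncordon master-1   # once the node reports Ready again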


schmts commented Jun 14, 2024

Yep. A reboot helped. It might be that we're running out of resources and the node went awry.
