
OKD went into a failed update status overnight without any input, resulting in a master node being NotReady #1949

Closed
schmts opened this issue Jun 13, 2024 · 2 comments

Comments


schmts commented Jun 13, 2024

I've been having trouble with OKD since this morning. We're on 4.15.0-0.okd-2024-03-10-010116, and overnight OKD began to think it's updating. No one on our team did anything since yesterday afternoon (when everything seemed fine), so I suspect something else went wrong.

In the Cluster Settings tab it's reporting a failing update with the message:
"Multiple errors are preventing progress: * Cluster operator machine-config is not available * Cluster operators authentication, etcd, kube-apiserver, kube-controller-manager, kube-scheduler, openshift-apiserver are degraded."

The "oc get mcp" is reporting the master machineconfigpool updating, but none degraded with 2 ready out of 3 machines. And when I look at the nodes themselves, one of the master nodes is with an "NotReady" status. I've also not managed to open a debug pod, to open a terminal or ssh into the node yet.

The config-policy-controller pod is in a CrashLoopBackOff state with logs reporting the following errors:

In the config-policy-controller container:
2024-06-13T10:40:46.225Z error controller-runtime.source source/source.go:143 if kind is a CRD, it should be installed before calling Start {"kind": "OperatorPolicy.policy.open-cluster-management.io", "error": "no matches for kind "OperatorPolicy" in version "policy.open-cluster-management.io/v1beta1""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1

2024-06-13T10:42:46.226Z error setup app/main.go:511 Problem running manager {"error": "failed to wait for operator-policy-controller caches to sync: timed out waiting for cache to be synced"}
main.main.func5

And in the kube-rbac-proxy container:
I0613 10:22:07.400608 1 round_trippers.go:443] POST https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews 201 Created in 9 milliseconds
I0613 10:22:07.404735 1 round_trippers.go:443] POST https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews 201 Created in 3 milliseconds
2024/06/13 10:22:07 http: proxy error: dial tcp 127.0.0.1:8383: connect: connection refused
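
For completeness, roughly how those logs were pulled; the pod name is a placeholder, and the namespace is an assumption (whatever namespace the policy addon runs in, e.g. open-cluster-management-agent-addon):

oc get pods -n open-cluster-management-agent-addon | grep config-policy-controller
oc logs -n open-cluster-management-agent-addon config-policy-controller-xxxxx -c config-policy-controller
oc logs -n open-cluster-management-agent-addon config-policy-controller-xxxxx -c kube-rbac-proxy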

I've also noticed the machine-config-daemon pod on the affected node is unreachable via terminal. I deleted the pod in the hope that it would come back up, but it's stuck in Pending right now.
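
Roughly the steps used for that, with master-1 and the pod suffix as placeholders (the machine-config-daemon pods run in openshift-machine-config-operator):

oc get pods -n openshift-machine-config-operator -o wide | grep master-1
oc delete pod -n openshift-machine-config-operator machine-config-daemon-xxxxx
oc get pods -n openshift-machine-config-operator -o wide | grep master-1   # replacement stays Pending while the node is NotReady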

@titou10titou10

It seems your node is in bad shape... did you try simply rebooting the node?
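
A minimal sketch of a controlled reboot, assuming the node name is master-1; since the kubelet is down, a debug pod will likely not schedule on that node, so the chroot step may have to be replaced by a reboot from the BMC/hypervisor console:

oc adm cordon master-1
oc adm drain master-1 --ignore-daemonsets --delete-emptydir-data --force
oc debug node/master-1 -- chroot /host systemctl reboot   # or reboot out of band if this hangs
oc adm uncordon master-1   # once the node reports Ready again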


schmts commented Jun 14, 2024

Yep. A reboot helped. It might be that we're running out of resources and the node went awry.
