Resolve k8s dev cluster errors (control plane pod restarts) #7
This issue is also mentioned in an NCEAS computing repo issue.
Update - the k8s pods
The frequency of the warning messages increases to about 10 every few seconds when even a light processing load (e.g. assessment report jobs, 10 workers) is running. It may be that ...
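(For anyone wanting to reproduce that measurement, something like the following counts warning lines over a one-minute window; the pod name is an assumption based on the usual `<component>-<nodeName>` static-pod naming on docker-dev-ucsb-1.)

```bash
# Count warning lines from kube-controller-manager over the last minute.
# Pod name assumed: static pods are named <component>-<nodeName>.
kubectl -n kube-system logs kube-controller-manager-docker-dev-ucsb-1 \
  --since=1m | grep -ci "warn"
```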
Typical kube-controller-manager log output that occurs before a restart:
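(A note for reproducing this: since the pod restarts, the log lines leading up to the restart live in the previous container instance, which kubectl can fetch with `--previous`. Pod name assumed as above.)

```bash
# Log of the container instance that was running just before the last restart.
kubectl -n kube-system logs kube-controller-manager-docker-dev-ucsb-1 --previous
```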
The kube-scheduler and kube-controller-manager restarts appear to be due to request timeouts. These services restart when a request they make times out, so that another replica can become the new leader. Even though we only have a one-node control plane, these services will restart instead of retrying their request, as described in this issue. So this does not explain why the k8s api-server is not responding in time, but it does explain the restarts. It's possible that having a multi-node control plane could compensate for the underlying problem, which has yet to be resolved. Here are some relevant log entries:
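Relatedly, the timing that governs these restarts is set by the leader-election flags on kube-scheduler and kube-controller-manager. A quick way to see what the dev cluster is currently running with (a sketch that assumes the kubeadm default manifest path; if the flags are absent, the component defaults of a 15s lease duration, 10s renew deadline, and 2s retry period apply):

```bash
# Inspect leader-election settings in the control-plane static pod manifests.
grep -H "leader-elect" \
  /etc/kubernetes/manifests/kube-controller-manager.yaml \
  /etc/kubernetes/manifests/kube-scheduler.yaml
```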
A test was made on k8s dev to attempt to resolve the restart issue by increasing the
When this file is updated (e.g. via the vi editor),
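(Background on the manifest-edit test, assuming "this file" is one of the static pod manifests under /etc/kubernetes/manifests, the kubeadm default staticPodPath: the kubelet watches that directory, so saving the file tears down the old pod and starts a new one with the changed flags. A sketch for confirming the new values are live:)

```bash
# Watch the static pod cycle after the manifest is saved
# (Ctrl-C the watch once the new pod is Running again).
kubectl -n kube-system get pods -w | grep kube-controller-manager

# Confirm the flags the new container was actually started with.
kubectl -n kube-system get pod kube-controller-manager-docker-dev-ucsb-1 \
  -o jsonpath='{.spec.containers[0].command}'
```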
@gothub mentioned this on our weekly dev call today and I thought I'd drop a data point in from my single-node cluster running on another NCEAS VM:
More restarts than seems remotely reasonable. I see a lot of warnings from etcd about deadlocks and timeouts but none clearly indicate a restart. Haven't made a good effort at figuring it out though.
Similar problem from the interwebs: https://platform9.com/kb/kubernetes/excessive-kubernetes-master-pod-restarts They say:
Good find. If I grep my etcd logs for "error", I get:
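(In case it helps to compare against the dev cluster, the equivalent grep there would look something like the following, assuming etcd runs as the static pod etcd-docker-dev-ucsb-1; the "took too long" pattern catches the slow-request/slow-disk warnings etcd emits when apply or fsync latency is high.)

```bash
# Errors in the dev cluster's etcd pod log.
kubectl -n kube-system logs etcd-docker-dev-ucsb-1 | grep -i "error"

# Slow-request / slow-disk warnings that often accompany leader-election timeouts.
kubectl -n kube-system logs etcd-docker-dev-ucsb-1 | grep -iE "took too long|slow"
```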
The dev k8s cluster is experiencing k8s system errors, with the kube-controller-manager, kube-scheduler, and etcd server showing many errors and restarts. Here is the pod and service info:
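(Gathered with something along these lines; restart counts show up in the RESTARTS column.)

```bash
# Control-plane pods (with restart counts) and services in kube-system.
kubectl -n kube-system get pods,svc -o wide
```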
Here is a sample from the etcd log:
... and from kube-scheduler:
I'm not seeing error messages from kube-controller-manager.
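Another place to look for why the containers are exiting is the last terminated state the kubelet records for each pod (exit code and reason). A sketch, again assuming static-pod naming on docker-dev-ucsb-1:

```bash
# Restart reason and exit code of the last terminated kube-scheduler container.
kubectl -n kube-system describe pod kube-scheduler-docker-dev-ucsb-1 | \
  grep -A 6 "Last State"
```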
docker-dev-ucsb-1 has adequate disk space and the system is running at a load average of "0.82 0.92 1.05".
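(Since etcd tends to be more sensitive to disk latency than to load average, I/O wait is probably worth checking alongside the basics; a sketch, with /var/lib/etcd being the kubeadm default etcd data directory and iostat coming from the sysstat package:)

```bash
# Host-level checks on docker-dev-ucsb-1.
df -h /var/lib/etcd   # free space on the etcd data directory
uptime                # load average
iostat -x 5 3         # per-device utilization and await (requires sysstat)
```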
Note that the production cluster is not experiencing these errors or restarts.