Cluster not reachable #932
In the example daemonset I see that pods get deployed to masters and mount /tmp. I think the control plane becomes unreachable due to the master nodes running out of disk space. Once there is no disk space left on a node, etcd stops running on that node. We run three etcd replicas, one on each master node, so when we run out of disk on two master nodes etcd loses quorum. When we lose etcd quorum we lose the API server too. I prepared a reproducer for this:
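For illustration, a rough sketch of what such a disk-filling reproducer could look like, assuming a DaemonSet that hostPath-mounts a directory from the node and keeps writing to it until the disk is full (the image, names and paths are hypothetical, not taken from the actual reproducer):

```yaml
# Hypothetical reproducer sketch: a DaemonSet that mounts a host directory
# and keeps appending data until the node runs out of disk space.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: disk-filler
spec:
  selector:
    matchLabels:
      app: disk-filler
  template:
    metadata:
      labels:
        app: disk-filler
    spec:
      # Blanket toleration so the pods also land on master nodes;
      # the exact taint keys differ between distributions/versions.
      tolerations:
      - operator: Exists
      containers:
      - name: filler
        image: busybox
        command:
        - /bin/sh
        - -c
        # Append 1GiB chunks to the host filesystem until the write fails, then idle.
        - "i=0; while dd if=/dev/zero of=/host-tmp/filler-$i bs=1M count=1024; do i=$((i+1)); sleep 1; done; while true; do sleep 3600; done"
        volumeMounts:
        - name: host-tmp
          mountPath: /host-tmp
      volumes:
      - name: host-tmp
        hostPath:
          path: /tmp
          type: Directory
```

Because the writes go through a hostPath volume, they land directly on the node's own filesystem rather than in a volume the kubelet accounts against the pod.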
@m1kola is that the right reproducer for this? Typically intensive customer workloads will never be scheduled on the masters. If they run out of disk space, etcd is going to have issues, because etcd needs to write to disk to reach consensus. That's not specific to tmpfs. From what I can tell, nodes do not have …
@mjudeikis can you clarify what we are looking to reproduce?
@ehashman correct: it is not related to tmpfs, and this can happen with any host-mounted dir. I think there is not much we (or upstream) can do to protect from this. The only more or less reasonable thing is to have …
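To make the distinction above concrete, a minimal sketch (assuming upstream Kubernetes behaviour, nothing specific to this project): per-pod ephemeral-storage limits and an `emptyDir` `sizeLimit` cap how much a pod can write to node-local ephemeral storage, but neither applies to writes that go through a `hostPath` mount such as `/tmp`, so a host-mounted dir can still fill the node's disk.

```yaml
# Sketch: limits that cap a pod's ephemeral-storage usage.
# They cover the container writable layer and emptyDir volumes,
# but NOT hostPath mounts - those writes are not accounted to the pod.
apiVersion: v1
kind: Pod
metadata:
  name: limited-writer
spec:
  containers:
  - name: writer
    image: busybox
    command: ["sh", "-c", "while true; do sleep 3600; done"]
    resources:
      limits:
        ephemeral-storage: "1Gi"   # kubelet evicts the pod if usage exceeds this
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir:
      sizeLimit: "1Gi"             # also enforced via eviction, not a hard quota
```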
I have a suspicion that if we ran a pod (in the example, in the form of a DS) on each node, mounted the `/tmp` file system and produced big files, it might crash the whole control plane. This should not happen, as individual pods should be limited in how much space they can consume from `tmpfs`.

If this code/DS (http://git.bytheb.org/cgit/stap.git/) is running on the cluster, the cluster is not usable after 1-2 days. We need to do some debugging to understand why this is happening.

Potentially reproducible in a cluster with local-RP, `persist=true` and some similar pod to produce a lot of data.