Loss of connection on overloaded nodes #187
Same issue here: k8s 1.18.9, drbd 9.0.25.
There are known problems in the reconnector that prevent the controller from reconnecting automatically.
Today we faced this problem again: many resources were blocked because they were trying to reach a node that was marked as OFFLINE. The node had just been rebooted (ErrorReports.tar.gz). The last message of the satellite log:
That was me trying to test the connection, using:
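The exact command isn't preserved in this capture of the page. As a minimal sketch of one way to probe a satellite's connectivity (assuming the LINSTOR satellite's default plain-text port 3366; the host name `probe_satellite` and function name are illustrative, not from the original report):

```python
import socket

def satellite_reachable(host, port=3366, timeout=3.0):
    """Attempt a plain TCP connect to the satellite port and report
    whether the connection succeeded within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A probe like this only tells you the TCP port is open; the satellite process may still be too starved of CPU or RAM to answer the controller's protocol handshake.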
A restart of the linstor-controller brought the node back online, but not for long; it is OFFLINE again now (ErrorReports2.tar.gz).
If I restart just the satellite, I see that the linstor-controller doesn't even try to reconnect it:
Ah, my bad: it seems that in the last two cases we really did have some connectivity issues with the node.
Today it was exactly the same situation: the node ran out of RAM, and after the reboot the controller didn't try to reconnect the node until the controller itself was restarted.
@kvaps have you tried setting --kube-reserved on the kubelet? It is supposed to help avoid overloading the node.
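For reference, `--kube-reserved` carves out resources for Kubernetes system daemons so that pods cannot consume the whole node. A sketch of the equivalent KubeletConfiguration fragment (the reservation sizes below are illustrative, not values suggested in this thread):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Resources set aside for kubelet and other Kubernetes daemons.
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
# Resources set aside for OS-level services (sshd, systemd, DRBD, ...).
systemReserved:
  cpu: "500m"
  memory: "1Gi"
```

Note this protects the node from pod overload but, as discussed below, does not address the controller's reconnect behaviour.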
@AntonSmolkov Yes, it might solve the resource problem on the node, but it won't change the fact that the controller doesn't try to reconnect disconnected satellites.
Hi, I already mentioned this problem in #141 (comment): we have many nodes in a single cluster, and sometimes some of them get overloaded.
They flap between the Online and OFFLINE states, and some of them stay OFFLINE until the linstor-controller is restarted; this causes nasty problems like #186 and piraeusdatastore/linstor-csi#89.
After a linstor-controller restart all the nodes come back Online and stay in that state for a while.
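Until the reconnector is fixed, one way to detect stuck-OFFLINE nodes is to poll the controller's REST API. A minimal sketch, assuming the default REST port 3370 and the `GET /v1/nodes` endpoint, whose node objects carry a `connection_status` field:

```python
import json
from urllib.request import urlopen

def find_offline_nodes(nodes):
    """Given the parsed /v1/nodes response (a list of node objects),
    return the names of nodes not reporting ONLINE."""
    return [n["name"] for n in nodes
            if n.get("connection_status", "").upper() != "ONLINE"]

def fetch_nodes(controller="http://localhost:3370"):
    """Fetch the node list from the LINSTOR controller REST API
    (3370 is the default plain-HTTP port)."""
    with urlopen(f"{controller}/v1/nodes") as resp:
        return json.load(resp)
```

A cron job or sidecar could call `find_offline_nodes(fetch_nodes())` and alert, or restart the controller, when nodes stay OFFLINE for too long; that automates the manual controller restart described above.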