Loss of connection on overloaded nodes #187

Open
kvaps opened this issue Oct 10, 2020 · 9 comments
Comments

@kvaps

kvaps commented Oct 10, 2020

Hi, I already mentioned this problem in #141 (comment). We have many nodes in a single cluster, and sometimes some of them get overloaded.

They flap between the Online and OFFLINE states, and some of them stay OFFLINE until the linstor-controller is restarted. This causes odd problems like #186 and piraeusdatastore/linstor-csi#89.

After the linstor-controller restart, all the nodes come back Online and stay in that state for a while.

@tobg

tobg commented Oct 13, 2020

Same issue here. k8s 1.18.9 drbd 9.0.25

@raltnoeder
Member

There are known problems in the reconnector that prevent the controller from reconnecting automatically.
Once the affected parts are redesigned and reimplemented, it should be able to reliably reconnect automatically. However, that will not solve the flapping on overloaded systems. If a system is so overloaded that it cannot answer requests in time, it is considered lost, which causes the connection to be dropped.
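
To make the distinction between the two problems concrete, here is a minimal sketch of a reply timeout plus a capped reconnect backoff. It is not LINSTOR's implementation: `PeerMonitor`, `backoff_delays`, and all the numbers are hypothetical names and values chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PeerMonitor:
    """Tracks when a peer last answered (hypothetical helper, not LINSTOR code)."""

    timeout_s: float            # how long a peer may stay silent
    last_reply_at: float = 0.0  # timestamp of the most recent reply

    def record_reply(self, now: float) -> None:
        self.last_reply_at = now

    def is_lost(self, now: float) -> bool:
        # A peer that answers too late is indistinguishable from a dead one.
        return now - self.last_reply_at > self.timeout_s


def backoff_delays(base_s: float, cap_s: float, attempts: int) -> List[float]:
    """Delays before successive reconnect attempts: doubling, capped at cap_s."""
    return [min(cap_s, base_s * 2 ** i) for i in range(attempts)]


monitor = PeerMonitor(timeout_s=5.0)
monitor.record_reply(now=100.0)
print(monitor.is_lost(now=104.0))     # False: reply is 4 s old
print(monitor.is_lost(now=106.0))     # True: reply is 6 s old, peer considered lost
print(backoff_delays(1.0, 30.0, 6))   # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

In this sketch a reliable reconnector only addresses what happens after `is_lost` fires. An overloaded node that keeps answering late will still trip the timeout repeatedly, which is the flapping described above.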

@kvaps
Author

kvaps commented Nov 13, 2020

Today we faced this problem again. Many resources were blocked because they were trying to reach a node that was marked as OFFLINE:

╭────────────────────────────────────────────────────────╮
┊ Node  ┊ NodeType  ┊ Addresses                ┊ State   ┊
╞════════════════════════════════════════════════════════╡
┊ m8c24 ┊ SATELLITE ┊ 10.36.129.114:3367 (SSL) ┊ OFFLINE ┊
╰────────────────────────────────────────────────────────╯

╭──────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName       ┊ Node  ┊ Port  ┊ Usage  ┊ Conns             ┊    State ┊ CreatedOn           ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-9423-disk-0 ┊ m11c7 ┊ 55986 ┊ Unused ┊ Ok                ┊ UpToDate ┊                     ┊
┊ one-vm-9423-disk-0 ┊ m13c8 ┊ 55986 ┊ Unused ┊ Connecting(m8c24) ┊ Diskless ┊ 2020-11-13 08:03:35 ┊
┊ one-vm-9423-disk-0 ┊ m8c24 ┊ 55986 ┊        ┊                   ┊  Unknown ┊                     ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯


╭───────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName       ┊ Node   ┊ Port  ┊ Usage  ┊ Conns             ┊    State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-9217-disk-0 ┊ m15c19 ┊ 55799 ┊ Unused ┊ Connecting(m8c24) ┊ Diskless ┊ 2020-11-13 08:01:47 ┊
┊ one-vm-9217-disk-0 ┊ m15c22 ┊ 55799 ┊ Unused ┊ Ok                ┊ UpToDate ┊                     ┊
┊ one-vm-9217-disk-0 ┊ m8c24  ┊ 55799 ┊        ┊                   ┊  Unknown ┊                     ┊
┊ one-vm-9217-disk-0 ┊ m8c9   ┊ 55799 ┊ Unused ┊ Ok                ┊ Diskless ┊ 2020-11-13 07:20:26 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯
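
For anyone who wants to spot this state from a script, here is a rough sketch that parses tables pasted in the format above and lists OFFLINE nodes and resources stuck in `Connecting(...)`. It works on the text layout only; `parse_tables` is a hypothetical helper, not part of the linstor client, and the sample is a shortened copy of the output above.

```python
from typing import Dict, List


def parse_tables(text: str) -> List[Dict[str, str]]:
    """Parse box-drawn tables into one dict per data row, keyed by column header."""
    rows: List[Dict[str, str]] = []
    header = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("╭"):
            header = None  # top border: a new table starts
            continue
        if not line.startswith("┊"):
            continue  # separators, bottom borders, blank lines
        cells = [cell.strip() for cell in line.strip("┊").split("┊")]
        if header is None:
            header = cells  # the first cell row of each table is its header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


SAMPLE = """
╭──────────────────────────────────────────────────────╮
┊ Node  ┊ NodeType  ┊ Addresses                ┊ State   ┊
╞══════════════════════════════════════════════════════╡
┊ m8c24 ┊ SATELLITE ┊ 10.36.129.114:3367 (SSL) ┊ OFFLINE ┊
╰──────────────────────────────────────────────────────╯
╭──────────────────────────────────────────────────────╮
┊ ResourceName       ┊ Node  ┊ Port  ┊ Usage  ┊ Conns             ┊    State ┊ CreatedOn           ┊
╞══════════════════════════════════════════════════════╡
┊ one-vm-9423-disk-0 ┊ m11c7 ┊ 55986 ┊ Unused ┊ Ok                ┊ UpToDate ┊                     ┊
┊ one-vm-9423-disk-0 ┊ m13c8 ┊ 55986 ┊ Unused ┊ Connecting(m8c24) ┊ Diskless ┊ 2020-11-13 08:03:35 ┊
┊ one-vm-9423-disk-0 ┊ m8c24 ┊ 55986 ┊        ┊                   ┊  Unknown ┊                     ┊
╰──────────────────────────────────────────────────────╯
"""

rows = parse_tables(SAMPLE)
offline = [r["Node"] for r in rows if r.get("State") == "OFFLINE"]
waiting = [
    (r["ResourceName"], r["Node"])
    for r in rows
    if r.get("Conns", "").startswith("Connecting(")
]
print(offline)  # ['m8c24']
print(waiting)  # [('one-vm-9423-disk-0', 'm13c8')]
```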

This node had just been rebooted.

ErrorReports.tar.gz
linstor-controller.log
linstor-satellite.log

The last message in the satellite log:

07:46:51.973 [SSLNetComService] ERROR LINSTOR/Satellite - SYSTEM - Unhandled IllegalStateException [Report number 5FAE330B-93455-000176]

That was me testing the connection with:

telnet 10.36.129.114 3367
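
The same reachability check can be scripted. This is a minimal sketch: `tcp_port_open` is a hypothetical helper, and the 3-second timeout is an arbitrary choice. It only confirms that a TCP connection can be opened and performs no SSL handshake, so, like the telnet test, it may leave an error in the satellite log.

```python
import socket


def tcp_port_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Address and port taken from the node list above.
    print(tcp_port_open("10.36.129.114", 3367))
```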

Restarting the linstor-controller brought the node back online, but not for long; it is now OFFLINE again.

ErrorReports2.tar.gz
linstor-controller2.log
linstor-satellite2.log

@kvaps
Author

kvaps commented Nov 13, 2020

If I restart just the satellite, I see that the linstor-controller does not even try to reconnect to it:

linstor-satellite3.log

@kvaps
Author

kvaps commented Nov 13, 2020

Ah, my bad. It seems that in the last two cases we really did have connectivity issues with the node.

@kvaps
Author

kvaps commented Nov 16, 2020

Today we hit exactly the same situation: the node was short on RAM, and after its reboot the controller didn't try to reconnect to it until the controller was restarted.

╭───────────────────────────────────────────────────────╮
┊ Node  ┊ NodeType  ┊ Addresses               ┊ State   ┊
╞═══════════════════════════════════════════════════════╡
┊ m7c29 ┊ SATELLITE ┊ 10.36.129.74:3367 (SSL) ┊ OFFLINE ┊
╰───────────────────────────────────────────────────────╯

@AntonSmolkov

@kvaps have you tried setting --kube-reserved on the kubelet? It is supposed to help avoid overloading.

@kvaps
Author

kvaps commented May 20, 2021

@AntonSmolkov Yes, it might solve the resource problem on the node, but it will not solve the fact that the controller does not try to reconnect to disconnected satellites.

@kvaps
Author

kvaps commented Jun 23, 2021

Looks like it is a controller issue. I see the following picture quite often: many nodes change their state to 'Connected', and only a controller restart makes it work again.

image
