Loss of connection on overloaded nodes #187

kvaps · 2020-10-10T21:26:13Z

Hi, I already mentioned this problem in #141 (comment). We have many nodes in a single cluster, and sometimes some of them become overloaded.

They flap between the Online and OFFLINE states, and some of them stay OFFLINE until the linstor-controller is restarted. This causes odd problems like #186 and piraeusdatastore/linstor-csi#89.

After a linstor-controller restart, all the nodes come back Online and stay in that state for a while.

tobg · 2020-10-13T09:09:20Z

Same issue here. k8s 1.18.9, drbd 9.0.25

raltnoeder · 2020-10-19T09:45:46Z

There are known problems in the reconnector that prevent the controller from reconnecting automatically.
Once the affected parts are redesigned and reimplemented, it should be able to reliably reconnect automatically. However, that will not solve the flapping on overloaded systems. If a system is so overloaded that it cannot answer requests in time, it is considered lost, which causes the connection to be dropped.
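
To illustrate the timeout rule described above, here is a minimal sketch of a liveness check. It is not LINSTOR's actual code: the class `PeerLiveness`, its fields, and the timeout value are hypothetical names chosen for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class PeerLiveness:
    """Hypothetical liveness tracker; not taken from the LINSTOR sources."""

    timeout_s: float  # assumed maximum silence before a peer counts as lost
    last_reply: float = field(default_factory=time.monotonic)

    def record_reply(self, now=None):
        # Call whenever the peer answers a request.
        self.last_reply = time.monotonic() if now is None else now

    def is_lost(self, now=None):
        # A peer that stays silent longer than the timeout is treated as lost,
        # whether it is down or merely too overloaded to answer in time.
        now = time.monotonic() if now is None else now
        return (now - self.last_reply) > self.timeout_s
```

With a rule like this, a node that is alive but too slow to answer is indistinguishable from a dead one, which would explain flapping on overloaded systems even after the reconnector is fixed.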

kvaps · 2020-11-13T09:41:48Z

Today we faced this problem again. Many resources were blocked because they were trying to reach a node that was marked as OFFLINE:

╭────────────────────────────────────────────────────────╮
┊ Node  ┊ NodeType  ┊ Addresses                ┊ State   ┊
╞════════════════════════════════════════════════════════╡
┊ m8c24 ┊ SATELLITE ┊ 10.36.129.114:3367 (SSL) ┊ OFFLINE ┊
╰────────────────────────────────────────────────────────╯

╭──────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName       ┊ Node  ┊ Port  ┊ Usage  ┊ Conns             ┊    State ┊ CreatedOn           ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-9423-disk-0 ┊ m11c7 ┊ 55986 ┊ Unused ┊ Ok                ┊ UpToDate ┊                     ┊
┊ one-vm-9423-disk-0 ┊ m13c8 ┊ 55986 ┊ Unused ┊ Connecting(m8c24) ┊ Diskless ┊ 2020-11-13 08:03:35 ┊
┊ one-vm-9423-disk-0 ┊ m8c24 ┊ 55986 ┊        ┊                   ┊  Unknown ┊                     ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯


╭───────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName       ┊ Node   ┊ Port  ┊ Usage  ┊ Conns             ┊    State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-9217-disk-0 ┊ m15c19 ┊ 55799 ┊ Unused ┊ Connecting(m8c24) ┊ Diskless ┊ 2020-11-13 08:01:47 ┊
┊ one-vm-9217-disk-0 ┊ m15c22 ┊ 55799 ┊ Unused ┊ Ok                ┊ UpToDate ┊                     ┊
┊ one-vm-9217-disk-0 ┊ m8c24  ┊ 55799 ┊        ┊                   ┊  Unknown ┊                     ┊
┊ one-vm-9217-disk-0 ┊ m8c9   ┊ 55799 ┊ Unused ┊ Ok                ┊ Diskless ┊ 2020-11-13 07:20:26 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯

This node had just been rebooted.

ErrorReports.tar.gz
linstor-controller.log
linstor-satellite.log

The last message in the satellite log:

07:46:51.973 [SSLNetComService] ERROR LINSTOR/Satellite - SYSTEM - Unhandled IllegalStateException [Report number 5FAE330B-93455-000176]

That was me trying to test the connection, using:

telnet 10.36.129.114 3367
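
A scripted equivalent of the telnet check could look like the sketch below. The function name `can_connect` and the default timeout are assumptions for illustration. It only tests whether a TCP connection can be opened; it does not perform an SSL handshake, so the satellite may log an error for such a probe, as it apparently did for the telnet test.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the address and applies the timeout
        # to the connection attempt; the socket is closed on exit.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For the node above, the call would be `can_connect("10.36.129.114", 3367)`.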

Restarting the linstor-controller brought the node back online, but not for long; it is now OFFLINE again.

ErrorReports2.tar.gz
linstor-controller2.log
linstor-satellite2.log

kvaps · 2020-11-13T09:51:11Z

If I restart just the satellite, I see that the linstor-controller does not even try to reconnect to it:

linstor-satellite3.log

kvaps · 2020-11-13T09:56:01Z

Ah, my bad, it seems that in the last two cases we really did have some connectivity issues with the node.

kvaps · 2020-11-16T09:40:32Z

Today we had exactly the same situation: the node ran out of RAM, and after the reboot the controller didn't try to reconnect to the node until the controller itself was restarted.

╭───────────────────────────────────────────────────────╮
┊ Node  ┊ NodeType  ┊ Addresses               ┊ State   ┊
╞═══════════════════════════════════════════════════════╡
┊ m7c29 ┊ SATELLITE ┊ 10.36.129.74:3367 (SSL) ┊ OFFLINE ┊
╰───────────────────────────────────────────────────────╯

AntonSmolkov · 2021-05-19T21:10:15Z

@kvaps have you tried setting --kube-reserved on the kubelet? It is supposed to help avoid overloading.

kvaps · 2021-05-20T07:52:23Z

@AntonSmolkov Yes, it might solve the problem with the resources on the node, but it will not solve the fact that the controller does not try to reconnect to disconnected satellites.
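
For reference, the kind of automatic reconnect behaviour being asked for can be sketched as a retry loop with exponential backoff. This is a generic illustration, not the controller's reconnector: `reconnect_with_backoff`, `try_connect`, and all default values are hypothetical.

```python
import time


def reconnect_with_backoff(try_connect, max_attempts=5, base_delay_s=1.0,
                           max_delay_s=30.0, sleep=time.sleep):
    """Call try_connect() until it succeeds or attempts run out.

    Returns the 1-based attempt number that succeeded, or None on failure.
    try_connect is a hypothetical callable that returns True on success.
    """
    delay = base_delay_s
    for attempt in range(1, max_attempts + 1):
        if try_connect():
            return attempt
        if attempt < max_attempts:
            # Wait before the next attempt, doubling the delay up to a cap
            # so an overloaded node is not hammered with connection attempts.
            sleep(delay)
            delay = min(delay * 2, max_delay_s)
    return None
```

The `sleep` parameter is injectable only so the loop can be exercised without real waiting.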

kvaps · 2021-06-23T19:22:18Z

Looks like it is a controller issue. I see the following picture quite often: many nodes change their state to 'Connected', and only a controller restart makes it work again.

AntonSmolkov mentioned this issue May 25, 2021

SSLException: closing inbound before receiving peer's close_notify piraeusdatastore/piraeus-operator#181

Closed

kvaps mentioned this issue Aug 6, 2021

satellites turn into 'offline' frequently kvaps/kube-linstor#49

Closed
