
Check if we can connect to the remote node before doing connect_node #44

Merged 7 commits into main from ping-uses-can-connect on Nov 30, 2023

Conversation

@arcusfelis (Contributor) commented Nov 29, 2023

There is a situation that occurs during upgrades:

  • The node is about to restart.
  • It closes one of its connections to the remote node.
  • It immediately gets contacted again.
  • Not all nodes are able to reconnect.
  • We see a node_reconnects warning, and global complains about overlapping networks.
  • As a result, we get nodedown, nodeup, nodedown in quick succession instead of just nodedown.

This PR eliminates the node_reconnects warning by ensuring we only call net_kernel:connect_node if we can still connect to the node.
The global overlapping-networks warning can still appear, because it is triggered by the first disconnect.

Does it increase the amount of work we need to do?

  • A bit. Usually the check returns early, because lists:member(Node, nodes()) returns true.
  • Handling nodedown, however, is a very expensive operation (we need to clean up ETS tables), so it is worth avoiding even at the cost of making test connections when connecting for the first time.

It also protects us from trying to connect to a node that is dead but still present in the DB.

The initial version of the fix had can_connect in mongoose_epmd. There are things to consider:

  • Doing the check in mongoose_epmd would block net_kernel, which would also block opening a debug shell.
  • Doing the check in cets_ping only blocks cets_join (and the cets_discovery ping call, but that one is async), which we are perfectly fine with.
  • Only cets_discovery makes brand-new connection attempts.
  • global connects because of connect_all, but that is first triggered by cets_discovery.
  • This code does not guarantee that the following connect_node will succeed; it just improves the odds.
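The check-before-connect pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual cets_ping code: the module and function names (ping_sketch, maybe_connect, can_connect) and the timeout values are hypothetical, while lists:member/2, net_kernel:connect_node/1, erl_epmd:port_please/3 and gen_tcp:connect/4 are real OTP APIs.

```erlang
-module(ping_sketch).
-export([maybe_connect/1]).

%% Only call net_kernel:connect_node/1 if a plain TCP probe to the
%% remote distribution port succeeds first.
maybe_connect(Node) ->
    case lists:member(Node, nodes()) of
        true ->
            %% Already connected: the cheap, common path.
            pong;
        false ->
            case can_connect(Node) of
                true ->
                    %% Only now pay for a real distribution connection.
                    case net_kernel:connect_node(Node) of
                        true -> pong;
                        _ -> pang
                    end;
                false ->
                    %% Dead node (e.g. a stale DB entry): skip connect_node
                    %% and avoid the nodedown/nodeup/nodedown churn.
                    pang
            end
    end.

%% Probe the remote node: ask its EPMD for the distribution port,
%% then try a plain TCP connect and close immediately.
can_connect(Node) ->
    [Name, Host] = string:split(atom_to_list(Node), "@"),
    case erl_epmd:port_please(Name, Host, 5000) of
        {port, Port, _Version} ->
            case gen_tcp:connect(Host, Port, [], 5000) of
                {ok, Socket} ->
                    gen_tcp:close(Socket),
                    true;
                {error, _} ->
                    false
            end;
        _ ->
            false
    end.
```

As noted above, a successful probe does not guarantee that the subsequent connect_node succeeds; the remote node may still go down between the probe and the handshake. The probe only filters out nodes that are clearly unreachable.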

This PR fixes the level=warning what=node_reconnects entries in the logs during an upgrade:

when=2023-11-30T08:55:51.462816+00:00 level=error what=task_failed reason=noproc pid=<0.18856.0> at=cets_long:monitor_loop/5:98 reported_by=monitor_process caller_pid=<0.18851.0> task=cets_wait_for_get_nodes
when=2023-11-30T08:55:51.520335+00:00 level=warning what=nodeup pid=<0.428.0> at=cets_discovery:handle_nodeup/2:525 downtime_millisecond_duration=172 time_since_startup_in_milliseconds=661646 connected_nodes=7 remote_node=mongooseim@mongooseim-6.mongooseim.default.svc.cluster.local
when=2023-11-30T08:55:51.525983+00:00 level=warning what=nodeup pid=<0.428.0> at=cets_discovery:handle_nodeup/2:525 downtime_millisecond_duration=162 time_since_startup_in_milliseconds=661651 connected_nodes=8 remote_node=mongooseim@mongooseim-7.mongooseim.default.svc.cluster.local
when=2023-11-30T08:55:51.538397+00:00 level=warning what=node_reconnects pid=<0.428.0> at=cets_discovery:handle_receive_start_time/3:593 text="Netsplit recovery. The remote node has been connected to us before." start_time=1701334441099 remote_node=mongooseim@mongooseim-6.mongooseim.default.svc.cluster.local
when=2023-11-30T08:55:51.551818+00:00 level=warning what=node_reconnects pid=<0.428.0> at=cets_discovery:handle_receive_start_time/3:593 text="Netsplit recovery. The remote node has been connected to us before." start_time=1701334390440 remote_node=mongooseim@mongooseim-7.mongooseim.default.svc.cluster.local
when=2023-11-30T08:55:51.568547+00:00 level=warning what=nodeup pid=<0.428.0> at=cets_discovery:handle_nodeup/2:525 downtime_millisecond_duration=197 time_since_startup_in_milliseconds=661694 connected_nodes=9 remote_node=mongooseim@mongooseim-8.mongooseim.default.svc.cluster.local
when=2023-11-30T08:55:51.614467+00:00 level=warning what=cleaner_nodedown pid=<0.425.0> at=mongoose_cleaner:handle_info/2:55 text="mongoose_cleaner received nodenown event" down_node=mongooseim@mongooseim-6.mongooseim.default.svc.cluster.local
when=2023-11-30T08:55:51.619700+00:00 level=warning what=nodeup pid=<0.428.0> at=cets_discovery:handle_nodeup/2:525 downtime_millisecond_duration=271 time_since_startup_in_milliseconds=661745 connected_nodes=10 remote_node=mongooseim@mongooseim-5.mongooseim.default.svc.cluster.local

Compare with the old build:

helm install mim-test MongooseIM --set replicaCount=10 --set image.tag=PR-4179 --set persistentDatabase=rdbms --set rdbms.username=mongooseim --set rdbms.database=mongooseim --set volatileDatabase=cets --set image.pullPolicy=Always
...wait for 10 nodes...
helm upgrade mim-test MongooseIM --set replicaCount=10 --set image.tag=PR-4179 --set persistentDatabase=rdbms --set rdbms.username=mongooseim --set rdbms.database=mongooseim --set volatileDatabase=cets --set image.pullPolicy=Always
... check logs for mongooseim-0 during updates...
helm uninstall mim-test

With the new build:

helm install mim-test MongooseIM --set replicaCount=10 --set image.tag=PR-4180 --set persistentDatabase=rdbms --set rdbms.username=mongooseim --set rdbms.database=mongooseim --set volatileDatabase=cets --set image.pullPolicy=Always
helm upgrade mim-test MongooseIM --set replicaCount=10 --set image.tag=PR-4180 --set persistentDatabase=rdbms --set rdbms.username=mongooseim --set rdbms.database=mongooseim --set volatileDatabase=cets --set image.pullPolicy=Always

The test Docker build for this PR is in esl/MongooseIM#4180.


codecov bot commented Nov 29, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (a2f4327) 98.20% compared to head (d4650e1) 98.25%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #44      +/-   ##
==========================================
+ Coverage   98.20%   98.25%   +0.04%     
==========================================
  Files          10       10              
  Lines         726      745      +19     
==========================================
+ Hits          713      732      +19     
  Misses         13       13              


@arcusfelis marked this pull request as ready for review November 30, 2023 08:57
@arcusfelis marked this pull request as draft November 30, 2023 09:58
@chrzaszcz (Member) left a comment

This change is a bit unexpected, but it makes sense and looks good. I have one concern - see my comment.

src/cets_ping.erl (outdated review thread, resolved)
@arcusfelis marked this pull request as ready for review November 30, 2023 10:12
@chrzaszcz (Member) left a comment
Looks good 👌

@chrzaszcz merged commit 9c5b2c9 into main on Nov 30, 2023
8 checks passed
@chrzaszcz deleted the ping-uses-can-connect branch November 30, 2023 10:12