
Check if we can connect to the remote node before doing connect_node #44

Merged 7 commits into main from ping-uses-can-connect on Nov 30, 2023

Conversation

@arcusfelis (Contributor) commented Nov 29, 2023

There is a situation that occurs during upgrades:

  • The node is about to restart.
  • It closes one of its connections to the remote node.
  • It immediately gets contacted again.
  • Not all nodes are able to reconnect.
  • We see a node_reconnects warning, and global complains about overlapping networks.
  • As a result, we get nodedown, nodeup, nodedown in quick succession instead of just nodedown.

This PR eliminates the node_reconnects warning by ensuring we only call net_kernel:connect_node if we can still connect to the node.
The global overlapping-networks warning can still appear, because it is triggered by the first disconnect.

Does it increase the amount of work we need to do?

  • A bit. Usually the check returns early, because lists:member(Node, nodes()) returns true.
  • Handling nodedown, however, is a very expensive operation (we need to clean up ETS tables), so it is worth avoiding even at the cost of making test connections when connecting for the first time.

It also protects us from trying to connect to a node that is dead but still present in the DB.

The initial version of the fix had can_connect in mongoose_epmd. There are things to consider:

  • Doing the check in mongoose_epmd would block net_kernel, which would also block opening a debug shell.
  • Doing the check in cets_ping only blocks cets_join (and the cets_discovery ping call, but that one is async), which we are perfectly fine with.
  • Only cets_discovery makes brand-new connection attempts.
  • global connects because of connect_all, but that is first triggered by cets_discovery.
  • This code does not guarantee that the following connect_node will succeed; it just improves the odds.
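The check-before-connect pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual cets_ping code: the module and function names (ping_sketch, maybe_connect, can_connect) and the timeout values are hypothetical, while lists:member/2, net_kernel:connect_node/1, erl_epmd:port_please/3 and gen_tcp:connect/4 are real OTP APIs.

```erlang
-module(ping_sketch).
-export([maybe_connect/1]).

%% Only call net_kernel:connect_node/1 if a plain TCP probe to the
%% remote distribution port succeeds first.
maybe_connect(Node) ->
    case lists:member(Node, nodes()) of
        true ->
            %% Already connected: the cheap, common path.
            pong;
        false ->
            case can_connect(Node) of
                true ->
                    %% Only now pay for a real distribution connection.
                    case net_kernel:connect_node(Node) of
                        true -> pong;
                        _ -> pang
                    end;
                false ->
                    %% Dead node (e.g. a stale DB entry): skip connect_node
                    %% and avoid the nodedown/nodeup/nodedown churn.
                    pang
            end
    end.

%% Probe the remote node: ask its EPMD for the distribution port,
%% then try a plain TCP connect and close immediately.
can_connect(Node) ->
    [Name, Host] = string:split(atom_to_list(Node), "@"),
    case erl_epmd:port_please(Name, Host, 5000) of
        {port, Port, _Version} ->
            case gen_tcp:connect(Host, Port, [], 5000) of
                {ok, Socket} ->
                    gen_tcp:close(Socket),
                    true;
                {error, _} ->
                    false
            end;
        _ ->
            false
    end.
```

As noted above, a successful probe does not guarantee that the subsequent connect_node succeeds; the remote node may still go down between the probe and the handshake. The probe only filters out nodes that are clearly unreachable.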

This PR fixes the level=warning what=node_reconnects entries in the logs during an upgrade:

when=2023-11-30T08:55:51.462816+00:00 level=error what=task_failed reason=noproc pid=<0.18856.0> at=cets_long:monitor_loop/5:98 reported_by=monitor_process caller_pid=<0.18851.0> task=cets_wait_for_get_nodes
when=2023-11-30T08:55:51.520335+00:00 level=warning what=nodeup pid=<0.428.0> at=cets_discovery:handle_nodeup/2:525 downtime_millisecond_duration=172 time_since_startup_in_milliseconds=661646 connected_nodes=7 remote_node=mongooseim@mongooseim-6.mongooseim.default.svc.cluster.local
when=2023-11-30T08:55:51.525983+00:00 level=warning what=nodeup pid=<0.428.0> at=cets_discovery:handle_nodeup/2:525 downtime_millisecond_duration=162 time_since_startup_in_milliseconds=661651 connected_nodes=8 remote_node=mongooseim@mongooseim-7.mongooseim.default.svc.cluster.local
when=2023-11-30T08:55:51.538397+00:00 level=warning what=node_reconnects pid=<0.428.0> at=cets_discovery:handle_receive_start_time/3:593 text="Netsplit recovery. The remote node has been connected to us before." start_time=1701334441099 remote_node=mongooseim@mongooseim-6.mongooseim.default.svc.cluster.local
when=2023-11-30T08:55:51.551818+00:00 level=warning what=node_reconnects pid=<0.428.0> at=cets_discovery:handle_receive_start_time/3:593 text="Netsplit recovery. The remote node has been connected to us before." start_time=1701334390440 remote_node=mongooseim@mongooseim-7.mongooseim.default.svc.cluster.local
when=2023-11-30T08:55:51.568547+00:00 level=warning what=nodeup pid=<0.428.0> at=cets_discovery:handle_nodeup/2:525 downtime_millisecond_duration=197 time_since_startup_in_milliseconds=661694 connected_nodes=9 remote_node=mongooseim@mongooseim-8.mongooseim.default.svc.cluster.local
when=2023-11-30T08:55:51.614467+00:00 level=warning what=cleaner_nodedown pid=<0.425.0> at=mongoose_cleaner:handle_info/2:55 text="mongoose_cleaner received nodenown event" down_node=mongooseim@mongooseim-6.mongooseim.default.svc.cluster.local
when=2023-11-30T08:55:51.619700+00:00 level=warning what=nodeup pid=<0.428.0> at=cets_discovery:handle_nodeup/2:525 downtime_millisecond_duration=271 time_since_startup_in_milliseconds=661745 connected_nodes=10 remote_node=mongooseim@mongooseim-5.mongooseim.default.svc.cluster.local

Compare with the old build:

helm install mim-test MongooseIM --set replicaCount=10 --set image.tag=PR-4179 --set persistentDatabase=rdbms --set rdbms.username=mongooseim --set rdbms.database=mongooseim --set volatileDatabase=cets --set image.pullPolicy=Always
...wait for 10 nodes...
helm upgrade mim-test MongooseIM --set replicaCount=10 --set image.tag=PR-4179 --set persistentDatabase=rdbms --set rdbms.username=mongooseim --set rdbms.database=mongooseim --set volatileDatabase=cets --set image.pullPolicy=Always
... check logs for mongooseim-0 during updates...
helm uninstall mim-test

With the new build:

helm install mim-test MongooseIM --set replicaCount=10 --set image.tag=PR-4180 --set persistentDatabase=rdbms --set rdbms.username=mongooseim --set rdbms.database=mongooseim --set volatileDatabase=cets --set image.pullPolicy=Always
helm upgrade mim-test MongooseIM --set replicaCount=10 --set image.tag=PR-4180 --set persistentDatabase=rdbms --set rdbms.username=mongooseim --set rdbms.database=mongooseim --set volatileDatabase=cets --set image.pullPolicy=Always

The test Docker build for this PR is in esl/MongooseIM#4180.


codecov bot commented Nov 29, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (a2f4327) 98.20% compared to head (d4650e1) 98.25%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #44      +/-   ##
==========================================
+ Coverage   98.20%   98.25%   +0.04%     
==========================================
  Files          10       10              
  Lines         726      745      +19     
==========================================
+ Hits          713      732      +19     
  Misses         13       13              


@arcusfelis marked this pull request as ready for review November 30, 2023 08:57
@arcusfelis marked this pull request as draft November 30, 2023 09:58
@chrzaszcz (Member) left a comment

This change is a bit unexpected, but it makes sense and looks good. I have one concern - see my comment.

src/cets_ping.erl (outdated review thread, resolved)
@arcusfelis marked this pull request as ready for review November 30, 2023 10:12
@chrzaszcz (Member) left a comment
Looks good 👌

@chrzaszcz merged commit 9c5b2c9 into main on Nov 30, 2023
8 checks passed
@chrzaszcz deleted the ping-uses-can-connect branch November 30, 2023 10:12