Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved discovery backend #24

Merged
merged 22 commits into from
Aug 31, 2023
Merged

Improved discovery backend #24

merged 22 commits into from
Aug 31, 2023

Conversation

arcusfelis
Copy link
Contributor

@arcusfelis arcusfelis commented Aug 23, 2023

Adresses MIM-2018.

Make cets_discovery module more async and better tested.
This should avoid long locking time at startup in some cases (for example, there are a lot of dead nodes returned by the discovery backend).

Changes:

  • Make discovery gen_server non-blocking.
  • Adding wait_for_ready functionality - could be useful to avoid not-fully ready servers to be exposed to the users (i.e. to avoid accepting incoming connection on a new node, before it copied the session table, for example)
  • Smarter timeouts for retries. We use a different timeout after an error. We also don't call get_nodes too often once started.
  • Specs.
  • More tests with better coverage.
  • Fixes small race conditions in insert_new tests (rare, but they were found by rerunning tests).
  • Use peer module to do RPCs.
  • Some netsplit tests.

You can find MIM branch that uses this code here: esl/MongooseIM#4111 (for testing)

Command to test helm (ensure you use k8s from Docker Desktop and having helm installed):

# Setup DB using MIM test scripts:
DB="pgsql" ./tools/setup-db.sh
# Ensure MIM stopped...
killall beam.smp
# Get helm scripts
git clone https://github.com/esl/MongooseHelm.git
cd MongooseHelm
# Ensure old hem cluster is dead
helm uninstall mim-test
# Test with these changes
helm install mim-test MongooseIM --set replicaCount=3 --set image.tag=PR-4111 --set persistentDatabase=rdbms --set rdbms.username=ejabberd --set rdbms.password=mongooseim_secret --set rdbms.database=ejabberd --set volatileDatabase=cets
# Check that all is fine in logs:
kubectl get pod
kubectl logs mongooseim-0
kubectl exec -it mongooseim-0 -- mongooseimctl cets systemInfo

@codecov
Copy link

codecov bot commented Aug 23, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +4.86% 🎉

Comparison is base (3144dc0) 91.01% compared to head (079b0e2) 95.88%.
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #24      +/-   ##
==========================================
+ Coverage   91.01%   95.88%   +4.86%     
==========================================
  Files           8        8              
  Lines         434      510      +76     
==========================================
+ Hits          395      489      +94     
+ Misses         39       21      -18     
Files Changed Coverage Δ
src/cets.erl 100.00% <ø> (+0.57%) ⬆️
src/cets_discovery.erl 91.85% <100.00%> (+34.70%) ⬆️
src/cets_join.erl 98.93% <100.00%> (+0.99%) ⬆️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Fixed dialyzer
Fixed a race condition in test_disco_add_two_tables testcase
peer:rpc is pretty cool: it allows to apply code on nodes without touching
erlang distribution protocol

Which means we can actually setup discovery tests
Add insert_returns_when_netsplit/inserts_after_netsplit_reconnects testcases
Add cets_seq group for netsplits
@arcusfelis arcusfelis marked this pull request as ready for review August 29, 2023 22:15
Copy link
Contributor

@NelsonVides NelsonVides left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few very small comments

Comment on lines 204 to 205
local_servers => LocSet,
remote_servers => RemSet,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't these be more unreadable in the logs than a list? I'd say it depends on whether the list had duplicates. If ordsets are simply ordering the list into a data structure that is faster to read or calculate intersections then I'd prefer we print the list, it's just more readable 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yea, I did the change for logs to make logs easier to read.
get_pids() does return [Server | ordeset()]. So, I can just force to use ordsets everywhere, no need to allow non-sorted lists.

AvailableNodes = nodes(),
spawn_link(fun() ->
%% We only care about connected nodes here
%% We do not wanna try to connect here - we do it in ping_not_connected_nodes/1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
%% We do not wanna try to connect here - we do it in ping_not_connected_nodes/1
%% We do not want to try to connect here - we do it in ping_not_connected_nodes/1

Comment on lines 317 to 323
%% Returns timeout in milliseconds to retry calling the get_nodes function.
%% get_nodes is called after add_table without waiting.
%% It is also would be retried without waiting if should_retry_get_nodes set to true.
-spec retry_type_to_timeout(retry_type()) -> 5000 | 1000 | 180000.
retry_type_to_timeout(initial) -> 5000;
retry_type_to_timeout(after_error) -> 1000;
retry_type_to_timeout(regular) -> 180000.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know from the top of my mind how many minutes 180000ms is, so maybe at a comment, 180000ms = 5min, or, call timer:minutes(5), to make it more readable (comment and direct value might be "more performant though" xD)

Copy link
Contributor

@NelsonVides NelsonVides left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🔥

@NelsonVides NelsonVides merged commit 21f0479 into main Aug 31, 2023
6 checks passed
@NelsonVides NelsonVides deleted the better-disco branch August 31, 2023 13:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants