Log error once in nested long calls #38

Merged: 9 commits merged into main from log-error-once-in-nested-long-calls on Nov 6, 2023

Conversation

@arcusfelis (Contributor) commented Oct 24, 2023

This PR reduces the amount of noise reported for the "MIM-2073 Multiple errors in the logs on cluster startup with CETS" issue.

Previously, the same error was reported three times from three different places (we were over-reporting):

when=2023-10-17T19:10:31.083740+00:00 level=error what=long_task_failed reason={{nodedown,'[email protected]'},{gen_server,call,[<30887.545.0>,other_servers,infinity]}} pid=<0.567.0> at=cets_long:run_tracked/2:63 stacktrace="gen_server:call/3:401 cets_long:run_tracked/2:54 cets_join:get_pids/1:192 cets_join:'-check_fully_connected/2-lc$^0/1-0-'/1:264 cets_join:'-check_fully_connected/2-lc$^0/1-0-'/1:264 cets_join:check_fully_connected/2:264 cets_join:check_pids/4:199 cets_join:join2/4:109" server=<30887.545.0> msg=other_servers pid=<30887.545.0> node=mongooseim@mongooseim-1.mongooseim.default.svc.cluster.local
when=2023-10-17T19:10:31.084280+00:00 level=error what=long_task_failed reason={{nodedown,'[email protected]'},{gen_server,call,[<30887.545.0>,other_servers,infinity]}} pid=<0.567.0> at=cets_long:run_tracked/2:63 stacktrace="gen_server:call/3:401 cets_long:run_tracked/2:54 cets_join:get_pids/1:192 cets_join:'-check_fully_connected/2-lc$^0/1-0-'/1:264 cets_join:'-check_fully_connected/2-lc$^0/1-0-'/1:264 cets_join:check_fully_connected/2:264 cets_join:check_pids/4:199 cets_join:join2/4:109" local_pid=<0.545.0> remote_pid=<30886.545.0> remote_node=mongooseim@mongooseim-0.mongooseim.default.svc.cluster.local table=cets_cluster_id
when=2023-10-17T19:10:31.086471+00:00 level=error what=long_task_failed reason={{nodedown,'[email protected]'},{gen_server,call,[<30887.545.0>,other_servers,infinity]}} pid=<0.561.0> at=cets_long:run_tracked/2:63 stacktrace="gen_server:call/3:401 cets_long:run_tracked/2:54 cets_join:get_pids/1:192 cets_join:'-check_fully_connected/2-lc$^0/1-0-'/1:264 cets_join:'-check_fully_connected/2-lc$^0/1-0-'/1:264 cets_join:check_fully_connected/2:264 cets_join:check_pids/4:199 cets_join:join2/4:109" long_task_name=join local_pid=<0.545.0> remote_pid=<30886.545.0> remote_node=mongooseim@mongooseim-0.mongooseim.default.svc.cluster.local table=cets_cluster_id

Once inside the cets call and twice inside the cets_join module.

Changes:

  • This PR removes the 2 extra log reports.
  • Only the most nested call is logged: in our case the cets module logs the error, while cets_join remains silent.
  • When re-throwing the error, we wrap it in task_failed so that the enclosing long call (from the cets_long module) does not report it again (see the first sketch after this list).
  • This PR only removes log duplication; in real life you would still see a lot of messages, because of the retry logic and because a message can potentially be logged for each unique table name. At least the number of logged messages is smaller now.
  • Updated the version of the debug log handler dependency, because the OTP Logger has a really hard time when add_handler/remove_handler are called from different processes (there are race-condition bugs there; we would need to file a bug report with OTP). For now, we have to use global:transaction when adding log handlers in our tests (see the second sketch below).
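As a first sketch of the task_failed wrapping (an illustrative assumption, not the actual cets_long code; the module name, the shape of the Info map, and the task_failed tuple are made up for the example): the innermost tracked call logs the error once and re-raises it wrapped in task_failed, and any enclosing tracked call recognises the wrapper and stays silent.

    %% Sketch only - not the real cets_long implementation.
    -module(nested_log_sketch).
    -export([run_tracked/2]).

    run_tracked(Info, Fun) when is_map(Info), is_function(Fun, 0) ->
        try
            Fun()
        catch
            Class:Reason:Stacktrace ->
                case Reason of
                    {task_failed, _, _} ->
                        %% Already logged by a more nested tracked call:
                        %% re-raise without a duplicate report.
                        erlang:raise(Class, Reason, Stacktrace);
                    _ ->
                        %% Innermost failure point: log exactly once, then
                        %% wrap the reason so enclosing calls skip logging.
                        logger:error(Info#{what => task_failed,
                                           class => Class,
                                           reason => Reason}),
                        erlang:raise(Class, {task_failed, Reason, Info},
                                     Stacktrace)
                end
        end.

With this shape, cets_join can keep wrapping its work in tracked calls around cets operations without producing a second or third report for the same crash.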

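And for the last point, a sketch of the test-side workaround based on global:trans/2 (the with_log_handler/3 helper, the lock id, and the logger_std_h handler module are assumptions for illustration, not the project's actual test code):

    %% Sketch only - serialize Logger handler add/remove across test
    %% processes with a global lock, to sidestep races in the handler table.
    -module(log_handler_lock_sketch).
    -export([with_log_handler/3]).

    with_log_handler(HandlerId, HandlerConfig, Fun) ->
        LockId = {?MODULE, self()},
        global:trans(LockId, fun() ->
            ok = logger:add_handler(HandlerId, logger_std_h, HandlerConfig),
            try
                Fun()
            after
                ok = logger:remove_handler(HandlerId)
            end
        end).

Taking a global lock around add_handler/remove_handler keeps concurrent test processes from touching the Logger handler table at the same time, which is the race the description refers to.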
Useful for distinguishing reports from parallel tasks, for example
Sadly, MongooseIM does not report the pid in logs
Log only the first time when having nested long tasks
Remove long from the task tags to avoid confusing them with actually long tasks
(though progress logging would still say "long")
@codecov (bot) commented Oct 24, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (f131ac9) 97.46% compared to head (f967044) 97.95%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #38      +/-   ##
==========================================
+ Coverage   97.46%   97.95%   +0.49%     
==========================================
  Files           9        9              
  Lines         630      637       +7     
==========================================
+ Hits          614      624      +10     
+ Misses         16       13       -3     
Files               Coverage Δ
src/cets_long.erl   100.00% <100.00%> (+5.40%) ⬆️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.

And to check that logging actually works there
@arcusfelis marked this pull request as ready for review October 24, 2023 07:49
@chrzaszcz (Member) left a comment:
Looks 👌

src/cets_long.erl: review comment (outdated, resolved)
We use the latest version of debug_log_handler, which has fixes
for race conditions in the Logger
@chrzaszcz (Member) left a comment:
Looks good 👍

@chrzaszcz merged commit 4f59847 into main Nov 6, 2023
8 checks passed
@chrzaszcz deleted the log-error-once-in-nested-long-calls branch November 6, 2023 08:20
arcusfelis added a commit to esl/MongooseIM that referenced this pull request Nov 7, 2023
Includes Log error once in nested long calls esl/cets#38
Includes Fix never returning gen_server:call(Server, sync) call esl/cets#37
Use ping_all instead of cets:sync (it has been renamed)