
Fix #1173: problems stopping instances #1178

Merged: 6 commits, Aug 22, 2024
Conversation

reidsunderland (Member) commented:
This fixes multiple problems:

  1. Multi-instance configs don't stop correctly. For unknown reasons, the signal handler in flow/__init__.py works in instance 1 and cleanly stops the instance, but the signal handler in moth/amqp.py takes over in instances >1. The one in flow/__init__.py never executes, so the instance never shuts itself down. This forces sr3 stop to always kill instances > 1.

    Only one signal handler at a time can be used, so whichever one was registered last was being called.

  2. Because the signal handler in moth/amqp.py was not working for instance 1, when AMQP was stuck in a long sleep because it couldn't connect to the broker, it would not shut itself down and would eventually get killed.

  3. please_stop callbacks were not being called, except in two rare cases (when messageCountMax is enabled and the max. message count is reached; or in configs where sleep is <0) (regression from please stop infinite loop by calling self #1070).

    Now please_stop callbacks are called when SIGTERM or SIGINT is received.

  4. Adds a reusable interruptible_sleep function: a sleep that can be interrupted by a variable becoming True.
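For context on points 1 and 2: CPython keeps exactly one handler installed per signal, process-wide, so whichever module registers last wins. A minimal sketch of that replacement behaviour (the handler names here are stand-ins, not the actual sarracenia functions):

```python
import signal

def flow_handler(signum, frame):   # stand-in for the handler in flow/__init__.py
    print("flow: shutting down cleanly")

def amqp_handler(signum, frame):   # stand-in for the handler in moth/amqp.py
    print("amqp: interrupting sleep")

# Registering a second handler for the same signal silently replaces
# the first; signal.signal() returns the handler that was displaced.
signal.signal(signal.SIGTERM, flow_handler)
previous = signal.signal(signal.SIGTERM, amqp_handler)

assert previous is flow_handler                           # old handler returned...
assert signal.getsignal(signal.SIGTERM) is amqp_handler   # ...and no longer installed
```

This is why only one of the two handlers ever ran: there is no chaining, so the fix has to route both shutdown paths through a single registered handler.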

I'm not 100% sure it actually fixed the problem, but I did multiple runs of flakey_broker and never saw any unacked messages... it could be a coincidence.

@reidsunderland added labels on Aug 21, 2024: bug (Something isn't working), Refactor (change implementation of existing functionality), ReliabilityRecovery (improve behaviour in failure situations).
Test Results

247 tests: 245 passed ✅, 1 skipped 💤, 1 failed ❌ (1 suite, 1 file, 1m 21s ⏱️)

For more details on these failures, see this check.

Results for commit 56d2be7.

@reidsunderland (Member, Author) commented:

The reason I'm wondering whether this has improved the unacked-message situation in flakey_broker is that the polls were previously being killed (case 2 above):

stopping rabbitmq-server
switching active poll config: polls in wVip 1
Stopping: sending SIGTERM .. ( 2 ) Done
Waiting 1 sec. to check if 2 processes stopped (try: 0)
Waiting 2 sec. to check if 2 processes stopped (try: 1)
Waiting 4 sec. to check if 2 processes stopped (try: 2)
Waiting 8 sec. to check if 2 processes stopped (try: 3)
Waiting 16 sec. to check if 2 processes stopped (try: 4)
doing SIGKILL this time
signal_pid( 12985, SIGKILL )
.Done
Waiting again...
not responding to SIGKILL:

And now they are not being killed, thanks to interruptible_sleep, so there's an improvement there. I'm just not sure whether it has actually fixed the unacked-messages problem; it could still be coincidental that I've never gotten unacked messages with the changes on this branch. Waiting for the GitHub Actions flow tests to finish to get a few more flakey runs.
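The idea behind an interruptible sleep is to replace one long time.sleep() with short slices, checking a stop flag between them, so a SIGTERM handler can flip the flag and the process wakes promptly instead of being SIGKILLed mid-sleep. A hedged sketch (the name, signature, and flag mechanism here are assumptions for illustration, not the actual sarracenia implementation):

```python
import time

def interruptible_sleep(duration, stop_requested, check_interval=0.1):
    """Sleep for up to `duration` seconds, in `check_interval` slices.

    `stop_requested` is a zero-argument callable; when it returns True
    the sleep ends early. Returns True if interrupted, False if the
    full duration elapsed. (Hypothetical sketch, not sarracenia's code.)
    """
    elapsed = 0.0
    while elapsed < duration:
        if stop_requested():       # check the flag before each slice
            return True
        time.sleep(check_interval)
        elapsed += check_interval
    return False

# Usage: a signal handler sets stop["requested"] = True, and any
# worker currently inside interruptible_sleep returns within one slice.
stop = {"requested": False}
interruptible_sleep(0.3, lambda: stop["requested"])
```

The trade-off is latency versus wake-up cost: a smaller check_interval shuts down faster but wakes the process more often while idle.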

@reidsunderland (Member, Author) commented:

And GitHub Actions did give a counter-example that proves me wrong :(

https://github.com/MetPX/sarracenia/actions/runs/10493296674/job/29066922812?pr=1178

Here the polls never got killed, but there were still unacked and ready messages waiting.

@petersilva (Contributor) left a comment:
It took me a while to understand, but this change looks great!
I ran the flow tests on 24.04, and it does seem to run much more cleanly.
There isn't the same meandering about with termination of the flakey test.
I think it's a big improvement.

@petersilva merged commit a9334a7 into development on Aug 22, 2024 (20 of 48 checks passed).