
Fix #1173: problems stopping instances #1178

Merged: 6 commits, Aug 22, 2024
Conversation

reidsunderland (Member) commented:
This fixes multiple problems:

  1. Multi-instance configs don't stop correctly. For unknown reasons, the signal handler in flow/__init__.py works in instance 1 and cleanly stops the instance, but the signal handler in moth/amqp.py takes over in instances >1. The one in flow/__init__.py never executes, so the instance never shuts itself down. This forces sr3 stop to always kill instances > 1.

    Only one signal handler at a time can be used, so whichever one was registered last was being called.

  2. Because the signal handler in moth/amqp.py was not working for instance 1, when AMQP was stuck in a long sleep because it couldn't connect to the broker, it would not shut itself down and would eventually get killed.

  3. please_stop callbacks were not being called, except in two rare cases (when messageCountMax is enabled and the max. message count is reached; or in configs where sleep is <0) (regression from please stop infinite loop by calling self #1070).

    Now please_stop callbacks are called when SIGTERM or SIGINT is received.

  4. Adds a reusable interruptible_sleep function: a sleep that can be interrupted by a variable becoming True.
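For context on points 1 and 2: CPython keeps exactly one handler installed per signal, process-wide, so whichever module registers last wins. A minimal sketch of that replacement behaviour (the handler names here are stand-ins, not the actual sarracenia functions):

```python
import signal

def flow_handler(signum, frame):   # stand-in for the handler in flow/__init__.py
    print("flow: shutting down cleanly")

def amqp_handler(signum, frame):   # stand-in for the handler in moth/amqp.py
    print("amqp: interrupting sleep")

# Registering a second handler for the same signal silently replaces
# the first; signal.signal() returns the handler that was displaced.
signal.signal(signal.SIGTERM, flow_handler)
previous = signal.signal(signal.SIGTERM, amqp_handler)

assert previous is flow_handler                           # old handler returned...
assert signal.getsignal(signal.SIGTERM) is amqp_handler   # ...and no longer installed
```

This is why only one of the two handlers ever ran: there is no chaining, so the fix has to route both shutdown paths through a single registered handler.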

I'm not 100% sure it actually fixed the problem, but I did multiple runs of flakey_broker and never saw any unacked messages... it could be a coincidence.

@reidsunderland added labels on Aug 21, 2024: bug (Something isn't working), Refactor (change implementation of existing functionality), ReliabilityRecovery (improve behaviour in failure situations).
Test Results

247 tests: 245 passed ✅, 1 skipped 💤, 1 failed ❌ (1 suite, 1 file, 1m 21s ⏱️)

For more details on these failures, see this check.

Results for commit 56d2be7.

@reidsunderland (Member, Author) commented:

The reason I'm wondering whether this has improved the unacked-message situation in flakey_broker is that the polls were previously being killed (case 2 above):

stopping rabbitmq-server
switching active poll config: polls in wVip 1
Stopping: sending SIGTERM .. ( 2 ) Done
Waiting 1 sec. to check if 2 processes stopped (try: 0)
Waiting 2 sec. to check if 2 processes stopped (try: 1)
Waiting 4 sec. to check if 2 processes stopped (try: 2)
Waiting 8 sec. to check if 2 processes stopped (try: 3)
Waiting 16 sec. to check if 2 processes stopped (try: 4)
doing SIGKILL this time
signal_pid( 12985, SIGKILL )
.Done
Waiting again...
not responding to SIGKILL:

And now they are not being killed, thanks to interruptible_sleep, so there's an improvement there. I'm just not sure whether it has actually fixed the unacked-messages problem; it could still be coincidental that I've never gotten unacked messages with the changes on this branch. Waiting for the GitHub Actions flow tests to finish to get a few more flakey runs.
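The idea behind an interruptible sleep is to replace one long time.sleep() with short slices, checking a stop flag between them, so a SIGTERM handler can flip the flag and the process wakes promptly instead of being SIGKILLed mid-sleep. A hedged sketch (the name, signature, and flag mechanism here are assumptions for illustration, not the actual sarracenia implementation):

```python
import time

def interruptible_sleep(duration, stop_requested, check_interval=0.1):
    """Sleep for up to `duration` seconds, in `check_interval` slices.

    `stop_requested` is a zero-argument callable; when it returns True
    the sleep ends early. Returns True if interrupted, False if the
    full duration elapsed. (Hypothetical sketch, not sarracenia's code.)
    """
    elapsed = 0.0
    while elapsed < duration:
        if stop_requested():       # check the flag before each slice
            return True
        time.sleep(check_interval)
        elapsed += check_interval
    return False

# Usage: a signal handler sets stop["requested"] = True, and any
# worker currently inside interruptible_sleep returns within one slice.
stop = {"requested": False}
interruptible_sleep(0.3, lambda: stop["requested"])
```

The trade-off is latency versus wake-up cost: a smaller check_interval shuts down faster but wakes the process more often while idle.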

@reidsunderland (Member, Author) commented:

And GitHub Actions did give a counter-example that proves me wrong :(

https://github.com/MetPX/sarracenia/actions/runs/10493296674/job/29066922812?pr=1178

Here the polls never got killed, but there were still unacked and ready messages waiting.

@petersilva (Contributor) left a comment:
It took me a while to understand, but this change looks great!
I ran the flow tests on 24.04, and it does seem to run much more cleanly.
There isn't the same meandering about with termination of the flakey test.
I think it's a big improvement.

@petersilva merged commit a9334a7 into development on Aug 22, 2024 (20 of 48 checks passed).