Prevent hang on exit while omero.client keepalive is active #424

DavidStirling · 2024-07-31T11:52:23Z

When using omero.client best practice appears to be to have everything in a try block and explicitly call client.closeSession to shut everything down nicely, however users don't always remember to do this.

If a user has called client.enableKeepAlive(x) then exiting without calling closeSession currently seems to cause Python to hang instead of exiting.

Minimal example:

import omero
client = omero.client(host="my.omero.server")
session = client.createSession(username="user.name", password="my.pwd")
client.enableKeepAlive(5)
print("Session is active: ", client.getSessionId())
# A normal exit behaves similarly, but this is clearer in the console
raise Exception("A forced crash")

At least on my end, executing this script will throw the exception but fail to exit the interpreter.

Having investigated this, I believe that the task thread within the omero.util.Resources class fails to shut down when the main thread exits. The result is that, while the interpreter is waiting for all threads to close so that it can exit, the keepalive thread keeps spinning. It's currently set up to wait for an explicit close signal from the main thread, but this obviously never comes if that thread has stopped. You therefore get a hang.

As a quick fix, making these worker threads into daemon threads ensures that they die if the main thread exits, thus fixing the problem. I'm not sure how acceptable this is as the Resources class seems to be used for other repeating processes, but I couldn't find a scenario where these should also continue even in the absence of the main thread. If that's not the case then you could instead build some sort of "is the main thread alive?" check into the task.

It might actually be worth a broader review of omero-py's cleanup/shutdown routines. It looks like much of that code is rather old and so modern Python could have more effective ways to manage connections.

joshmoore · 2024-07-31T14:09:02Z

It might actually be worth a broader review of omero-py's cleanup/shutdown routines. It looks like much of that code is rather old

Definitely! 😄

... and so modern Python could have more effective ways to manage connections.

👍

I'm not sure how acceptable this is as the Resources class seems to be used for other repeating processes, but I couldn't find a scenario where these should also continue even in the absence of the main thread.

Resources was originally built (IIRC) for use in the Processor service. A review of whether or not there's anything that those tasks need time to do might be worth it. Even if that is the case, though, we could certainly have this as a parameter to separate between server and client side.

chris-allan · 2024-08-02T17:17:39Z

Having looked at this in a lot more detail I'm pretty confident this is just a bug fix. There is more detailed explanation here:

The summary is that our code that is in question here was written in the Python 2.4 era and sometime around Python 2.6.5 the semantics of how sys.exit is handled changed. This change resulted more or less in the standard patterns being applied to interpreter exit where they were not previously. The standard pattern for cleanup in the threading module is to join all non-daemon threads. Any thread that a Resources class spawns currently is not a daemon thread and must have the atexit handler triggered to be joinable. Consequently, we deadlock.

The only other place I see omero.util.Resources being used besides the client object keep alives is for instances of Servant. These too would currently deadlock if sys.exit were initiated but that doesn't happen as they are shut down explicitly. I'll check more thoroughly but I can't see a downside to daemon task thread being used there especially with the current implementation's reliance on an atexit handler.

chris-allan · 2024-08-05T08:40:52Z

For reference, current uses of Servant are:

omero.processor.ProcessorI (uses Resources for session management)
omero.tables.TablesI (uses Resources for TableI servant, which uses SimpleServant without Resources, management)

This code was originally added to address https://trac.openmicroscopy.org/ome/ticket/3260 where an exception was being raised: Traceback (most recent call last): File "/usr/lib/python2.4/logging/__init__.py", line 737, in emit self.stream.write(fs % msg) ValueError: I/O operation on closed file Due to the many changes in interpreter behaviour since Python 2.4, including those surrounding sys.exit in the ~2.6.5 era [^1], as well as improvements to the logging module itself this is no longer an issue. In the referenced Python issue 9501, the related code changes were originally committed in Subversion r84282, Mercurial changeset 42c70f17c2fede6116108568a75b0eec08b3c73a, Git commit 1ddd51fc71c6a6d255eee8e7e54d1bf7e05c5841. Found by tracing the Mercurial changeset via `Misc/svnmap.txt` in the cython repository. At a future date `omero.util.concurrency.AtExitEvent` and `omero.util.concurrency.get_event()` should likely be deprecated and their use discontinued. They attempt to handle issues that no longer exist and add complexity and maintenance overhead to our runtime. See also: * https://bugs.python.org/issue6333 * https://bugs.python.org/issue13807 [^1]: https://stackoverflow.com/questions/3713360/python-2-6-x-theading-signals-atexit-fail-on-some-versions/3765160#3765160

chris-allan · 2024-08-05T09:07:45Z

Detailed rationale for the logging changes in 0f8fe1c. Future deprecation, refactoring and cleanup of omero.util.concurrency.AtExitEvent and omero.util.concurrency.get_event() should definitely be done; the code makes Python 2.4 era assumptions.

knabar

Both commits make sense

will-moore · 2024-08-05T15:41:49Z

Without this PR, calling client.enableKeepAlive(5) prevents interpreter from exiting with Ctrl-D.
When this is forced with Ctrl-C I get...

(omeroweb2) Williams-MacBook-Pro:zarr-python wmoore$ python
Python 3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:50:29) 
[Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import omero
>>> client = omero.client(host="localhost")
>>> session = client.createSession(username="user", password="pass")
>>> client.enableKeepAlive(5)
>>> 
^CException ignored in: <module 'threading' from '/Users/wmoore/opt/anaconda3/envs/omeroweb2/lib/python3.9/threading.py'>
Traceback (most recent call last):
  File "/Users/wmoore/opt/anaconda3/envs/omeroweb2/lib/python3.9/threading.py", line 1477, in _shutdown
    lock.acquire()

Updating to this branch fixes that behaviour (clean exit with Ctrl-D).

👍

sbesson

I audited the usage of the omero.util.Resources API across the OME code:

in omero.clients, BaseClient object has a private __resources field which gets filled with one (or theoretically multiple) Entry objects which will periodically running a keepAlive() call
unTables.py starts a omero.tables.TablesI service server-side. This service is a subclass of omero.util.Servant which means it is an omero.util.SimpleServant managing some resources. Each TableI object retrieved via the getTable API is registered as a resource
-runProcessor.py similarly starts the omero.processor.ProcessorI service server-side which is also subclass of omero.util.Servant. Two types of resources are registered in this service:
- the session kept alive via the omero.processor.UseSessionHolder object
- the omero.processor.ProcessI (or subclass) launched by the service
in omero-dropbox, MonitorClientI registers the omero.util.ServerContext as a resource to be able to retrieve the active session

No regression in the existing unit and integration tests have been flagged since this PR got included. As shown above, in many scenarios the resources objects are used to maintain an active connection . Client-side, the keepAlive task should definitely be terminated alongside the main thread which created the session and there are no active resources to cleanup.
Server-side, all usages of resources to maintain an active session should likely be terminated if the process itself is restarted.
I'll perform a round of functional testing trying and review the cleanup conditions for the table/processor use cases. Otherwise, I agree this should likely be released as a bug fix.

sbesson

I ran some additional tests using the Processor service and the following minimal script which is waiting for 2 min and performing an action

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from omero.gateway import BlitzGateway
from omero.rtypes import rstring

import omero.scripts as scripts
import time
if __name__ == "__main__":

    client = scripts.client(
        'Example.py', """This script ...""",
        version="0.1",
    )

    try:
        conn = BlitzGateway(client_obj=client)
        time.sleep(120)
        conn.getObject("Image",12003)
        client.setOutput("Message", rstring("Success"))

    finally:
        client.closeSession()

The Processor service was stopped:

using omero admin ice "server stop Processor-0" or kill <process_pid>
after the execution of the script or during the execution of the script
with OMERO.py 5.19.4 and with OMERO.py built with this PR included

In all scenarios above, the Processor was successfully stopped with the following sequence of logs:

2024-09-04 10:24:17,574 INFO  [                       omero.util.Server] (MainThread) Cleanup
2024-09-04 10:24:17,574 INFO  [                    omero.util.Resources] (Thread-2  ) Halted
2024-09-04 10:24:17,574 INFO  [              omero.processor.ProcessorI] (MainThread) Cleaning up
# Several events about killing session, changing job status...
# The nature of these depending on the sequence of events and whether active processes are running
2024-09-04 10:24:17,579 INFO  [              omero.processor.ProcessorI] (MainThread) Done
2024-09-04 10:24:17,586 INFO  [                       omero.util.Server] (MainThread) Stopped

The omero.util.Resources log statement indicate the Task.run method is returning which is the expectation. The processor was successfully auto-restarted in all cases and new scripts were executable.

The only minor RFE that arose from this testing is that when interrupting the Processor while a script was running, an active OMERO.web session trying to poll the results of the script will throw the generic 500 page with the following stack trace

Traceback (most recent call last):
  File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/django/core/handlers/base.py", line 197, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/omeroweb/decorators.py", line 538, in wrapped
    retval = f(request, *args, **kwargs)
  File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/omeroweb/decorators.py", line 597, in wrapper
    context = f(request, *args, **kwargs)
  File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/omeroweb/webclient/views.py", line 3864, in activities
    cb = ScriptsCallback(conn.c, proc)
  File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/omero/callbacks.py", line 72, in __init__
    process.registerCallback(self.prx)
  File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/omero_Scripts_ice.py", line 1115, in registerCallback
    return _M_omero.grid.Process._op_registerCallback.invoke(self, ((cb, ), _ctx))
Ice.ObjectNotExistException: exception ::Ice::ObjectNotExistException
{
    id = 
    {
        name = D515E601-3969-445A-BB3D-382778797953
        category = 
    }
    facet = 
    operation = registerCallback
}
<WSGIRequest: GET '/webclient/activities/?_=1725446734744'>

and refreshing the window still displays the 500 so I had to log out/log in again. I assume this is possibly something that could be handled more gracefully at the OMERO.web level, handling the underlying server exception.

Unless someone thinks of another testing scenario to evaluate, I would be inclined to get this released as OMERO.py 5.19.5 and start consuming it in various places to ensure it does not cause any unexpected regression

jburel · 2024-09-16T14:36:51Z

Released in 5.19.5

Set daemon status on keepalive thread

5891208

chris-allan requested review from knabar, will-moore and sbesson August 5, 2024 09:08

knabar approved these changes Aug 5, 2024

View reviewed changes

chris-allan mentioned this pull request Aug 5, 2024

Use connection manager for keepalives glencoesoftware/omero2pandas#20

Merged

will-moore approved these changes Aug 5, 2024

View reviewed changes

sbesson reviewed Aug 27, 2024

View reviewed changes

sbesson approved these changes Sep 4, 2024

View reviewed changes

jburel merged commit 789a097 into ome:master Sep 16, 2024
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent hang on exit while omero.client keepalive is active #424

Prevent hang on exit while omero.client keepalive is active #424

DavidStirling commented Jul 31, 2024

joshmoore commented Jul 31, 2024

chris-allan commented Aug 2, 2024

chris-allan commented Aug 5, 2024

chris-allan commented Aug 5, 2024

knabar left a comment

will-moore commented Aug 5, 2024

sbesson left a comment

sbesson left a comment

jburel commented Sep 16, 2024

Prevent hang on exit while omero.client keepalive is active #424

Prevent hang on exit while omero.client keepalive is active #424

Conversation

DavidStirling commented Jul 31, 2024

joshmoore commented Jul 31, 2024

chris-allan commented Aug 2, 2024

chris-allan commented Aug 5, 2024

chris-allan commented Aug 5, 2024

knabar left a comment

Choose a reason for hiding this comment

will-moore commented Aug 5, 2024

sbesson left a comment

Choose a reason for hiding this comment

sbesson left a comment

Choose a reason for hiding this comment

jburel commented Sep 16, 2024