gRPC async server loses track of _futures #1652

Open
peter-resnick opened this issue Mar 29, 2024 · 0 comments

peter-resnick commented Mar 29, 2024

Hi MLServer -

To start off, this is an awesome tool, and the team has done impressive work to get to this point.

I'm currently using MLServer in a high-throughput, low-latency system where we use gRPC to perform inferences. We have added an asynchronous capability to our inference client, which sends many requests to the gRPC server at once (typically about 25). We have a timeout set on our client, and we first started seeing a number of DEADLINE_EXCEEDED responses. When I started looking into the model servers themselves to figure out why they had started to exceed deadlines (we hadn't experienced this very often in the past), it looked like the response processing loop is actually being restarted due to messages being lost.

We see the following traceback:

```
2024-03-28 19:56:42,015 [mlserver.parallel] ERROR - Response processing loop crashed. Restarting the loop...
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 186, in _process_responses_cb
    process_responses.result()
  File "/usr/local/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 207, in _process_responses
    self._async_responses.resolve(response)
  File "/usr/local/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 102, in resolve
    future = self._futures[message_id]
KeyError: 'cea95af0-859f-413a-a033-dfbe51e96c05'
```

where the dispatcher is trying to resolve the future for a given message, but that message ID is no longer tracked.
Once this error occurs, all of the rest of our parallel inference requests fail with the same exception (with a different message_id, obviously).

I took a look at the source code, and it looks like when process_responses.result() is called, the logic has a blanket exception handler for anything that isn't an asyncio.CancelledError and assumes the processing loop has crashed, so it restarts it by scheduling a new task. It's not immediately clear (to me, at least) whether this is really what should be happening. I don't see any signals from the server that the processing loop actually crashed; it just seems to be confused about which message it's supposed to be getting.
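To make that concrete, here is a minimal, simplified sketch of the pattern I'm describing, reconstructed only from the traceback above. This is not MLServer's actual implementation; the queue, the message shape, and the helper names are placeholders. The point is the shape of the logic: a done-callback that treats any non-CancelledError as "the loop crashed" and reschedules it, while the KeyError itself comes from resolving a message_id that is no longer in _futures.

```python
# Simplified sketch of the pattern described above, reconstructed from the traceback.
# NOT MLServer's actual code; the response queue and message shape are placeholders.
import asyncio


class AsyncResponses:
    def __init__(self):
        self._futures = {}  # message_id -> asyncio.Future

    def resolve(self, response):
        message_id = response["id"]              # placeholder message shape
        future = self._futures.pop(message_id)   # KeyError if the id is no longer tracked
        future.set_result(response)


class Dispatcher:
    def __init__(self, response_queue: asyncio.Queue):
        self._async_responses = AsyncResponses()
        self._response_queue = response_queue

    def start(self):
        task = asyncio.create_task(self._process_responses())
        task.add_done_callback(self._process_responses_cb)

    def _process_responses_cb(self, process_responses: asyncio.Task):
        try:
            process_responses.result()
        except asyncio.CancelledError:
            pass  # shutting down, nothing to do
        except Exception:
            # Blanket handler: any other error (e.g. the KeyError above) is treated
            # as the whole loop having crashed, and a fresh loop task is scheduled.
            self.start()

    async def _process_responses(self):
        while True:
            response = await self._response_queue.get()  # placeholder source of responses
            self._async_responses.resolve(response)
```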

As a note about our system setup, we have these deployed in Kubernetes (as is our client app) as a Deployment with between 10 and 15 pods at any given time, with the environment variable MLSERVER_PARALLEL_WORKERS=16.

We are also using a grpc.aio.insecure_channel(server) pattern to manage the gRPC interactions on the client side.
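For reference, this is roughly what our client-side fan-out looks like. It's a minimal sketch rather than our production code: the host, timeout, model name, and tensor shape are placeholder values, and it assumes the V2 dataplane stubs shipped under mlserver.grpc.

```python
# Minimal sketch of the client pattern: ~25 concurrent ModelInfer calls over one
# grpc.aio channel, each with a client-side deadline. Placeholder values throughout.
import asyncio

import grpc
from mlserver.grpc import dataplane_pb2, dataplane_pb2_grpc

HOST = "mlserver:8081"  # placeholder address
TIMEOUT = 5.0           # placeholder client-side deadline, in seconds


def build_request(i: int) -> dataplane_pb2.ModelInferRequest:
    # Hypothetical single-tensor request; adjust names, dtypes and shapes to your model.
    return dataplane_pb2.ModelInferRequest(
        model_name="my-model",
        inputs=[
            dataplane_pb2.ModelInferRequest.InferInputTensor(
                name="input-0",
                datatype="FP32",
                shape=[1, 3],
                contents=dataplane_pb2.InferTensorContents(fp32_contents=[float(i)] * 3),
            )
        ],
    )


async def infer_once(stub, request):
    # DEADLINE_EXCEEDED surfaces here as a grpc.aio.AioRpcError once the server stalls.
    return await stub.ModelInfer(request, timeout=TIMEOUT)


async def main():
    async with grpc.aio.insecure_channel(HOST) as channel:
        stub = dataplane_pb2_grpc.GRPCInferenceServiceStub(channel)
        requests = [build_request(i) for i in range(25)]
        return await asyncio.gather(
            *(infer_once(stub, r) for r in requests), return_exceptions=True
        )


if __name__ == "__main__":
    results = asyncio.run(main())
```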
