
Subscription streaming pull requests are disconnecting after 15 minutes #1135

Closed
feywind opened this issue Oct 21, 2020 · 98 comments
Assignees
Labels
api: pubsub Issues related to the googleapis/nodejs-pubsub API. external This issue is blocked on a bug with the actual product. priority: p3 Desirable enhancement or fix. May not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@feywind
Copy link
Collaborator

feywind commented Oct 21, 2020

This started out of this issue: #979

I just reverted the change that exposed the stream cancellations, but they are still happening under the covers. I think it would behoove us to figure out why. The commonality that I've seen so far is that it generally involves crossing a Docker networking boundary. A working reproduction is available on that issue above, though it only seems to happen on grpc-js, so it might also have something to do with the Node HTTP/2 stack.

Several question marks here, and I just don't want to lose track of that investigation, even if we fix the user breakage for now.

@feywind feywind added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. api: pubsub Issues related to the googleapis/nodejs-pubsub API. labels Oct 21, 2020
@feywind feywind self-assigned this Dec 7, 2020
@dpkirchner
Copy link

Is this issue intended to track or resolve the reconnect issues from #979 and #770 in addition to the disconnects?

@jeffijoe
Copy link
Contributor

We are still seeing this issue, with the addition of the following being logged to console:

Connection to [object Object] rejected by server because of excess pings

@yoshi-automation yoshi-automation added 🚨 This issue needs some love. and removed 🚨 This issue needs some love. labels Mar 19, 2021
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Apr 19, 2021

@feywind
Copy link
Collaborator Author

feywind commented Apr 26, 2021

Yikes... so the new behaviour is that it just sort of silently stops receiving messages?

Do you happen to have grpc debug output? You can get that by setting the environment variables:

export GRPC_VERBOSITY=DEBUG
export GRPC_TRACE=all


@feywind
Copy link
Collaborator Author

feywind commented Apr 26, 2021

And one more question, does your use case involve Docker? (GKE or whatnot...) I've found that that seems to be a commonality for this problem.

@jeffijoe
Copy link
Contributor

I think this is what we have been experiencing for the past few years, but it went from Call canceled to silent connection drop.

@feywind this issue is happening with the real PubSub service as well, same conditions. It appears that a quiet connection is closed. It does not happen in production for us (at least not yet) because it stays busy.


@feywind
Copy link
Collaborator Author

feywind commented Apr 26, 2021

My vague recollection of debugging one of the linked issues is that I only saw this happening (and it was really reproducible) when there was a Docker network boundary involved. So emulator in a container, client on GKE, or something like that. I worked with the grpc team to try to figure out why, and I don't think we ever found anything super useful in the grpc traces either. :( We thought it was at least worked around by way of papering over the reconnects, but it sounds like that's not happening either.

So I see two issues:

  1. The client library behaviour is not great for this error if it just sort of silently stops working. :|

For this one, I figured that letting it retry when disconnected would at least roll us back to where we were. It seems like maybe that's not working though...

  2. HTTP/2 connections mysteriously go away after a while, especially on Docker

This one I'm less sure about, but I can get my repro case going again and bug the grpc team to see if they can find anything else here. I still suspect weirdness between the Node HTTP/2 stack and Docker iptables rules used to do container isolation, but that's admittedly just a hunch so far.


@feywind
Copy link
Collaborator Author

feywind commented Apr 27, 2021

Thanks! We can also work out a non-GitHub-public-issue way to get the logs over, if that helps.


@feywind
Copy link
Collaborator Author

feywind commented Apr 27, 2021

@murgatroid99 might know the GRPC_TRACE answer.

@murgatroid99
Copy link

grpc-js uses the same environment variables, and GRPC_TRACE=all will still trace everything.

@murgatroid99
Copy link

If you want to narrow the trace output, the output format from grpc-js is <timestamp> | <tracer name> | <log text>. The value of the GRPC_TRACE environment variable can also be a comma-separated list of tracer names, and it will then log only those lines.
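
For example, to trace only a few subsystems (the tracer names below are illustrative; the set of available tracers varies between grpc-js versions, so check the grpc-js documentation for your version):

export GRPC_VERBOSITY=DEBUG
export GRPC_TRACE=subchannel,keepalive,call_stream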


@jeffijoe
Copy link
Contributor

Grasping for straws here but I wonder if this is related: https://www.hostedgraphite.com/blog/deadlines-lies-and-videotape-the-tale-of-a-grpc-bug


@meredithslota meredithslota added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed priority: p3 Desirable enhancement or fix. May not be included in next release. labels May 19, 2022
@meredithslota
Copy link
Contributor

(Still marked external since we think it's an issue elsewhere but impacting users of this library — we just haven't been able to nail down what the specific issue is yet.) Any additional repro steps would be super helpful!

@feywind feywind added priority: p3 Desirable enhancement or fix. May not be included in next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. status: investigating The issue is under investigation, which is determined to be non-trivial. labels Aug 15, 2022
@feywind
Copy link
Collaborator Author

feywind commented Aug 17, 2022

Linked to the meta-issue about transport problems: b/242894947

@edorivai
Copy link

edorivai commented Nov 24, 2022

To add a data point here: we're seeing something similar happening in our setup:

  • Pubsub topic
    • around 20K messages published per second
    • message ordering enabled
  • GKE deployment for subscribers with an autoscaling pool of 8-20 pods. (1 subscriber per pod)
    • Node.js subscribers using the @google-cloud/pubsub client
    • Streaming pull (pubsub.subscription(name).on('message', ...))
  • subscribers seemingly randomly stop receiving events (more details below)
  • we are able to reproduce this without a docker network boundary present

Normal operations

The system performs as expected for hours on end; processing keeps up with publishing:
[screenshot]

Backlog accumulating

However, zooming out, we observe longer periods where the subscriber pool doesn't keep up:
[screenshot]

Throughput RCA

We used to have CPU bottlenecks in our Postgres instance, but we solved those by fronting it with Redis.
[screenshot]
Note the spikes up to 100% CPU before Nov 21; after Nov 21 we implemented caching and confirmed that PG is no longer a bottleneck.

Redis was never a bottleneck, but we did implement some in-memory caching to alleviate the load on Redis as well:
[screenshot]

And to close the loop, our GKE pool is autoscaling successfully, so no CPU bottlenecks there either
[screenshot]

All this to say that we're pretty sure that the bottleneck lies elsewhere.

Subscribers stop receiving messages

We added some custom logging to our subscribers, which accumulates counts of how many messages are processed in 10-second intervals.

When operating normally, it looks like this:
[screenshot]
The first log line indicates that one subscriber processed 34,050 messages in 10 seconds; that count is also added to the summary line, as seen in the subsequent log lines.

We started noticing that when a backlog accumulates, multiple subscribers are not processing any messages:
[screenshot]

This brings down the processing rate of the subscriber pool, causing the backlog to grow.
[screenshot]

We added listeners for error, close and debug events on the subscription, but none of those events fire when this happens.
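
A minimal sketch of this kind of instrumentation, assuming the standard streaming-pull API of @google-cloud/pubsub (the subscription name, the 10-second window, and the log format are illustrative placeholders, not the reporter's actual code):

    import { PubSub, Message } from '@google-cloud/pubsub';

    const pubsub = new PubSub();
    // Placeholder subscription name; flow-control options omitted for brevity.
    const subscription = pubsub.subscription('my-subscription');

    let processedInWindow = 0;

    subscription.on('message', (message: Message) => {
      processedInWindow += 1;
      // ... real message handling goes here ...
      message.ack();
    });

    // Per the report above, none of these fire when delivery silently stops.
    subscription.on('error', (err: Error) => console.error('subscription error', err));
    subscription.on('close', () => console.warn('subscription closed'));
    subscription.on('debug', (msg: unknown) => console.debug('subscription debug', msg)); // newer library versions

    // Emit a processed-message count every 10 seconds.
    setInterval(() => {
      console.log(`messages=${processedInWindow}`);
      processedInWindow = 0;
    }, 10_000);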

Workaround

As a Hail Mary, we decided to implement the workaround mentioned here - basically turning it off and on again 🎉. If a subscriber is detected not to be receiving any events, we .close() and immediately .open() the subscription.
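
A rough sketch of that restart logic, building on the subscription object from the instrumentation sketch above; Subscription.close() and Subscription.open() are existing methods of the Node.js client, but the idle threshold and check interval here are arbitrary examples:

    // Naive idle detection: if no message has arrived for IDLE_MS, restart the stream.
    const IDLE_MS = 60_000; // arbitrary threshold for illustration
    let lastMessageAt = Date.now();

    subscription.on('message', () => {
      lastMessageAt = Date.now();
    });

    setInterval(async () => {
      if (Date.now() - lastMessageAt > IDLE_MS) {
        console.warn('subscription looks idle, restarting streaming pull');
        await subscription.close(); // tear down the underlying streams
        subscription.open();        // immediately re-open and resume pulling
        lastMessageAt = Date.now(); // avoid restarting again on the very next tick
      }
    }, IDLE_MS);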

This has been very successful so far:
[screenshot]
This graphs the number of log lines with messages=0; the workaround was deployed right before 18:00.

It has not fixed the problem 100%. The fix was deployed yesterday, Nov 23, and we still observe some backlog spikes today (Nov 24), but the system does seem to recover much more quickly than before the fix.
[screenshot]

Theory

I'm not familiar with the inner workings of pubsub, but I have one theory of what might cause this:

  • message ordering is turned on, so it's probably sharding the messages under the hood
  • I imagine that to assure subscriber affinity, a single shard can only be serviced by a single subscriber at any time
  • due to qualitative differences between messages, some shards may take substantially longer to process individual messages than others
  • this would cause the queue in the shard to grow faster than other shards
  • pubsub decides it would like to split up the big (or slow) shard
  • to do so, it needs to flush it first
  • the shard stops receiving messages
  • at some point it's empty, and the subscriber connected to the shard stops receiving messages
  • the ordering keys originally assigned to the slow shard are now distributed over other (potentially new) shards
  • for whatever reason, the subscriber doesn't reconnect to other shards (or perhaps it takes a really long time)
  • us restarting the subscriber connection (close, then open) allows the subscriber to disconnect from the flushed shard and reconnect to another one
  • it starts receiving messages again

Local debugging

To debug this, I ran one subscriber instance on my local machine. Note: in contrast to the production GKE deployment, this local Node.js process did not run within Docker.

I observed this subscriber receiving messages, then dropping down to 0 messages over 10-second intervals. That triggered the restart a couple of times; then at some point it did start receiving messages again. This pattern repeated a couple of times.
[screenshot]

If my theory is half-correct, I could imagine this subscriber connecting to an empty shard a couple of times, until at some point the restart allowed it to connect to a live shard.

Hope this helps. FWIW, we have a running support contract with GCP, and I'm open to hopping on a call and poking at our system with somebody from GCP.

@feywind
Copy link
Collaborator Author

feywind commented Nov 24, 2022

@edorivai Thanks for the really detailed comment! I think we're seeing this in situations other than just ordering keys, but let me tag in @kamalaboulhosn in regards to the service-side speculation. (Kamal: look for the Theory heading above)

Most of the related team is off for US holidays right now, so it may be a bit.

We still haven't been able to find the fire to go with the smoke on this issue, so I am starting to think maybe we should just temporarily implement the workaround everyone is using anyway ("have you tried turning it off and back on again?" :) I don't think it's something we should encourage for regular use or leave in there indefinitely, but I also don't like users having to deal with it for so long.

@kamalaboulhosn
Copy link
Contributor

This isn't exactly how Pub/Sub's sharding of ordering keys works on the subscribe side. The only trigger that would cause changing the association of ordering keys with a subscriber would be the addition of new subscriber clients that need to receive messages. That would result in no messages being sent out on some set of keys until the outstanding messages for those keys are acknowledged. However, this is not tied to the publishing of messages, so the shard does not need to be empty.

It is possible for subscribers to be assigned a set of keys that have no publishes as we don't balance keys based on load to subscribers. This would depend greatly on the number of subscriber clients and the diversity of ordering keys used.

@edorivai
Copy link

@kamalaboulhosn thank you for that context!

We explicitly tested whether auto-scaling (changing the no. of subscribers) caused these throughput issues. We basically ran our GKE workload on a fixed number (12) of pods. Even under those constraints, we saw a lot of periods where many (more than 50% of the pool) subscribers would receive no messages.

Additionally, in our case, the number of ordering keys is fairly constant over time.

@modestaspruckus
Copy link

Hey, I'm experiencing the same issue. My subscribers with flow control are not pulling any new unacked messages. I saw this happen at the exact time that invalid_argument errors appeared on the streaming pull API.

[screenshot]

Package version: 3.4.1

Tried disabling exactlyOnceDelivery - still the same.

@kamalaboulhosn
Copy link
Contributor

This issue has covered a lot of different causes and investigations that are not entirely related, including issues around ordered, unordered, and exactly-once subscriptions. Going forward, if anyone is still experiencing issues, please enter a support case. Thanks!

@dor-benatia-vcita
Copy link

Why was this closed? Closing and reopening the subscription every 15 minutes does work, but only as a workaround - I'm very skeptical that this is good practice.

@webnator
Copy link

I don't really get why this was closed. This is still an issue; I'm getting the same error using:

  • @google-cloud/pubsub: ^3.7.1
  • node:18-alpine

I have all the listeners properly implemented ('error', 'close'), yet when the error happens it is not handled by any of them, and the subscriber just silently stops listening. I think turning it on and off again is not a very clean solution for this, especially in production.

@edosrecki
Copy link

I did a lot of investigation into this issue and decided to open up a follow-up issue with my findings: #1885

@philvmx1
Copy link

This is also an issue with the .NET client (Google.Cloud.PubSub.V1), but only with OrderingKey (we have only a few), so it has to be a core issue related to that. I think this needs to be escalated to the core Pub/Sub service, not the client libraries.

@jeffijoe
Copy link
Contributor

@philvmx1 We also use the .NET client, but with no ordering keys, and we see the same issue there. At least in .NET we can detect it and reconnect.

@philvmx1
Copy link

At least in .NET we can detect it and reconnect.

@jeffijoe Can you share a snippet of how you're doing that, please?

@philvmx1
Copy link

Looking at the source code, I wonder if the server is hanging up, causing IsRpcCancellation() to be true and thus breaking the while loop.

                if (next.IsPull && (task.IsCanceled || (task.IsFaulted && (task.Exception.IsCancellation() || task.Exception.IsRpcCancellation()))))
                {
                    // Pull has been cancelled by user, shutdown pull stream and don't run continuation.
                    // RPC exceptions are dealt with in the relevant handlers.
                    StopStreamingPull();
                    // Pull process has been stopped, wait for push process to complete.
                    _pullComplete = true;
                }

https://github.com/googleapis/google-cloud-dotnet/blob/2e2659b8b970ba3e0cf2937f91016f1835fb4c0c/apis/Google.Cloud.PubSub.V1/Google.Cloud.PubSub.V1/SubscriberClientImpl.SingleChannel.cs#L232

WDYT?

@jeffijoe
Copy link
Contributor

I just wrap the subscribe call in a while loop that checks my cancellation token.

@lando2319
Copy link

Any updates? I've been searching for months for a solution. The Pub/Sub client just stops listening. I've tried multiple service modules, and I have a barebones test going that drops within 6-8 hours every time. Nothing in the logs or syslog; it simply stops listening.

I don't have many more moves left before we have to abandon Pub/Sub.

This is the worst fear with these services: it simply doesn't work, with no clues or solutions.

@seelrd
Copy link

seelrd commented May 7, 2024

I also experienced random drops after a while.

@lando2319
Copy link

In case this helps someone, I was finally able to fix this by creating a separate package.json with only the packages my services need.

Somewhere in the dependencies of dependencies, there was a package causing it to stop working.

@pietrodn
Copy link

pietrodn commented May 7, 2024

@lando2319, very interesting! Would you be able to share a package.json for which the bug is present (and possibly the associated lockfile), and the one you made where it is not? Someone might be able to isolate the faulty package versions.
