
Subscription streaming pull requests are disconnecting after 15 minutes #1135

Closed
feywind opened this issue Oct 21, 2020 · 98 comments
Assignees
Labels
api: pubsub Issues related to the googleapis/nodejs-pubsub API. external This issue is blocked on a bug with the actual product. priority: p3 Desirable enhancement or fix. May not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@feywind
Copy link
Collaborator

feywind commented Oct 21, 2020

This started out of this issue: #979

I just reverted the change that exposed the stream cancellations, but they are still happening under the covers. I think it would behoove us to figure out why. The commonality that I've seen so far is that it generally involves crossing a Docker networking boundary. A working reproduction is available on that issue above, though it only seems to happen on grpc-js, so it might also have something to do with the Node HTTP/2 stack.

Several question marks here, and I just don't want to lose track of that investigation, even if we fix the user breakage for now.

@feywind feywind added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. api: pubsub Issues related to the googleapis/nodejs-pubsub API. labels Oct 21, 2020
@feywind feywind self-assigned this Dec 7, 2020
@dpkirchner
Copy link

Is this issue intended to track or resolve the reconnect issues from #979 and #770 in addition to the disconnects?

@jeffijoe
Copy link
Contributor

We are still seeing this issue, with the addition of the following being logged to console:

Connection to [object Object] rejected by server because of excess pings

@yoshi-automation yoshi-automation added 🚨 This issue needs some love. and removed 🚨 This issue needs some love. labels Mar 19, 2021
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Apr 19, 2021

@feywind
Copy link
Collaborator Author

feywind commented Apr 26, 2021

Yikes... so the new behaviour is that it just sort of silently stops receiving messages?

Do you happen to have grpc debug output? You can get that by setting the environment variables:

export GRPC_VERBOSITY=DEBUG
export GRPC_TRACE=all


@feywind
Copy link
Collaborator Author

feywind commented Apr 26, 2021

And one more question, does your use case involve Docker? (GKE or whatnot...) I've found that that seems to be a commonality for this problem.

@jeffijoe
Copy link
Contributor

I think this is what we have been experiencing for the past few years, but it went from Call canceled to silent connection drop.

@feywind this issue is happening with the real PubSub service as well, same conditions. It appears that a quiet connection is closed. It does not happen in production for us (at least not yet) because it stays busy.


@feywind
Copy link
Collaborator Author

feywind commented Apr 26, 2021

My vague recollection of debugging one of the linked issues is that I only saw this happening (and it was really reproducible) when there was a Docker network boundary involved. So emulator in a container, client on GKE, or something like that. I worked with the grpc team to try to figure out why, and I don't think we ever found anything super useful in the grpc traces either. :( We thought it was at least worked around by way of papering over the reconnects, but it sounds like that's not happening either.

So I see two issues:

  1. The client library behaviour is not great for this error if it just sort of silently stops working. :|

For this one, I figured that letting it retry when disconnected would at least roll us back to where we were. It seems like maybe that's not working though...

  2. HTTP/2 connections mysteriously go away after a while, especially on Docker

This one I'm less sure about, but I can get my repro case going again and bug the grpc team to see if they can find anything else here. I still suspect weirdness between the Node HTTP/2 stack and Docker iptables rules used to do container isolation, but that's admittedly just a hunch so far.


@feywind
Copy link
Collaborator Author

feywind commented Apr 27, 2021

Thanks! We can also work out a non-GitHub-public-issue way to get the logs over, if that helps.


@feywind
Copy link
Collaborator Author

feywind commented Apr 27, 2021

@murgatroid99 might know the GRPC_TRACE answer.

@murgatroid99
Copy link

grpc-js uses the same environment variables, and GRPC_TRACE=all will still trace everything.

@murgatroid99
Copy link

If you want to narrow the trace output, the output format from grpc-js is <timestamp> | <tracer name> | <log text>. The value of the GRPC_TRACE environment variable can also be a comma-separated list of tracer names, and it will then log only those lines.
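
For example, to trace only a few subsystems (the tracer names below are illustrative; the set of available tracers varies between grpc-js versions, so check the grpc-js documentation for your version):

export GRPC_VERBOSITY=DEBUG
export GRPC_TRACE=subchannel,keepalive,call_stream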


@jeffijoe
Copy link
Contributor

Grasping for straws here but I wonder if this is related: https://www.hostedgraphite.com/blog/deadlines-lies-and-videotape-the-tale-of-a-grpc-bug


@meredithslota meredithslota added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed priority: p3 Desirable enhancement or fix. May not be included in next release. labels May 19, 2022
@meredithslota
Copy link
Contributor

(Still marked external since we think it's an issue elsewhere but impacting users of this library — we just haven't been able to nail down what the specific issue is yet.) Any additional repro steps would be super helpful!

@feywind feywind added priority: p3 Desirable enhancement or fix. May not be included in next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. status: investigating The issue is under investigation, which is determined to be non-trivial. labels Aug 15, 2022
@feywind
Copy link
Collaborator Author

feywind commented Aug 17, 2022

Linked to the meta-issue about transport problems: b/242894947

@edorivai
Copy link

edorivai commented Nov 24, 2022

To add a data point here: we're seeing something similar happening in our setup:

  • Pubsub topic
    • around 20K messages published per second
    • message ordering enabled
  • GKE deployment for subscribers with an autoscaling pool of 8-20 pods. (1 subscriber per pod)
    • Node.js subscribers using the @google-cloud/pubsub client
    • Streaming pull (pubsub.subscription(name).on('message', ...))
  • subscribers seemingly randomly stop receiving events (more details below)
  • we are able to reproduce this without a docker network boundary present

Normal operations

The system performs as expected for hours on end; processing keeps up with publishing:
[screenshot]

Backlog accumulating

However, zooming out, we observe longer periods where the subscriber pool doesn't keep up:
[screenshot]

Throughput RCA

We used to have CPU bottlenecks in our Postgres instance, but we solved those by fronting it with Redis.
[screenshot]
Note the spikes up to 100% CPU before Nov 21; after Nov 21 we implemented caching and confirmed that PG is no longer a bottleneck.

Redis was never a bottleneck, but we did implement some in-memory caching to alleviate the load on Redis as well:
[screenshot]

And to close the loop, our GKE pool is autoscaling successfully, so no CPU bottlenecks there either
[screenshot]

All this to say that we're pretty sure that the bottleneck lies elsewhere.

Subscribers stop receiving messages

We added some custom logging to our subscribers, which accumulates counts of how many messages are processed in 10-second intervals.

When operating normally, it looks like this:
[screenshot]
The first log line indicates that one subscriber processed 34,050 messages in 10 seconds; that count is also added to the summary line, as seen in the subsequent log lines.

We started noticing that when a backlog accumulates, multiple subscribers are not processing any messages:
[screenshot]

This brings down the processing rate of the subscriber pool, causing the backlog to grow.
[screenshot]

We added listeners for error, close and debug events on the subscription, but none of those events fire when this happens.
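
A minimal sketch of this kind of instrumentation, assuming the standard streaming-pull API of @google-cloud/pubsub (the subscription name, the 10-second window, and the log format are illustrative placeholders, not the reporter's actual code):

    import { PubSub, Message } from '@google-cloud/pubsub';

    const pubsub = new PubSub();
    // Placeholder subscription name; flow-control options omitted for brevity.
    const subscription = pubsub.subscription('my-subscription');

    let processedInWindow = 0;

    subscription.on('message', (message: Message) => {
      processedInWindow += 1;
      // ... real message handling goes here ...
      message.ack();
    });

    // Per the report above, none of these fire when delivery silently stops.
    subscription.on('error', (err: Error) => console.error('subscription error', err));
    subscription.on('close', () => console.warn('subscription closed'));
    subscription.on('debug', (msg: unknown) => console.debug('subscription debug', msg)); // newer library versions

    // Emit a processed-message count every 10 seconds.
    setInterval(() => {
      console.log(`messages=${processedInWindow}`);
      processedInWindow = 0;
    }, 10_000);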

Workaround

As a Hail Mary, we decided to implement the workaround mentioned here - basically turning it off and on again 🎉. If a subscriber is detected not to be receiving any events, we .close() and immediately .open() the subscription.
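
A rough sketch of that restart logic, building on the subscription object from the instrumentation sketch above; Subscription.close() and Subscription.open() are existing methods of the Node.js client, but the idle threshold and check interval here are arbitrary examples:

    // Naive idle detection: if no message has arrived for IDLE_MS, restart the stream.
    const IDLE_MS = 60_000; // arbitrary threshold for illustration
    let lastMessageAt = Date.now();

    subscription.on('message', () => {
      lastMessageAt = Date.now();
    });

    setInterval(async () => {
      if (Date.now() - lastMessageAt > IDLE_MS) {
        console.warn('subscription looks idle, restarting streaming pull');
        await subscription.close(); // tear down the underlying streams
        subscription.open();        // immediately re-open and resume pulling
        lastMessageAt = Date.now(); // avoid restarting again on the very next tick
      }
    }, IDLE_MS);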

This has been very successful so far:
[screenshot]
This graphs the number of log lines with messages=0; the workaround was deployed right before 18:00.

It has not fixed the problem 100%. The fix was deployed yesterday, Nov 23, and we still observe some backlog spikes today (Nov 24), but the system does seem to recover much more quickly than before the fix.
[screenshot]

Theory

I'm not familiar with the inner workings of pubsub, but I have one theory of what might cause this:

  • message ordering is turned on, so it's probably sharding the messages under the hood
  • I imagine that to assure subscriber affinity, a single shard can only be serviced by a single subscriber at any time
  • due to qualitative differences between messages, some shards may take substantially longer to process individual messages than others
  • this would cause the queue in the shard to grow faster than other shards
  • pubsub decides it would like to split up the big (or slow) shard
  • to do so, it needs to flush it first
  • the shard stops receiving messages
  • at some point it's empty, and the subscriber connected to the shard stops receiving messages
  • the ordering keys originally assigned to the slow shard are now distributed over other (potentially new) shards
  • for whatever reason, the subscriber doesn't reconnect to other shards (or perhaps it takes a really long time)
  • us restarting the subscriber connection (close, then open) allows the subscriber to disconnect from the flushed shard and reconnect to another one
  • it starts receiving messages again

Local debugging

To debug this, I ran one subscriber instance on my local machine. Note: in contrast to the production GKE deployment, this local Node.js process did not run within Docker.

I observed this subscriber receiving messages, then dropping down to 0 messages over 10-second intervals. That triggered the restart a couple of times; then at some point it did start receiving messages again. This pattern repeated a couple of times.
[screenshot]

If my theory is half-correct, I could imagine this subscriber connecting to an empty shard a couple of times, until at some point the restart allowed it to connect to a live shard.

Hope this helps. FWIW, we have a running support contract with GCP, and I'm open to hopping on a call and poking at our system with somebody from GCP.

@feywind
Copy link
Collaborator Author

feywind commented Nov 24, 2022

@edorivai Thanks for the really detailed comment! I think we're seeing this in situations other than just ordering keys, but let me tag in @kamalaboulhosn in regards to the service-side speculation. (Kamal: look for the Theory heading above)

Most of the related team is off for US holidays right now, so it may be a bit.

We still haven't been able to find the fire to go with the smoke on this issue, so I am starting to think maybe we should just temporarily implement the workaround everyone is using anyway ("have you tried turning it off and back on again?" :) I don't think it's something we should encourage for regular use or leave in there indefinitely, but I also don't like users having to deal with it for so long.

@kamalaboulhosn
Copy link
Contributor

This isn't exactly how Pub/Sub's sharding of ordering keys works on the subscribe side. The only trigger that would cause changing the association of ordering keys with a subscriber would be the addition of new subscriber clients that need to receive messages. That would result in no messages being sent out on some set of keys until the outstanding messages for those keys are acknowledged. However, this is not tied to the publishing of messages, so the shard does not need to be empty.

It is possible for subscribers to be assigned a set of keys that have no publishes as we don't balance keys based on load to subscribers. This would depend greatly on the number of subscriber clients and the diversity of ordering keys used.

@edorivai
Copy link

@kamalaboulhosn thank you for that context!

We explicitly tested whether auto-scaling (changing the no. of subscribers) caused these throughput issues. We basically ran our GKE workload on a fixed number (12) of pods. Even under those constraints, we saw a lot of periods where many (more than 50% of the pool) subscribers would receive no messages.

Additionally, in our case, the number of ordering keys is fairly constant over time.

@modestaspruckus
Copy link

Hey, I'm experiencing the same issue. My subscribers with flow control are not pulling any new unacked messages. I saw this happen at the exact time that invalid_argument errors appeared on the streaming pull API.

[screenshot]

Package version: 3.4.1

Tried disabling exactlyOnceDelivery - still the same.

@kamalaboulhosn
Copy link
Contributor

This issue has covered a lot of different causes and investigations that are not entirely related, including issues around ordered, unordered, and exactly-once subscriptions. Going forward, if anyone is still experiencing issues, please enter a support case. Thanks!

@dor-benatia-vcita
Copy link

Why was this closed? Closing and reopening the subscription every 15 minutes does work, but only as a workaround - I'm very skeptical that this is good practice.

@webnator
Copy link

I don't really get why this was closed. This is still an issue; I'm getting the same error using:

  • @google-cloud/pubsub: ^3.7.1
  • node:18-alpine

I have all the listeners properly implemented ('error', 'close'), yet when the error happens it is not handled by any of them, and the subscriber just silently stops listening. I think turning it on and off again is not a very clean solution for this, especially in production.

@edosrecki
Copy link

I did a lot of investigation into this issue and decided to open up a follow-up issue with my findings: #1885

@philvmx1
Copy link

This is also an issue with the .NET client (Google.Cloud.PubSub.V1), but only with OrderingKey (we have only a few), so it has to be a core issue related to that. I think this needs to be escalated to the core Pub/Sub service, not the client libraries.

@jeffijoe
Copy link
Contributor

@philvmx1 We also use the .NET client, but with no ordering keys, and we see the same issue there. At least in .NET we can detect it and reconnect.

@philvmx1
Copy link

At least in .NET we can detect it and reconnect.

@jeffijoe Can you share a snippet of how you're doing that, please?

@philvmx1
Copy link

Looking at the source code, I wonder if the server is hanging up, causing IsRpcCancellation() to be true and thus breaking the while loop.

                if (next.IsPull && (task.IsCanceled || (task.IsFaulted && (task.Exception.IsCancellation() || task.Exception.IsRpcCancellation()))))
                {
                    // Pull has been cancelled by user, shutdown pull stream and don't run continuation.
                    // RPC exceptions are dealt with in the relevant handlers.
                    StopStreamingPull();
                    // Pull process has been stopped, wait for push process to complete.
                    _pullComplete = true;
                }

https://github.com/googleapis/google-cloud-dotnet/blob/2e2659b8b970ba3e0cf2937f91016f1835fb4c0c/apis/Google.Cloud.PubSub.V1/Google.Cloud.PubSub.V1/SubscriberClientImpl.SingleChannel.cs#L232

WDYT?

@jeffijoe
Copy link
Contributor

I just wrap the subscribe call in a while loop that checks my cancellation token.

@lando2319
Copy link

Any updates? I've been searching for months for a solution. The Pub/Sub client just stops listening. I've tried multiple service modules, and I have a barebones test going that drops within 6-8 hours every time. Nothing in the logs or syslog; it simply stops listening.

I don't have many more moves left before we have to abandon Pub/Sub.

This is the worst fear with these services: it simply doesn't work, with no clues or solutions.

@seelrd
Copy link

seelrd commented May 7, 2024

I also experienced random drops after a while.

@lando2319
Copy link

In case this helps someone, I was finally able to fix this by creating a separate package.json with only the packages my services need.

Somewhere in the dependencies of dependencies, there was a package causing it to stop working.

@pietrodn
Copy link

pietrodn commented May 7, 2024

@lando2319, very interesting! Would you be able to share a package.json for which the bug is present (and possibly the associated lockfile), and the one you made where it is not? Someone might be able to isolate the faulty package versions.
