Subscription errors and does not reconnect when connection is cancelled by server #979

Closed
Thr1ve opened this issue Apr 22, 2020 · 37 comments · Fixed by #1132
Assignees
feywind

Labels
api: pubsub (Issues related to the googleapis/nodejs-pubsub API)
priority: p2 (Moderately-important priority. Fix may not be included in next release.)
🚨 This issue needs some love.
type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments

@Thr1ve

Thr1ve commented Apr 22, 2020

After upgrading @google-cloud/pubsub to escape a CPU issue in @grpc/grpc-js, we found that services using a long-running pubsub subscription began to crash periodically during local development and in GKE.

Issue

When the pubsub service closes the underlying http2 connection (this happens after 15 minutes), the connected subscription instance emits an error event and does not reconnect.

In a previous version of the library, the subscription instance would reconnect (and wouldn't emit an error event). However, it looks like the CANCELLED status was removed from the RETRY_CODES list found in src/pull-retry.ts in this commit, which means we skip the retry block and move on to destroy the stream with a StatusError in src/message-stream.ts here.

Since the pubsub service reserves the right to cancel a connection at any time, I would expect this library to handle reconnecting when the pubsub service cancels said connection, rather than emitting an error event on the subscription instance.

The simplest workaround I've found so far is to manually add the retry code for CANCELLED back into the RETRY_CODES list before requiring @google-cloud/pubsub proper:

const { RETRY_CODES } = require('@google-cloud/pubsub/build/src/pull-retry');
// 1 is the gRPC status code for CANCELLED.
RETRY_CODES.push(1);

const { PubSub } = require('@google-cloud/pubsub');
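
If patching the library's internal module feels too fragile, the reconnect can also be handled in userland by watching for the CANCELLED status on the subscription's error event and re-opening the stream. This is only a sketch of that idea, assuming the public Subscription close()/open() methods; backoff and retry limits are left to the reader:

const { PubSub } = require('@google-cloud/pubsub');

// gRPC status code 1 is CANCELLED.
const CANCELLED = 1;

const pubsub = new PubSub();
const subscription = pubsub.subscription('test-topic-subscription');

subscription.on('message', (message) => message.ack());

subscription.on('error', async (error) => {
  if (error.code === CANCELLED) {
    // The server dropped the stream; bounce the subscriber instead of crashing.
    await subscription.close();
    subscription.open();
    return;
  }
  console.error('Unrecoverable subscription error', error);
});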

Environment details

  • OS: GKE
  • Node.js version: 13.11.0
  • package versions:
    • @google-cloud/pubsub: 1.7.2
    • google-gax: 1.15.2
    • @grpc/grpc-js: 0.7.9

Steps to reproduce

The simplest way to reproduce this issue is to create a subscription to a topic using the pubsub-emulator and wait 15 minutes.
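
For reference, assuming the Cloud SDK is installed, the emulator can be started locally with a command along these lines:

gcloud beta emulators pubsub start --host-port=localhost:8085 --project=local-dev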

Example reproduction code:

const { PubSub } = require('@google-cloud/pubsub');

const options = {
  projectId: 'local-dev',
  apiEndpoint: 'localhost:8085',
};

const pubsub = new PubSub(options);

const topicName = 'test-topic';
const subscriptionName = 'test-topic-subscription';

const run = async () => {
  const topicDefinition = pubsub.topic(topicName);

  await topicDefinition.get({ autoCreate: true });

  const subscriptionDefinition = topicDefinition.subscription(subscriptionName);

  const [subscription] = await subscriptionDefinition.get({ autoCreate: true });

  subscription.on('message', (message) => {
    console.log('message received', { message });
  });

  subscription.on('error', (error) => {
    console.error(`Error in pubsub subscription: ${error.message}`, {
      error, topicName, subscriptionName,
    });
  });

  subscription.on('close', () => {
    console.error('Pubsub subscription closed', {
      topicName, subscriptionName,
    });
  });

  console.log('handlers added to subscription');
};

run().catch(console.error);

Running the above script without sending any messages to the topic yielded this error after 15 minutes:

Error in pubsub subscription: Call cancelled {
  error: StatusError: Call cancelled
      at MessageStream._onEnd ([...]/Development/temp/testing-pubsub/node_modules/@google-cloud/pubsub/build/src/message-stream.js:234:26)
      at MessageStream._onStatus ([...]/Development/temp/testing-pubsub/node_modules/@google-cloud/pubsub/build/src/message-stream.js:271:18)
      at ClientDuplexStreamImpl.<anonymous> ([...]/Development/temp/testing-pubsub/node_modules/@google-cloud/pubsub/build/src/message-stream.js:143:44)
      at Object.onceWrapper (events.js:422:26)
      at ClientDuplexStreamImpl.emit (events.js:315:20)
      at Object.onReceiveStatus ([...]/Development/temp/testing-pubsub/node_modules/@grpc/grpc-js/build/src/client.js:391:24)
      at Object.onReceiveStatus ([...]/Development/temp/testing-pubsub/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:303:181)
      at Http2CallStream.outputStatus ([...]/Development/temp/testing-pubsub/node_modules/@grpc/grpc-js/build/src/call-stream.js:114:27)
      at Http2CallStream.maybeOutputStatus ([...]/Development/temp/testing-pubsub/node_modules/@grpc/grpc-js/build/src/call-stream.js:153:22)
      at Http2CallStream.endCall ([...]/Development/temp/testing-pubsub/node_modules/@grpc/grpc-js/build/src/call-stream.js:140:18) {
    code: 1,
    details: 'Call cancelled',
    metadata: Metadata { internalRepr: Map(0) {}, options: {} }
  },
  topicName: 'test-topic',
  subscriptionName: 'test-topic-subscription'
}
product-auto-label bot added the api: pubsub label on Apr 22, 2020
feywind added the priority: p2 and type: bug labels on Apr 22, 2020
@feywind
Collaborator

feywind commented Apr 22, 2020

@hongalex Hey Alex, these default / retry updates were something you worked on. Did you have any thoughts on the change that led to this?

@hongalex
Member

Thanks for filing this issue! Indeed, the timeout for a streaming pull stream is 15 minutes, but it shouldn't be returning a CANCELLED status code.

We removed CANCELLED as a retryable error per https://aip.dev/194, so we're currently investigating where the status is being surfaced from.

@feywind
Collaborator

feywind commented Apr 24, 2020

I haven't been able to get the repro code to work on a local emulator (it doesn't time out) but I'll try again tomorrow with the real server. There's some thought that there may be a race condition between the client library timing out and the server timing out. There's also this possibility:

grpc/grpc-node#1391

yoshi-automation added and removed the 🚨 This issue needs some love. label on Jul 23, 2020
@gregberge

Hello, we experienced the same problem in our application. Any news on that?

@jeffijoe
Contributor

It seems that instead of the Call cancelled error, it now just stops the subscription; however, for some reason the close event is not triggered?

@giuliano-barberi-tf

@feywind Can we get this issue bumped up to a p1 and prioritized? It seems a subscription that is idle for 15 minutes just completely loses connectivity and the client doesn't properly reconnect. Otherwise, if we are supposed to handle reconnecting ourselves then can we get some guidance on how to do that? Thank you!

feywind self-assigned this on Sep 15, 2020
@jvvcn

jvvcn commented Oct 1, 2020

Having the same error in the same situation after 15 minutes of retries. Do you have any updates on this issue?

@feywind
Collaborator

feywind commented Oct 1, 2020

Hi everyone,

We've been looking at this today, and it seems like it's likely to be something specific to the Node library. Unfortunately I've also been unable to reproduce it locally, across 3 different Node versions, which leads me to wonder if there's something going on between Node and GKE.

Re: prioritization, that is indeed happening right now. If this is actively blocking you, you might try the workaround suggested in the original post (adding CANCELLED back to the retry list), but I can't really recommend that as a long-term fix. I think we were just papering over an issue before.

@jeffijoe
Contributor

jeffijoe commented Oct 1, 2020

I am on Node 14.8, macOS Catalina and using the PubSub emulator in Docker (cloudfind/google-cloud-emulators:latest).

Re: CANCELLED, it used to throw the CANCELLED error, but now it just disconnects silently with no way to retry it in userland.

@feywind
Collaborator

feywind commented Oct 2, 2020

Those are good data points, thank you!

@feywind
Collaborator

feywind commented Oct 2, 2020

I've duplicated that setup here, and it seems I still can't reproduce it.

I'm using an almost unmodified version of the original repro above. So it seems like there has got to be some kind of difference in environment, if it's consistent for you all. Here's what I have:

  • macOS 10.15.6
  • Node 14.8.0
  • @google-cloud/pubsub 2.5.0
  • Fully removed node_modules and npm i
  • Docker macOS desktop 2.4.0.0 (engine 19.03.13)
  • cloudfind/google-cloud-emulators latest 3501167e9a44 3 days ago 2.23GB
  • Running the container like so: docker run -it -p 8085:8085 --rm cloudfind/google-cloud-emulators:latest
  • Inside the container: gcloud beta emulators pubsub start --host-port=172.17.0.2:8085 --project=local-dev
  • The test sets up correctly with handlers added to subscription, and I do see messages in the emulator log to that effect.
  • Wait half an hour or more
  • No new messages in the console for the sample.

This log pops up in the emulator window, but I'm unsure if it's related. Does it look familiar?

[pubsub] io.netty.handler.codec.http2.Http2Exception$StreamException: Cannot create stream 29 greater than Last-Stream-ID 27 from GOAWAY.

Hopefully there's something useful in there and we can compare notes to see what's potentially different.

@mrothroc

mrothroc commented Oct 3, 2020

We have this issue in GKE. Kubernetes and the nodes are all on 1.16.13-gke.401.

Our container is FROM node:10-alpine. We are using v2.1.0 of @google-cloud/pubsub.

This is our production workload. We have an active support contract and I am happy to file a ticket if it would be helpful.

@giuliano-barberi-tf

I was able to reproduce this with both the Node.js and .NET pubsub clients, actually with the same setup as you, @feywind. It doesn't happen when I use the pubsub emulator directly in the CLI. The only difference is I am on 10.15.3.

@jeffijoe
Contributor

jeffijoe commented Oct 5, 2020

Same here; it appears it works when using the emulator on the host machine directly, but not through Docker.

@feywind
Collaborator

feywind commented Oct 7, 2020

@mrothroc It might, actually. Let me see if I can get you in touch with the TSE that's looking at this.

@feywind
Collaborator

feywind commented Oct 7, 2020

Also @giuliano-barberi-tf do you have a .NET snippet that you're using for reproducing it? That might be a helpful data point too.

@giuliano-barberi-tf

giuliano-barberi-tf commented Oct 7, 2020

Also @giuliano-barberi-tf do you have a .NET snippet that you're using for reproducing it? That might be a helpful data point too.

I do not have something easy to share. I'm just seeing this in my application running locally, but it's fairly vanilla, just using SubscriberClient. Something like the sample code at https://googleapis.github.io/google-cloud-dotnet/docs/Google.Cloud.PubSub.V1/ will work to repro it, and you can see it stops listening after the StartAsync call:

await subscriber.StartAsync(...)
Console.WriteLine("he's dead jim");

After 15 minutes I see the output

Thanks!

@jeffijoe
Contributor

jeffijoe commented Oct 8, 2020

@feywind here are the full logs as discussed. Looking through them myself around the end, I noticed

D1008 17:40:01.637550000 4599324096 call.cc:734]                       {"created":"@1602193201.637495000","description":"Error received from peer ipv6:[::1]:10050","file":"../deps/grpc/src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"Received RST_STREAM with error code 8","grpc_status":1}

Which AFAIK is CANCELLED.

log.txt

@feywind
Collaborator

feywind commented Oct 9, 2020

@jeffijoe Do you mind sending me the package.json and the log-patched test code you're using? I think I see something interesting here, but I still want to try reproducing it here before guessing.

@jeffijoe
Contributor

jeffijoe commented Oct 9, 2020

@feywind of course! I attached all files as a zip.

The apiEndpoint is configured in client.js.

repro.zip

@feywind
Collaborator

feywind commented Oct 10, 2020

Thanks! I'll take a look Monday.

@feywind
Collaborator

feywind commented Oct 16, 2020

@jeffijoe Hey Jeff,

Somehow, even using your repro files with the lockfile and all, I haven't been able to get it to reproduce here. It ran for almost an hour with no errors, and at the end of that, I could still publish to it using the pubber.js.

Considering how similar our test environments are now, I'm starting to wonder if there's some sort of firewall issue that's causing grpc problems with an idle connection. Either way, I'll keep digging on it tomorrow. Thanks for the patience.

@mrothroc

We are running in GKE, and as far as I know there is no firewall. Not sure what sits inside Google Cloud between GKE and pubsub, but we don't have anything.

@mrothroc

We just had it happen again. Everything will be fine, then it suddenly isn't. Red bars show the issue. I left a partial stack trace in the image, just in case that's useful.
[Screenshot: Screen Shot 2020-10-16 at 8 38 57 AM]

@feywind
Collaborator

feywind commented Oct 16, 2020

Ah yeah, that actually is pretty interesting. Thanks!

Also, within a Mac running Docker with the emulator, there wouldn't be any external firewalls. I know sometimes rules are set on individual machines by corporate policies or whatnot. Those would naturally be different between us, so I don't know if that would cause the variance that makes it hard for me to reproduce. GKE might do something similar?

Also I think that whether we can reproduce it or not, ultimately the error handling in the client library could probably help with this. So I might start thinking in that direction.

@feywind
Collaborator

feywind commented Oct 16, 2020

Hey, good(?) news!

2020-10-16T21:50:36.408Z | call_stream | [2] HTTP/2 stream closed with code 8
2020-10-16T21:50:36.408Z | call_stream | [2] ended with status: code=1 details="Call cancelled"
2020-10-16T21:50:36.409Z | subchannel | 127.0.0.1:8085 callRefcount 5 -> 4
STREAM unpipe 2020-10-16T21:50:36.409Z was running for 00:14:58.988

I only saw this with grpc-js, which is also an interesting data point. I'll be bugging a grpc person about this probably.
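
For anyone who wants to capture the same kind of trace, grpc-js has built-in logging that can be switched on through environment variables; assuming a POSIX shell and a repro script named sub.js (the file name here is just a placeholder), something like:

GRPC_VERBOSITY=DEBUG GRPC_TRACE=call_stream,subchannel node sub.js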

@feywind
Collaborator

feywind commented Oct 19, 2020

Quick update, that conversation is happening now.

@giuliano-barberi-tf

Quick update, that conversation is happening now.

Thank you for working so hard on this issue!

yoshi-automation added the 🚨 This issue needs some love. label on Oct 19, 2020
@feywind
Collaborator

feywind commented Oct 19, 2020

Hey, sorry it's taken so long!

I've decided for now that I'm going to re-add CANCELLED to the retry list to get everyone back and going. I'll make a separate issue for fixing whatever is causing the CANCELLED.

@feywind
Collaborator

feywind commented Oct 20, 2020

Waiting for release and new issue creation here.

@feywind
Collaborator

feywind commented Oct 22, 2020

@jeffijoe 2.6.0 should now be available, adding CANCELLED back to the retry codes. This is just an interim thing to get you going again. If the cancels are happening, it might be causing a secondary issue where the emulator becomes unhappy about too many connections (after a long time). I'm still investigating why it's happening in the first place in #1135.

@jeffijoe
Contributor

Will give it a try! 🙏

@feywind
Collaborator

feywind commented Oct 27, 2020

I'm going to go ahead and close this for now, please feel free to re-open if 2.6.0 didn't get you going again. I'm going to continue my investigation of why the disconnects are happening over in #1135.

feywind closed this as completed on Oct 27, 2020
@jeffijoe
Contributor

It's definitely more stable now, but I do have cases where I still need to restart.

As an FYI this gets logged occasionally, usually after waking up my computer:

Connection to [object Object] rejected by server because of excess pings

@zombieleet

We are currently experiencing this same issue using the gcloud-sdk Docker image. What's the possible fix? @feywind

@jeffijoe
Contributor

jeffijoe commented Jan 5, 2021

Sorry to be that guy, but we're still having this issue. Disregard my earlier comment about it being more stable; it's been the same, with the exception of now seeing a console.log of Connection to [object Object] rejected by server because of excess pings, but still no event being emitted that would at least let me reconnect manually.

@vqcheslav-kara

Hello,
we also get this error in production, and it is very annoying because it causes downtime for us:

StatusError: Call cancelled
    at MessageStream._onEnd (/home/api-gateway/build/node_modules/@google-cloud/pubsub/build/src/message-stream.js:234:26)
    at MessageStream._onStatus (/home/api-gateway/build/node_modules/@google-cloud/pubsub/build/src/message-stream.js:271:18)
    at ClientDuplexStreamImpl.<anonymous> (/home/api-gateway/build/node_modules/@google-cloud/pubsub/build/src/message-stream.js:143:44)
    at Object.onceWrapper (events.js:422:26)
    at ClientDuplexStreamImpl.emit (events.js:315:20)
    at Object.onReceiveStatus (/home/api-gateway/build/node_modules/@google-cloud/pubsub/node_modules/@grpc/grpc-js/build/src/client.js:392:24)
    at Object.onReceiveStatus (/home/api-gateway/build/node_modules/@google-cloud/pubsub/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:299:181)
    at /home/api-gateway/build/node_modules/@google-cloud/pubsub/node_modules/@grpc/grpc-js/build/src/call-stream.js:145:78
    at processTicksAndRejections (internal/process/task_queues.js:79:11) {
  code: 1,
  details: 'Call cancelled',
  metadata: Metadata { internalRepr: Map {}, options: {} }
}

This is the code where we get the error:

subscription.on("error", async (msg) => { console.error(msg); });

The options that we set on the subscription:

{ streamingOptions: { maxStreams: 1 } }

We have ~50 subscriptions, and the Node.js services are running in Compute Engine.
