-
Notifications
You must be signed in to change notification settings - Fork 68
[SNOW-2464956] Close client on primary deployment change #1078
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
f3c1432 to
0e9fbbf
Compare
...ain/java/net/snowflake/ingest/streaming/internal/SnowflakeStreamingIngestClientInternal.java
Outdated
Show resolved
Hide resolved
0e9fbbf to
186b94a
Compare
a26d901 to
4c4c918
Compare
sfc-gh-psaha
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am onboard with the high level approach - I'm delegating to @sfc-gh-hmadan for the review.
...ain/java/net/snowflake/ingest/streaming/internal/SnowflakeStreamingIngestClientInternal.java
Outdated
Show resolved
Hide resolved
| .forEach( | ||
| channelStatus -> { | ||
| if (channelStatus.getStatusCode() != RESPONSE_SUCCESS) { | ||
| if (isTerminalError(channelStatus.getStatusCode())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are expecting that the service response will still be a valid HTTP 200 response, on which response.getBlobsStatus can successfully fire.
In this situation of failover, I'd expect the service to return. non-200 response (at line 626 : snowflakeServiceClient.registerBlob(request, executionCount)).
Why would we make the service return a HTTP 200, with the inner per-chunk details inside the response payload have this terminal error code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This turned out to be a rabbit hole. The DEPLOYMENT_ID_MISMATCH is determined while processing chunks. There are many chunks per request. Therefore, these errors accumulate into a 200 OK response object. Meanwhile, the INVALID_ENCRYPTION_KEY error is checked immediately and thrown as a 400 BAD REQUEST. So I split them up and am now handling them separately. PTAL.
| * @param statusCode the server response status code | ||
| * @return true if terminal error, false otherwise | ||
| */ | ||
| private static boolean isTerminalError(long statusCode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if we remove the private qualifier can RegisterService also call this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're different signatures. The errors coming from deeper in the SDK stack (like staging location mismatch) are thrown as SFExceptions. Meanwhile, this method is dealing with response codes from the server JSON response object.
| + "}", | ||
| dbName, schemaName, tableName, channelName, channelSequencer); | ||
|
|
||
| apiOverride.addSerializedJsonOverride( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets add a true E2E for validating this change works as expected, without any mocks in the picture?
Please check with Alec on how to do so. SDK E2E will only work against prod, by the way.
|
Thanks for working on this @sfc-gh-aminyaylov! I didn't go through the actual changes, but from the description it might affect the preprod replication testing which has a test case for client redirect. Could you help check if any update to the test is needed when releasing the new SDK? Thanks in advance! https://dp-telemetry-and-streaming-ingest-001.jenkinsdev1.us-west-2.aws-dev.app.snowflake.com/job/SSV1ReplicationTestClientRedirect/ |
a486b2b to
0505045
Compare
Yeah, good point. The flow is identical except for step 6, where we close the client instead of invalidating the channel. The customer will get a I'll update the tests. |
| && ire.getErrorBody().getCode() != null | ||
| && ire.getErrorBody() | ||
| .getCode() | ||
| .equals( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to do a case-insensitive string comparison? Or is every other place in the SDK doing a case sensitive check anyway?

Problem
When the primary deployment URL is updated, the client throws a
CHANNEL_INVALID error. The user responds by re-opening the channel, which succeeds. However, ingestion continues to fail due to the deployment mismatch.Solution
Client is automatically closed if it detects the primary deployment has changed, forcing the user to recreate the client.
The following errors indicate a primary deployment reconfiguration and will immediately trigger a client close: