Skip to content

Conversation

@sfc-gh-aminyaylov
Copy link
Contributor

@sfc-gh-aminyaylov sfc-gh-aminyaylov commented Oct 14, 2025

Problem

When the primary deployment URL is updated, the client throws a CHANNEL_INVALID​ error. The user responds by re-opening the channel, which succeeds. However, ingestion continues to fail due to the deployment mismatch.

Solution

Client is automatically closed if it detects the primary deployment has changed, forcing the user to recreate the client.

The following errors indicate a primary deployment reconfiguration and will immediately trigger a client close:

  1. Client-side deployment ID mismatch: we are uploading to a different bucket location. The bucket location change occurs during periodic storage credential refresh, which calls Snowflake using the current URL.
  2. Server-side deployment ID mismatch: we are registering a file with the wrong deployment
  3. Server-side encryption key mismatch: we are registering a file with an invalid encryption key

@sfc-gh-aminyaylov sfc-gh-aminyaylov force-pushed the aminyaylov-client-reconfigure branch from f3c1432 to 0e9fbbf Compare October 14, 2025 02:17
@sfc-gh-aminyaylov sfc-gh-aminyaylov force-pushed the aminyaylov-client-reconfigure branch from 0e9fbbf to 186b94a Compare October 22, 2025 01:16
@sfc-gh-aminyaylov sfc-gh-aminyaylov changed the title Draft client reconfigure mechanism on client redirect Close client on Oct 22, 2025
Copy link
Contributor Author

sfc-gh-aminyaylov commented Oct 22, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

@sfc-gh-aminyaylov sfc-gh-aminyaylov changed the title Close client on [SNOW-2193898] Close client on primary deployment change Oct 22, 2025
@sfc-gh-aminyaylov sfc-gh-aminyaylov marked this pull request as ready for review October 22, 2025 01:40
@sfc-gh-aminyaylov sfc-gh-aminyaylov force-pushed the aminyaylov-client-reconfigure branch from a26d901 to 4c4c918 Compare October 22, 2025 01:41
Copy link
Contributor

@sfc-gh-psaha sfc-gh-psaha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am onboard with the high level approach - I'm delegating to @sfc-gh-hmadan for the review.

.forEach(
channelStatus -> {
if (channelStatus.getStatusCode() != RESPONSE_SUCCESS) {
if (isTerminalError(channelStatus.getStatusCode())) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are expecting that the service response will still be a valid HTTP 200 response, on which response.getBlobsStatus can successfully fire.
In this situation of failover, I'd expect the service to return. non-200 response (at line 626 : snowflakeServiceClient.registerBlob(request, executionCount)).
Why would we make the service return a HTTP 200, with the inner per-chunk details inside the response payload have this terminal error code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This turned out to be a rabbit hole. The DEPLOYMENT_ID_MISMATCH is determined while processing chunks. There are many chunks per request. Therefore, these errors accumulate into a 200 OK response object. Meanwhile, the INVALID_ENCRYPTION_KEY error is checked immediately and thrown as a 400 BAD REQUEST. So I split them up and am now handling them separately. PTAL.

* @param statusCode the server response status code
* @return true if terminal error, false otherwise
*/
private static boolean isTerminalError(long statusCode) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if we remove the private qualifier can RegisterService also call this method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're different signatures. The errors coming from deeper in the SDK stack (like staging location mismatch) are thrown as SFExceptions. Meanwhile, this method is dealing with response codes from the server JSON response object.

+ "}",
dbName, schemaName, tableName, channelName, channelSequencer);

apiOverride.addSerializedJsonOverride(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add a true E2E for validating this change works as expected, without any mocks in the picture?
Please check with Alec on how to do so. SDK E2E will only work against prod, by the way.

@sfc-gh-cqu
Copy link

Thanks for working on this @sfc-gh-aminyaylov! I didn't go through the actual changes, but from the description it might affect the preprod replication testing which has a test case for client redirect. Could you help check if any update to the test is needed when releasing the new SDK? Thanks in advance! https://dp-telemetry-and-streaming-ingest-001.jenkinsdev1.us-west-2.aws-dev.app.snowflake.com/job/SSV1ReplicationTestClientRedirect/
https://github.com/snowflakedb/snowflake/blob/599c7a27370fca42604e1587dffdf8cf5e332678/Tests/system_tests/snowpipe_streaming_replication_test/ssv1_replication_test/tests/ssv1_replication_client_redirect_test.py

@sfc-gh-aminyaylov sfc-gh-aminyaylov changed the title [SNOW-2193898] Close client on primary deployment change [SNOW-2464956] Close client on primary deployment change Oct 27, 2025
Copy link
Contributor Author

Thanks for working on this @sfc-gh-aminyaylov! I didn't go through the actual changes, but from the description it might affect the preprod replication testing which has a test case for client redirect. Could you help check if any update to the test is needed when releasing the new SDK? Thanks in advance! https://dp-telemetry-and-streaming-ingest-001.jenkinsdev1.us-west-2.aws-dev.app.snowflake.com/job/SSV1ReplicationTestClientRedirect/
https://github.com/snowflakedb/snowflake/blob/599c7a27370fca42604e1587dffdf8cf5e332678/Tests/system_tests/snowpipe_streaming_replication_test/ssv1_replication_test/tests/ssv1_replication_client_redirect_test.py

Yeah, good point. The flow is identical except for step 6, where we close the client instead of invalidating the channel. The customer will get a CLOSED_CLIENT error if making any additional calls before recreating.

I'll update the tests.

&& ire.getErrorBody().getCode() != null
&& ire.getErrorBody()
.getCode()
.equals(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to do a case-insensitive string comparison? Or is every other place in the SDK doing a case sensitive check anyway?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants