Configurable max rows per streaming request #237

Open · FreCap wants to merge 2 commits into master

Conversation

@FreCap commented Sep 15, 2022

Due to BigQuery streaming insert limitations, the maximum request size is 10 MB.

Hence, considering that one record takes at least 20 bytes on average, with big batches (e.g. 500,000 records) we might have to send BigQuery several requests that each come back with Request Too Large before finding the right size.

This config allows starting from a lower row count altogether and reduces the number of failed requests. It only works with the simple TableWriter (no GCS). A minimal sketch of the idea follows the exception list below.

Otherwise this can lead to exceptions such as:

BigQueryException: Request payload size exceeds the limit: 10485760 bytes.
BigQueryException: Unexpected end of file from server
BigQueryException: Remote host terminated the handshake
BigQueryException: Error writing request body to server
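For illustration only, a minimal sketch of the idea, not the PR's actual implementation: cap the number of rows per insertAll call using the google-cloud-bigquery client. The class name and write method here are hypothetical.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;

import java.util.List;
import java.util.Map;

// Hypothetical class; in the PR the cap is applied inside the connector's TableWriter.
public class ChunkedStreamingWriter {
  private final BigQuery bigQuery;
  private final int maxRowsPerRequest; // value of bqStreamingMaxRowsPerRequest

  public ChunkedStreamingWriter(BigQuery bigQuery, int maxRowsPerRequest) {
    this.bigQuery = bigQuery;
    this.maxRowsPerRequest = maxRowsPerRequest;
  }

  // Split a large batch into chunks so each insertAll request stays
  // under the 10 MB streaming-insert payload limit.
  public void write(TableId table, List<Map<String, Object>> rows) {
    for (int start = 0; start < rows.size(); start += maxRowsPerRequest) {
      int end = Math.min(start + maxRowsPerRequest, rows.size());
      InsertAllRequest.Builder request = InsertAllRequest.newBuilder(table);
      for (Map<String, Object> row : rows.subList(start, end)) {
        request.addRow(row);
      }
      InsertAllResponse response = bigQuery.insertAll(request.build());
      if (response.hasErrors()) {
        throw new RuntimeException("Insert errors: " + response.getInsertErrors());
      }
    }
  }
}
```

With maxRowsPerRequest = 50,000 and ~20-byte records, each request carries roughly 1 MB, comfortably below the 10 MB cap, instead of probing for the right size with failing oversized requests.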

@FreCap requested a review from a team as a code owner September 15, 2022 00:24
@b-goyal (Member) left a comment:


Thanks for the PR, @FreCap, and apologies for the delay in reviewing. I have left a few comments; please take a look when you get a chance.

@@ -93,6 +93,18 @@ public class BigQuerySinkConfig extends AbstractConfig {
"The interval, in seconds, in which to attempt to run GCS to BQ load jobs. Only relevant "
+ "if enableBatchLoad is configured.";

public static final String BQ_STREAMING_MAX_ROWS_PER_REQUEST_CONFIG = "bqStreamingMaxRowsPerRequest";

Member: Can we rename this to maxRowsPerRequest?

public static final String BQ_STREAMING_MAX_ROWS_PER_REQUEST_CONFIG = "bqStreamingMaxRowsPerRequest";
private static final ConfigDef.Type BQ_STREAMING_MAX_ROWS_PER_REQUEST_TYPE = ConfigDef.Type.INT;
private static final Integer BQ_STREAMING_MAX_ROWS_PER_REQUEST_DEFAULT = 50000;

Member: Let's keep the default behaviour the same. We can use -1 to mean "disabled" and have that as the default.
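For illustration, a hedged sketch of how a -1 sentinel could be interpreted at write time (hypothetical snippet, not from this PR; config here is assumed to be a BigQuerySinkConfig instance):

```java
// -1 keeps today's behaviour: no cap, send the whole batch in one request.
int configured = config.getInt(BigQuerySinkConfig.BQ_STREAMING_MAX_ROWS_PER_REQUEST_CONFIG);
int effectiveMaxRows = (configured == -1) ? Integer.MAX_VALUE : configured;
```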

private static final ConfigDef.Importance BQ_STREAMING_MAX_ROWS_PER_REQUEST_IMPORTANCE = ConfigDef.Importance.LOW;
private static final String BQ_STREAMING_MAX_ROWS_PER_REQUEST_DOC =

Member: Suggested doc text: "The maximum number of rows to be sent in one batch in the request payload to BigQuery. This can reduce the number of failed calls due to Request Too Large errors if the payload exceeds BigQuery's quota limits (https://cloud.google.com/bigquery/quotas#write-api-limits). Setting it to a low value can result in degraded performance of the connector."

"that would return a `Request Too Large` before finding the right size. " +
"This config allows starting from a lower value altogether and reduce the amount of failed requests. " +
"Only works with simple TableWriter (no GCS)";


Member: Let's add a validator as well with the minimum and maximum values allowed (a sketch follows):

-1 -> default (disabled)
1 -> min
50,000 -> max (https://cloud.google.com/bigquery/quotas#write-api-limits)
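A hedged sketch of what that validator and its registration might look like with Kafka's ConfigDef API; the class and helper names are hypothetical, and this is not the PR's code:

```java
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.common.config.ConfigException;

// Hypothetical sketch mirroring BQ_STREAMING_MAX_ROWS_PER_REQUEST_CONFIG.
public class MaxRowsValidatorSketch {
  // Accept -1 (disabled, the proposed default) or a value in [1, 50000],
  // BigQuery's documented per-request row limit for streaming inserts.
  static final ConfigDef.Validator MAX_ROWS_VALIDATOR = (name, value) -> {
    int rows = (Integer) value;
    if (rows != -1 && (rows < 1 || rows > 50_000)) {
      throw new ConfigException(name, value,
          "Must be -1 (disabled) or between 1 and 50000");
    }
  };

  static ConfigDef defineMaxRows(ConfigDef configDef) {
    // ConfigDef.define has an overload that takes a Validator
    // between the default value and the importance.
    return configDef.define(
        "bqStreamingMaxRowsPerRequest",
        ConfigDef.Type.INT,
        -1, // proposed default: disabled
        MAX_ROWS_VALIDATOR,
        ConfigDef.Importance.LOW,
        "Maximum number of rows per streaming insert request; -1 disables the cap.");
  }
}
```

Note that Kafka validates the default value against the validator at define time, which is why the -1 sentinel must be accepted explicitly rather than using a plain ConfigDef.Range.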

@FreCap (Author) commented Aug 28, 2023 via email
