Support per-catalog configuration for project and location for BigQueryCatalog #1434
Conversation
Summary of Changes
Hello @BoqianShi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the BigQueryCatalog by introducing support for per-catalog configuration of BigQuery project IDs and locations. This allows Spark users to define a specific BigQuery project and geographical region for each registered catalog, providing greater flexibility when interacting with diverse BigQuery datasets. The change streamlines workflows that read from public datasets in one project and write to tables in a different project or region, eliminating the need to manage global overrides and simplifying complex data pipelines.
Code Review
This pull request adds support for per-catalog configuration of project and location for the BigQueryCatalog. The changes correctly plumb the new configuration options through to the BigQueryClient used by the catalog for namespace-level operations. However, there is a significant issue where these catalog-level configurations are not propagated to the Table objects returned by loadTable and createTable. This means that table-level operations, such as reads and writes, will not use the specified project or location from the catalog configuration, making the feature incomplete. I've added a detailed comment with a suggested fix for this issue.
...v2/spark-3.5-bigquery-lib/src/main/java/com/google/cloud/spark/bigquery/BigQueryCatalog.java (outdated review comment, resolved)
public void createDataset(DatasetId datasetId, Map<String, String> metadata) {
  DatasetInfo.Builder datasetInfo = DatasetInfo.newBuilder(datasetId);
  // Apply the location configured on the BigQuery client (the catalog's location, if set) to the new dataset.
  Optional.ofNullable(bigQuery.getOptions().getLocation()).ifPresent(datasetInfo::setLocation);
Please add a comment for this line explaining why it is there.
In a non-catalog scenario, both BigQueryOptions.quotaProjectId and BigQueryOptions.projectId will be set to the parent projectId, as before.
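A minimal illustrative sketch of the two scenarios described above, using the public BigQueryOptions builder; the helper names (nonCatalogClient, catalogClient) are hypothetical and this is not the connector's actual code, only an assumption about how the parent (billing) project, catalog project, and catalog location could map onto the client options:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;

class ClientOptionsSketch {

  // Non-catalog scenario: both the data project and the quota (billing) project
  // fall back to the parent project, as before.
  static BigQuery nonCatalogClient(String parentProjectId) {
    return BigQueryOptions.newBuilder()
        .setProjectId(parentProjectId)
        .setQuotaProjectId(parentProjectId)
        .build()
        .getService();
  }

  // Catalog scenario (assumed): the catalog's projectId and location determine where
  // data lives, while the parent project is still the one used for quota/billing.
  static BigQuery catalogClient(
      String parentProjectId, String catalogProjectId, String catalogLocation) {
    return BigQueryOptions.newBuilder()
        .setProjectId(catalogProjectId)
        .setQuotaProjectId(parentProjectId)
        .setLocation(catalogLocation)
        .build()
        .getService();
  }
}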
config.parentProjectId =
    getAnyOption(globalOptions, options, "parentProject").or(defaultBilledProject());
config.catalogProjectId = getOption(options, "projectId");
config.catalogLocation = getOption(options, "location");
Is the parameter location or bigquery_location?
com.google.common.base.Optional<String> temporaryGcsBucket = empty();
com.google.common.base.Optional<String> persistentGcsBucket = empty();
com.google.common.base.Optional<String> persistentGcsPath = empty();
com.google.common.base.Optional<String> catalogProjectId = empty();
Add a comment explaining why we need a separate set of properties.
    .setRetrySettings(config.getBigQueryClientRetrySettings())
    .setUniverseDomain(bigQueryCredentialsSupplier.getUniverseDomain());

// Determine the project for data location (the catalog's project, if specified)
What happens if this is used in a non-catalog scenario?
This PR enhances the BigQueryCatalog by adding support for specifying a project and location directly in the Spark configuration. This allows users to easily register and interact with BigQuery datasets from different projects and locations without having to override the global parentProjectId.
This makes it easier to work with datasets in different projects and regions. For example, you can now read from a public dataset in one project and write to a table in a specific location (e.g., "US") in your own project.
How to Use
Users can now configure a BigQuery catalog with a specific project and location like this:
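A minimal sketch of such a configuration is shown below. The catalog name (bq_catalog), project (my-data-project), and dataset/table identifiers are placeholders; the implementation class is assumed from the file path above, and the projectId and location option keys are taken from the diff (per the open review question, the location key may end up being named differently):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BigQueryCatalogExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("bigquery-catalog-example")
        // Register a BigQuery catalog under the name "bq_catalog".
        .config("spark.sql.catalog.bq_catalog", "com.google.cloud.spark.bigquery.BigQueryCatalog")
        // Per-catalog project and location (option keys as read in the diff above).
        .config("spark.sql.catalog.bq_catalog.projectId", "my-data-project")
        .config("spark.sql.catalog.bq_catalog.location", "US")
        .getOrCreate();

    // Read through the catalog using a three-part identifier: catalog.dataset.table.
    Dataset<Row> rows =
        spark.sql("SELECT * FROM bq_catalog.my_dataset.my_table LIMIT 10");
    rows.show();
  }
}

With a catalog registered this way, reads and writes issued through bq_catalog would target the configured project and location without overriding the global parentProject.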