[SPARK-52109] Add listTableSummaries API to Data Source V2 Table Catalog API #50886


Open · wants to merge 10 commits into base: master
Conversation

@urosstan-db (Contributor) commented May 14, 2025

What changes were proposed in this pull request?

  • Add a new API, listTableSummaries, to the DSv2 TableCatalog class. The return value is an array of TableSummary objects.
  • Added a table type property that represents the type of a v2 table.
  • When a v1 table is converted to a v2 table, the table type property is also set on the v2 table.
  • JDBCTableCatalog invokes listTables and treats all tables as foreign. That way, only one SQL command against the remote system is needed to get the table summaries.

Why are the changes needed?

Since the DSv2 Table class can represent different table types, we currently need to do listTables + loadTable operations, which can be expensive.

I propose adding a new interface that lists table summaries: a smaller amount of data that is still sufficient for the SHOW TABLES command.
Many remote systems (and implementors of DSv2 TableCatalog) can implement the newly added API with just one RPC.
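To illustrate the saving, here is a minimal, self-contained sketch. All types below are simplified stand-ins for the real DSv2 interfaces, not Spark code: answering SHOW TABLES via listTables plus a loadTable per table costs N+1 remote calls, while a listTableSummaries-style call costs one.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the real catalog/identifier types; not Spark code.
class FakeCatalog {
    int remoteCalls = 0;  // counts round trips to the remote system
    private final List<String> tables = List.of("t1", "t2", "t3");

    List<String> listTables() { remoteCalls++; return tables; }

    String loadTableType(String name) { remoteCalls++; return "TABLE"; }

    // A listTableSummaries-style call: one round trip returns name + type pairs.
    List<String[]> listTableSummaries() {
        remoteCalls++;
        List<String[]> out = new ArrayList<>();
        for (String t : tables) out.add(new String[] {t, "TABLE"});
        return out;
    }
}
```

With three tables, the listTables + loadTable path issues four remote calls, while the summary call issues one, which is the cost model the PR description argues from.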

Does this PR introduce any user-facing change?

A new API, listTableSummaries, is added to TableCatalog.

How was this patch tested?

Added new test cases in existing suites.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label May 14, 2025
default TableSummary[] listTableSummaries(String[] namespace) throws NoSuchNamespaceException {
  // By default, we assume that all tables have the standard table type.
  return Arrays.stream(this.listTables(namespace))
      .map(identifier -> new TableSummary(identifier, TableSummary.REGULAR_TABLE_TYPE))
      .toArray(TableSummary[]::new);
}
urosstan-db (Contributor, author):

Another default implementation would be to load the tables and check each table's properties to deduce its type.
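That alternative could look roughly like the sketch below. The SimpleTable and AltCatalog types are simplified stand-ins, not the real Spark interfaces, and the "table-type" property key is a hypothetical name chosen for the example:

```java
import java.util.Map;

// Simplified stand-ins for the DSv2 types; not the real Spark interfaces.
interface SimpleTable { Map<String, String> properties(); }

interface AltCatalog {
    String[] listTables(String[] namespace);
    SimpleTable loadTable(String name);

    // Alternative default: load every table and read a (hypothetical)
    // "table-type" property to deduce the type, falling back to "TABLE".
    default String[][] listTableSummaries(String[] namespace) {
        String[] names = listTables(namespace);
        String[][] summaries = new String[names.length][];
        for (int i = 0; i < names.length; i++) {
            String type = loadTable(names[i]).properties()
                .getOrDefault("table-type", "TABLE");
            summaries[i] = new String[] {names[i], type};
        }
        return summaries;
    }
}
```

The trade-off the thread discusses is visible here: this default needs one loadTable round trip per table, whereas the listTables-based default needs only one call in total.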

package org.apache.spark.sql.connector.catalog;

import static com.google.common.base.Preconditions.checkNotNull;

urosstan-db (Contributor, author):

I made a new class here as the return type to be more flexible for future changes, e.g. to avoid deprecating the method if we need to introduce a third property of the table summary.

pan3793 (Member):

I think adding new fields to a record class would create breaking changes too. For example, a few fields would need to be added to make the API satisfy the ThriftServer case:

private static final TableSchema RESULT_SET_SCHEMA = new TableSchema()
.addStringColumn("TABLE_CAT", "Catalog name. NULL if not applicable.")
.addStringColumn("TABLE_SCHEM", "Schema name.")
.addStringColumn("TABLE_NAME", "Table name.")
.addStringColumn("TABLE_TYPE", "The table type, e.g. \"TABLE\", \"VIEW\", etc.")
.addStringColumn("REMARKS", "Comments about the table.")
.addStringColumn("TYPE_CAT", "The types catalog.")
.addStringColumn("TYPE_SCHEM", "The types schema.")
.addStringColumn("TYPE_NAME", "Type name.")
.addStringColumn("SELF_REFERENCING_COL_NAME",
"Name of the designated \"identifier\" column of a typed table.")
.addStringColumn("REF_GENERATION",
"Specifies how values in SELF_REFERENCING_COL_NAME are created.");

urosstan-db (Contributor, author):

Yeah, adding new fields would be a breaking change for object creation, but not for TableCatalog.
Adding too many fields can make this API more expensive, not only because of the fetch time from remote catalogs/servers (which is less important), but because certain implementors may not return all fields from their listTableSummaries implementations (e.g. the JDBC API).
Which options do you suggest adding?

urosstan-db (Contributor, author) commented May 16, 2025:

Added the Evolving annotation here until we make this class stable, so other fields can easily be added in the future.

pan3793 (Member) commented May 16, 2025:

certain implementors may not return all fields by their listTableSummaries implementations

@urosstan-db it's fine to leave the unsupported fields as NULL; the Spark Thrift Server I mentioned is a concrete example that could benefit from this API directly. Given that STS is a built-in Spark service, it's better to cover that case than nothing. (I suppose you are adding this API for external system usage.)

Which options do you suggest to add?

TableSummary can be an interface, with a default implementation

interface TableSummary {
  Identifier identifier();
  String tableType();
}
case class TableSummaryImpl(identifier: Identifier, tableType: String) extends TableSummary

Contributor:

+1, an interface plus an internal implementation has better API compatibility
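The compatibility argument can be shown with plain Java: a method with a default body can be added to an interface without breaking existing implementors, whereas adding a field to a record or constructor breaks every call site. A minimal illustration with invented names (not the actual Spark classes):

```java
// v1 of the interface: only identifier() and tableType().
interface SummaryV1 {
    String identifier();
    String tableType();
}

// v2 adds a field without breaking v1 implementors: the new
// accessor has a default body, so old classes still compile.
interface SummaryV2 extends SummaryV1 {
    default String comment() { return null; }  // new field, safe default
}

// An implementation written against v1 works unchanged as a SummaryV2.
class OldImpl implements SummaryV2 {
    public String identifier() { return "db.t1"; }
    public String tableType() { return "TABLE"; }
}
```

This is why the reviewers prefer exposing TableSummary as an interface with an internal implementation class rather than as a concrete record.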

* @throws NoSuchNamespaceException If the namespace does not exist (optional).
*/
default TableSummary[] listTableSummaries(String[] namespace) throws NoSuchNamespaceException {
// By default, we assume that all tables have standard table type.
Contributor:

I think the default implementation should call loadTable for every name and get the table type from the properties.

urosstan-db (Contributor, author):

Yeah, that was another proposal: #50886 (comment). I agree. Do we have a constant string for the VIEW type?

pan3793 (Member) commented May 16, 2025:

Is it possible to make the method also accept some predicates? I'm working on ThriftServer meta APIs like GetSchemas and GetTables, and found it's hard to make the implementation for a v2 catalog as fast as for the session catalog. For example, we cannot push the schemaPattern and tablePattern predicates down to v2 catalogs as we do for the Hive external catalog.

urosstan-db (Contributor, author) commented May 16, 2025:

Good question. I think we can add a table pattern as well. If an implementor does not support an RPC with a table pattern specification, that implementor would need to do the filtering on the Spark side.
@cloud-fan What do you think about that?
We also have to define what the pattern is, and provide some util methods that can be used for filtering on the Spark side when the pattern format is not compliant with the implementor's pattern format and the filtering can't be done on the remote system.

Overall, I am not sure how often we would benefit from providing patterns; DSv2 interfaces usually take exact names, not patterns. Still, I think it is strictly better to add it if we define the util method I mentioned.
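One shape such a Spark-side util could take is sketched below, assuming a simple pattern grammar where '*' matches any character sequence. Both the class name and the grammar are invented for the example; they are not what the PR would necessarily adopt:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical client-side fallback: filter table names by a simple
// pattern where '*' matches any character sequence, used when the
// remote system cannot apply the pattern itself.
final class TablePatternUtil {
    static String[] filter(String[] names, String pattern) {
        // Quote every literal segment and turn each '*' into '.*'.
        String regex = Arrays.stream(pattern.split("\\*", -1))
            .map(Pattern::quote)
            .reduce((a, b) -> a + ".*" + b)
            .orElse("");
        Pattern p = Pattern.compile(regex);
        return Arrays.stream(names)
            .filter(n -> p.matcher(n).matches())
            .toArray(String[]::new);
    }
}
```

With such a fallback, a catalog that cannot push the pattern to the remote system can still list everything once and filter locally, at the cost of transferring more names.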

Contributor:

There are many list APIs in DSv2, and we can add overloads with an extra pattern string parameter later.

@urosstan-db urosstan-db changed the title Add listTableSummaries API to Data Source V2 Table Catalog API. [SPARK-52109] Add listTableSummaries API to Data Source V2 Table Catalog API. May 14, 2025
@urosstan-db urosstan-db changed the title [SPARK-52109] Add listTableSummaries API to Data Source V2 Table Catalog API. [SPARK-52109] Add listTableSummaries API to Data Source V2 Table Catalog API May 14, 2025
@urosstan-db urosstan-db marked this pull request as ready for review May 16, 2025 09:38
* @return an array of Identifiers for tables
* @throws NoSuchNamespaceException If the namespace does not exist (optional).
*/
default TableSummary[] listTableSummaries(String[] namespace) throws NoSuchNamespaceException {
Contributor:

Suggested change
default TableSummary[] listTableSummaries(String[] namespace) throws NoSuchNamespaceException {
default TableSummary[] listTableSummaries(String[] namespace) throws NoSuchNamespaceException, NoSuchTableException {

@urosstan-db urosstan-db force-pushed the SPARK-52109-Add-list-table-summaries-API-to-Table-catalog branch from b104a0b to 7159292 Compare May 16, 2025 14:28
@@ -79,6 +79,9 @@ private[sql] object V1Table {
def addV2TableProperties(v1Table: CatalogTable): Map[String, String] = {
val external = v1Table.tableType == CatalogTableType.EXTERNAL
val managed = v1Table.tableType == CatalogTableType.MANAGED
val tableTypeProperties: Map[String, String] = getV2TableType(v1Table)
Contributor:

this should be Option[(String, String)]

urosstan-db (Contributor, author):

Makes sense, the code would be cleaner.


val externalTableProperties = Map(
  TableCatalog.PROP_EXTERNAL -> "1",
  TableCatalog.PROP_LOCATION -> "s3://"
)
Contributor:

So HMS does not validate the path? Maybe it's safer to use a temp local path.

urosstan-db (Contributor, author):

No, it does not validate it on creation.

3 participants