[SPARK-52109] Add listTableSummaries API to Data Source V2 Table Catalog API #50886


Open · wants to merge 10 commits into base: master
Conversation

@urosstan-db (Contributor) commented May 14, 2025

What changes were proposed in this pull request?

  • Add a new API, listTableSummaries, to the DSv2 TableCatalog class. The return value is an array of TableSummary objects.
  • Added a table type property that represents the type of a v2 table.
  • When a v1 table is converted to a v2 table, the table type property is also set on the v2 table.
  • JDBCTableCatalog invokes listTables and treats all tables as foreign. That way, only one SQL command against the remote system is needed to get the table summaries.

Why are the changes needed?

Since the DSv2 Table class can represent different table types, we currently need to do listTables + loadTable operations, which can be expensive.

I propose adding a new interface that lists table summaries: a smaller amount of data that is still sufficient for the SHOW TABLES command.
Many remote systems (and implementors of DSv2 TableCatalog) can implement the newly added API with just one RPC.
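To illustrate the saving, here is a minimal, self-contained sketch. All types below are simplified stand-ins for the real DSv2 interfaces, not Spark code: answering SHOW TABLES via listTables plus a loadTable per table costs N+1 remote calls, while a listTableSummaries-style call costs one.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the real catalog/identifier types; not Spark code.
class FakeCatalog {
    int remoteCalls = 0;  // counts round trips to the remote system
    private final List<String> tables = List.of("t1", "t2", "t3");

    List<String> listTables() { remoteCalls++; return tables; }

    String loadTableType(String name) { remoteCalls++; return "TABLE"; }

    // A listTableSummaries-style call: one round trip returns name + type pairs.
    List<String[]> listTableSummaries() {
        remoteCalls++;
        List<String[]> out = new ArrayList<>();
        for (String t : tables) out.add(new String[] {t, "TABLE"});
        return out;
    }
}
```

With three tables, the listTables + loadTable path issues four remote calls, while the summary call issues one, which is the cost model the PR description argues from.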

Does this PR introduce any user-facing change?

A new API, listTableSummaries, is added to TableCatalog.

How was this patch tested?

Added new test cases in existing suites.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label May 14, 2025
default TableSummary[] listTableSummaries(String[] namespace) throws NoSuchNamespaceException {
  // By default, we assume that all tables have the standard table type.
  return Arrays.stream(this.listTables(namespace))
      .map(identifier -> new TableSummary(identifier, TableSummary.REGULAR_TABLE_TYPE))
      .toArray(TableSummary[]::new);
}
urosstan-db (Contributor, author):

Another default implementation would be to load the tables and check each table's properties to deduce its type.
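That alternative could look roughly like the sketch below. The SimpleTable and AltCatalog types are simplified stand-ins, not the real Spark interfaces, and the "table-type" property key is a hypothetical name chosen for the example:

```java
import java.util.Map;

// Simplified stand-ins for the DSv2 types; not the real Spark interfaces.
interface SimpleTable { Map<String, String> properties(); }

interface AltCatalog {
    String[] listTables(String[] namespace);
    SimpleTable loadTable(String name);

    // Alternative default: load every table and read a (hypothetical)
    // "table-type" property to deduce the type, falling back to "TABLE".
    default String[][] listTableSummaries(String[] namespace) {
        String[] names = listTables(namespace);
        String[][] summaries = new String[names.length][];
        for (int i = 0; i < names.length; i++) {
            String type = loadTable(names[i]).properties()
                .getOrDefault("table-type", "TABLE");
            summaries[i] = new String[] {names[i], type};
        }
        return summaries;
    }
}
```

The trade-off the thread discusses is visible here: this default needs one loadTable round trip per table, whereas the listTables-based default needs only one call in total.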

package org.apache.spark.sql.connector.catalog;

import static com.google.common.base.Preconditions.checkNotNull;

urosstan-db (Contributor, author):

I made a new class here as the return type to be more flexible for future changes, e.g. to avoid deprecating the method if we need to introduce a third property of the table summary.

pan3793 (Member):

I think adding new fields to a record class would create breaking changes too. For example, a few fields would need to be added to make the API satisfy the ThriftServer case:

private static final TableSchema RESULT_SET_SCHEMA = new TableSchema()
.addStringColumn("TABLE_CAT", "Catalog name. NULL if not applicable.")
.addStringColumn("TABLE_SCHEM", "Schema name.")
.addStringColumn("TABLE_NAME", "Table name.")
.addStringColumn("TABLE_TYPE", "The table type, e.g. \"TABLE\", \"VIEW\", etc.")
.addStringColumn("REMARKS", "Comments about the table.")
.addStringColumn("TYPE_CAT", "The types catalog.")
.addStringColumn("TYPE_SCHEM", "The types schema.")
.addStringColumn("TYPE_NAME", "Type name.")
.addStringColumn("SELF_REFERENCING_COL_NAME",
"Name of the designated \"identifier\" column of a typed table.")
.addStringColumn("REF_GENERATION",
"Specifies how values in SELF_REFERENCING_COL_NAME are created.");

urosstan-db (Contributor, author):

Yeah, adding new fields would be a breaking change for object creation, but not for TableCatalog.
Adding too many fields can make this API more expensive, not only because of the fetch time from remote catalogs/servers (which is less important), but because certain implementors may not return all fields from their listTableSummaries implementations (e.g. the JDBC API).
Which options do you suggest adding?

urosstan-db (Contributor, author) commented May 16, 2025:

Added the Evolving annotation here until we make this class stable, so other fields can easily be added in the future.

pan3793 (Member) commented May 16, 2025:

certain implementors may not return all fields by their listTableSummaries implementations

@urosstan-db it's fine to leave the unsupported fields as NULL; the Spark Thrift Server I mentioned is a concrete example that could benefit from this API directly. Given that STS is a built-in Spark service, it's better to cover that case than nothing. (I suppose you are adding this API for external system usage.)

Which options do you suggest to add?

TableSummary can be an interface, with a default implementation

interface TableSummary {
  Identifier identifier();
  String tableType();
}
case class TableSummaryImpl(identifier: Identifier, tableType: String) extends TableSummary

Contributor:

+1, an interface plus an internal implementation has better API compatibility
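The compatibility argument can be shown with plain Java: a method with a default body can be added to an interface without breaking existing implementors, whereas adding a field to a record or constructor breaks every call site. A minimal illustration with invented names (not the actual Spark classes):

```java
// v1 of the interface: only identifier() and tableType().
interface SummaryV1 {
    String identifier();
    String tableType();
}

// v2 adds a field without breaking v1 implementors: the new
// accessor has a default body, so old classes still compile.
interface SummaryV2 extends SummaryV1 {
    default String comment() { return null; }  // new field, safe default
}

// An implementation written against v1 works unchanged as a SummaryV2.
class OldImpl implements SummaryV2 {
    public String identifier() { return "db.t1"; }
    public String tableType() { return "TABLE"; }
}
```

This is why the reviewers prefer exposing TableSummary as an interface with an internal implementation class rather than as a concrete record.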

* @throws NoSuchNamespaceException If the namespace does not exist (optional).
*/
default TableSummary[] listTableSummaries(String[] namespace) throws NoSuchNamespaceException {
// By default, we assume that all tables have standard table type.
Contributor:

I think the default implementation should call loadTable for every name and get the table type from the properties.

urosstan-db (Contributor, author):

Yeah, that was another proposal: #50886 (comment). I agree. Do we have a constant string for the VIEW type?

pan3793 (Member) commented May 16, 2025:

Is it possible to make the method also accept some predicates? I'm working on ThriftServer meta APIs like GetSchemas and GetTables, and found it's hard to make the implementation for a v2 catalog as fast as for the session catalog. For example, we cannot push the schemaPattern and tablePattern predicates down to v2 catalogs as we do for the Hive external catalog.

urosstan-db (Contributor, author) commented May 16, 2025:

Good question. I think we can add a table pattern as well. If an implementor does not support an RPC with a table pattern specification, that implementor would need to do the filtering on the Spark side.
@cloud-fan What do you think about that?
We also have to define what the pattern is, and provide some util methods that can be used for filtering on the Spark side when the pattern format is not compliant with the implementor's pattern format and the filtering can't be done on the remote system.

Overall, I am not sure how often we would benefit from providing patterns; DSv2 interfaces usually take exact names, not patterns. Still, I think it is strictly better to add it if we define the util method I mentioned.
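One shape such a Spark-side util could take is sketched below, assuming a simple pattern grammar where '*' matches any character sequence. Both the class name and the grammar are invented for the example; they are not what the PR would necessarily adopt:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical client-side fallback: filter table names by a simple
// pattern where '*' matches any character sequence, used when the
// remote system cannot apply the pattern itself.
final class TablePatternUtil {
    static String[] filter(String[] names, String pattern) {
        // Quote every literal segment and turn each '*' into '.*'.
        String regex = Arrays.stream(pattern.split("\\*", -1))
            .map(Pattern::quote)
            .reduce((a, b) -> a + ".*" + b)
            .orElse("");
        Pattern p = Pattern.compile(regex);
        return Arrays.stream(names)
            .filter(n -> p.matcher(n).matches())
            .toArray(String[]::new);
    }
}
```

With such a fallback, a catalog that cannot push the pattern to the remote system can still list everything once and filter locally, at the cost of transferring more names.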

Contributor:

There are many list APIs in DSv2, and we can add overloads with an extra pattern string parameter later.

@urosstan-db urosstan-db changed the title Add listTableSummaries API to Data Source V2 Table Catalog API. [SPARK-52109] Add listTableSummaries API to Data Source V2 Table Catalog API. May 14, 2025
@urosstan-db urosstan-db changed the title [SPARK-52109] Add listTableSummaries API to Data Source V2 Table Catalog API. [SPARK-52109] Add listTableSummaries API to Data Source V2 Table Catalog API May 14, 2025
@urosstan-db urosstan-db marked this pull request as ready for review May 16, 2025 09:38
* @return an array of Identifiers for tables
* @throws NoSuchNamespaceException If the namespace does not exist (optional).
*/
default TableSummary[] listTableSummaries(String[] namespace) throws NoSuchNamespaceException {
Contributor:

Suggested change
default TableSummary[] listTableSummaries(String[] namespace) throws NoSuchNamespaceException {
default TableSummary[] listTableSummaries(String[] namespace) throws NoSuchNamespaceException, NoSuchTableException {

@urosstan-db urosstan-db force-pushed the SPARK-52109-Add-list-table-summaries-API-to-Table-catalog branch from b104a0b to 7159292 Compare May 16, 2025 14:28
@@ -79,6 +79,9 @@ private[sql] object V1Table {
def addV2TableProperties(v1Table: CatalogTable): Map[String, String] = {
val external = v1Table.tableType == CatalogTableType.EXTERNAL
val managed = v1Table.tableType == CatalogTableType.MANAGED
val tableTypeProperties: Map[String, String] = getV2TableType(v1Table)
Contributor:

this should be Option[(String, String)]

urosstan-db (Contributor, author):

Makes sense, the code would be cleaner.


val externalTableProperties = Map(
  TableCatalog.PROP_EXTERNAL -> "1",
  TableCatalog.PROP_LOCATION -> "s3://"
)
Contributor:

So HMS does not validate the path? Maybe it's safer to use a temp local path.

urosstan-db (Contributor, author):

No, it does not validate it on creation.

3 participants