[SPARK-52109] Add listTableSummaries API to Data Source V2 Table Catalog API #50886
base: master
Conversation
default TableSummary[] listTableSummaries(String[] namespace) throws NoSuchNamespaceException {
  // By default, we assume that all tables have standard table type.
  return Arrays.stream(this.listTables(namespace))
    .map(identifier -> new TableSummary(identifier, TableSummary.REGULAR_TABLE_TYPE))
Another default implementation would be to call loadTable for each table and check its properties to deduce the table type.
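A rough sketch of that alternative inside TableCatalog (it reuses the TableSummary constructor and REGULAR_TABLE_TYPE constant from the diff above; the "table_type" property key is an assumption, since the PR does not define the property name):

default TableSummary[] listTableSummaries(String[] namespace) throws NoSuchNamespaceException {
  return Arrays.stream(this.listTables(namespace))
      .map(identifier -> {
        try {
          // Load each table and read its type from the table properties.
          Table table = this.loadTable(identifier);
          String tableType =
              table.properties().getOrDefault("table_type", TableSummary.REGULAR_TABLE_TYPE);
          return new TableSummary(identifier, tableType);
        } catch (NoSuchTableException e) {
          // The table disappeared between listTables and loadTable; fall back to the default type.
          return new TableSummary(identifier, TableSummary.REGULAR_TABLE_TYPE);
        }
      })
      .toArray(TableSummary[]::new);
}

This trades the single listTables call for one loadTable per table, so it is more accurate but also more expensive against remote catalogs.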
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableSummary.java
package org.apache.spark.sql.connector.catalog;

import static com.google.common.base.Preconditions.checkNotNull;
I introduced a new class here as the return type to stay flexible for future changes, e.g. to avoid deprecating the method if we need to add a third property to the table summary.
I think adding new fields to a record class would be a breaking change too. For example, a few more fields would need to be added to make the API satisfy the Thrift Server case:
Lines 55 to 67 in c82af8d
private static final TableSchema RESULT_SET_SCHEMA = new TableSchema()
    .addStringColumn("TABLE_CAT", "Catalog name. NULL if not applicable.")
    .addStringColumn("TABLE_SCHEM", "Schema name.")
    .addStringColumn("TABLE_NAME", "Table name.")
    .addStringColumn("TABLE_TYPE", "The table type, e.g. \"TABLE\", \"VIEW\", etc.")
    .addStringColumn("REMARKS", "Comments about the table.")
    .addStringColumn("TYPE_CAT", "The types catalog.")
    .addStringColumn("TYPE_SCHEM", "The types schema.")
    .addStringColumn("TYPE_NAME", "Type name.")
    .addStringColumn("SELF_REFERENCING_COL_NAME",
      "Name of the designated \"identifier\" column of a typed table.")
    .addStringColumn("REF_GENERATION",
      "Specifies how values in SELF_REFERENCING_COL_NAME are created.");
Yeah, adding new fields would be a breaking change for object creation, but not for TableCatalog.
Adding too many fields can make this API more expensive, not so much because of the fetch time from remote catalogs/servers (that is less important), but because certain implementors may not be able to return all fields in their listTableSummaries implementations (e.g. the JDBC API).
Which fields do you suggest adding?
Added the Evolving annotation here until we make this class stable, so you can easily add other fields in the future.
certain implementors may not be able to return all fields in their listTableSummaries implementations
@urosstan-db it's fine to leave the unsupported fields as NULL. The Spark Thrift Server I mentioned is a concrete example that could benefit from this API directly. Given STS is a built-in Spark service, it's better to cover it than nothing. (I suppose you are adding this API for external system usage.)
Which fields do you suggest adding?
TableSummary can be an interface, with a default implementation
interface TableSummary {
  Identifier identifier();
  String tableType();
}

case class TableSummaryImpl(identifier: Identifier, tableType: String) extends TableSummary
+1, interface + internal implementation has better API compatibility
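A minimal sketch of why the interface shape evolves more gracefully (the remarks() accessor and the of factory below are hypothetical additions, not part of this PR):

import org.apache.spark.sql.connector.catalog.Identifier;

public interface TableSummary {
  Identifier identifier();
  String tableType();

  // New accessors can later be added as default methods without breaking
  // existing implementations or call sites.
  default String remarks() { return null; }

  // Implementation classes stay hidden behind a factory.
  static TableSummary of(Identifier identifier, String tableType) {
    return new TableSummary() {
      @Override public Identifier identifier() { return identifier; }
      @Override public String tableType() { return tableType; }
    };
  }
}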
 * @throws NoSuchNamespaceException If the namespace does not exist (optional).
 */
default TableSummary[] listTableSummaries(String[] namespace) throws NoSuchNamespaceException {
  // By default, we assume that all tables have standard table type.
I think the default implementation should call loadTable for every name and get the table type from the table's properties.
Yeah, that was another proposal: #50886 (comment). I agree. Do we have a constant string for the VIEW type?
Is it possible to make the method also accept some predicates? I'm working on Thrift Server meta APIs like GetSchemas and GetTables, and found it's hard to make the implementation for a v2 catalog as fast as for the session catalog. For example, we cannot push the schemaPattern and tablePattern predicates down to v2 catalogs as we do for the Hive external catalog.
Good question. I think we can add a table pattern as well. If some implementor does not support an RPC with a table pattern specification, that implementor would need to do the filtering on the Spark side.
@cloud-fan What do you think about that?
We would also have to define what the pattern is, and add some util methods that can be used for filtering on the Spark side in case the pattern format is not compliant with the implementor's pattern format and filtering can't be done on the remote system.
Overall, I am not sure how often we would benefit from providing patterns; DSv2 interfaces usually take exact names, not patterns, but I think it is strictly better to add it if we define the util method I mentioned.
There are many list APIs in DS v2, and we can add overloads with an extra pattern string parameter later.
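If such an overload is added, the Spark-side fallback filter could look roughly like this (a sketch only: the PR does not define a pattern grammar, so the '*'/'%' and '.'/'_' wildcard rules below are assumptions, and a TableSummary#identifier() accessor is assumed as in the interface sketch above):

import java.util.Arrays;
import java.util.regex.Pattern;

final class TableSummaryPatternUtil {
  private TableSummaryPatternUtil() {}

  // Keeps only the summaries whose table name matches the pattern.
  static TableSummary[] filterByTableName(TableSummary[] summaries, String pattern) {
    Pattern regex = Pattern.compile(toRegex(pattern));
    return Arrays.stream(summaries)
        .filter(s -> regex.matcher(s.identifier().name()).matches())
        .toArray(TableSummary[]::new);
  }

  // Translates the assumed wildcard syntax into a Java regular expression.
  private static String toRegex(String pattern) {
    StringBuilder sb = new StringBuilder();
    for (char c : pattern.toCharArray()) {
      switch (c) {
        case '*': case '%': sb.append(".*"); break;
        case '.': case '_': sb.append('.'); break;
        default: sb.append(Pattern.quote(String.valueOf(c)));
      }
    }
    return sb.toString();
  }
}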
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
Co-authored-by: Wenchen Fan <[email protected]>
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
 * @return an array of table summaries
 * @throws NoSuchNamespaceException If the namespace does not exist (optional).
 */
default TableSummary[] listTableSummaries(String[] namespace) throws NoSuchNamespaceException {
default TableSummary[] listTableSummaries(String[] namespace) throws NoSuchNamespaceException {
default TableSummary[] listTableSummaries(String[] namespace) throws NoSuchNamespaceException, NoSuchTableException {
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java
@@ -79,6 +79,9 @@ private[sql] object V1Table {
  def addV2TableProperties(v1Table: CatalogTable): Map[String, String] = {
    val external = v1Table.tableType == CatalogTableType.EXTERNAL
    val managed = v1Table.tableType == CatalogTableType.MANAGED
    val tableTypeProperties: Map[String, String] = getV2TableType(v1Table)
this should be Option[(String, String)]
Makes sense, the code would be cleaner.
val externalTableProperties = Map(
  TableCatalog.PROP_EXTERNAL -> "1",
  TableCatalog.PROP_LOCATION -> "s3://"
So HMS does not validate the path? Maybe it's safer to use a temp local path.
No, it does not validate it on creation.
Co-authored-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This PR adds a new API to the TableCatalog class, called listTableSummaries. The return value should be an array of TableSummary objects. JDBCTableCatalog would invoke listTables and treat all tables as foreign; that way, we need just one SQL command on the remote system to get table summaries.
Why are the changes needed?
Since the DSv2 Table class can represent different table types, we currently need to do listTables + loadTable operations, which can be expensive. I propose adding a new interface to list table summaries, a smaller amount of data that is enough to list all the information needed for the SHOW TABLES command.
Many remote systems (and implementors of DSv2 TableCatalog) can implement the newly added API with just one RPC.
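Roughly, the difference in call patterns looks like this (a sketch only: catalog and namespace are placeholders, the "table_type" property key is an assumption, and error handling is omitted):

// Today: one listTables RPC plus one loadTable RPC per table.
Identifier[] idents = catalog.listTables(namespace);
for (Identifier ident : idents) {
  Table table = catalog.loadTable(ident);
  String tableType = table.properties().get("table_type");
}

// With this PR: a catalog can answer the same question with a single call,
// ideally backed by a single RPC or SQL command on the remote system.
TableSummary[] summaries = catalog.listTableSummaries(namespace);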
Does this PR introduce any user-facing change?
A new API called listTableSummaries is added to TableCatalog.
How was this patch tested?
Added new test cases in existing suites
Was this patch authored or co-authored using generative AI tooling?
No