Is your feature request related to a problem? Please describe.
OpenLineage is an open standard for reporting and collecting lineage information about processing jobs (i.e., which datasets were inputs to a processing job and which datasets were outputs). OpenLineage libraries are typically modeled as listeners or before/after hooks that are triggered by running processing jobs and contain engine-specific code that collects information about the job and its datasets. That information is serialized as JSON and transmitted to a well-defined endpoint that either processes and stores it or modifies and relays it to another endpoint.
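For readers unfamiliar with the payload, here is a minimal sketch of the kind of JSON event a client might emit. The top-level field names (`eventType`, `eventTime`, `producer`, `run`, `job`, `inputs`, `outputs`) follow the OpenLineage run event model; the helper `build_run_event`, the producer URL, and every namespace and name value are made up for illustration, and a real client library would add more fields (such as the schema URL and facets).

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_namespace, job_name, inputs, outputs, event_type="COMPLETE"):
    """Build a minimal OpenLineage-style run event as a plain dict.

    `inputs` and `outputs` are lists of (namespace, name) pairs.
    This is a hypothetical helper, not part of any OpenLineage library.
    """
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Placeholder producer URI, assumed for this sketch.
        "producer": "https://example.com/hypothetical-producer",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
    }


# All values below are illustrative.
event = build_run_event(
    job_namespace="example-namespace",
    job_name="daily_orders_rollup",
    inputs=[("polaris://example-host", "sales.orders")],
    outputs=[("polaris://example-host", "sales.orders_daily")],
)
payload = json.dumps(event)  # this string is what would be sent to the endpoint
```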
Describe the solution you'd like
Polaris is a good candidate for a proxying lineage endpoint because it has a canonical view of the datasets being processed and can augment the lineage payload with useful data. This is especially true when Polaris is used to access External catalogs, where the authoritative metadata lives somewhere else.
Spark or other OpenLineage clients can only report information about the datasets that can be gleaned from the client - e.g., the namespace of the data will be the Polaris endpoint that was used to access the data. The name of the catalog will be whatever name was assigned to the catalog in that particular application (e.g., a user might configure the catalog as either `spark.sql.catalog.polaris` or `spark.sql.catalog.iceberg`). A table might also have been renamed or moved from another catalog, which the client cannot see.
Polaris, however, knows exactly where the dataset originated and can use the table metadata's UUID field to uniquely identify the dataset. It also knows the snapshot information (datasets can be versioned in OpenLineage) as well as the schema, table properties, and other information that could be reported as an OpenLineage facet.
Polaris doesn't make sense as a container for lineage information, as parsing and storing that information is not cheap. However, there is already precedent for an OpenLineage proxy, which can be used to augment the lineage information and pass it on to another service (Marquez is the reference implementation of the OpenLineage server endpoint).
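To make the proposal concrete, here is a rough sketch of the augmentation step such a proxy might perform before relaying an event. Everything in it is an assumption for discussion: `augment_event`, the `lookup_table` callback, the `polaris_table` facet name, and its `tableUuid`/`snapshotId` fields are hypothetical and not an existing Polaris or OpenLineage API. A real facet would also need the bookkeeping fields the OpenLineage spec requires, which are omitted here, and the forwarding call to the downstream service is left out.

```python
import copy


def augment_event(event, lookup_table):
    """Return a copy of `event` with catalog-side metadata attached to each dataset.

    `lookup_table(namespace, name)` is a hypothetical callback that returns a dict
    with "table_uuid" and "snapshot_id", or None if the table is unknown.
    """
    enriched = copy.deepcopy(event)  # leave the client's original payload untouched
    datasets = enriched.get("inputs", []) + enriched.get("outputs", [])
    for dataset in datasets:
        meta = lookup_table(dataset["namespace"], dataset["name"])
        if meta is None:
            continue  # unknown to the catalog: relay the dataset unchanged
        facets = dataset.setdefault("facets", {})
        # Hypothetical custom facet carrying what only the catalog knows.
        facets["polaris_table"] = {
            "tableUuid": meta["table_uuid"],
            "snapshotId": meta["snapshot_id"],
        }
    return enriched


# Illustrative in-memory stand-in for the catalog lookup.
known_tables = {
    ("polaris://example-host", "sales.orders"): {
        "table_uuid": "11111111-2222-3333-4444-555555555555",
        "snapshot_id": 42,
    },
}

client_event = {
    "eventType": "COMPLETE",
    "inputs": [{"namespace": "polaris://example-host", "name": "sales.orders"}],
    "outputs": [{"namespace": "polaris://example-host", "name": "sales.unknown"}],
}

enriched_event = augment_event(client_event, lambda ns, name: known_tables.get((ns, name)))
```

Keying the enrichment on the table UUID rather than on the client-reported name is what would let a downstream service recognize the same dataset across renames or differently named catalog configurations.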
Describe alternatives you've considered
No response
Additional context
No response