[FEATURE REQUEST] Support for OpenLineage #425

collado-mike · 2024-11-05T16:59:44Z

Is your feature request related to a problem? Please describe.

OpenLineage is an open standard for reporting and collecting lineage information about processing jobs (i.e., which datasets were inputs to a processing job and which datasets were outputs). OpenLineage libraries are typically modeled as listeners or before/after hooks that are triggered by running processing jobs and have engine-specific code that collects information about the job and the datasets. That information is serialized as JSON and transmitted to a well-defined endpoint that either processes and stores that information or modifies and relays it to another endpoint.
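To make the shape of that JSON concrete, here is a rough sketch of the kind of run event a client might build. The top-level field names follow my understanding of the OpenLineage run event model; every value (namespaces, job and table names, the producer URL) is made up for illustration, and the event is trimmed (for example, it omits `schemaURL`), so treat it as a sketch rather than a schema-valid payload.

```python
import json
import uuid
from datetime import datetime, timezone


def dataset(namespace, name):
    """Return a minimal dataset reference: just a namespace and a name."""
    return {"namespace": namespace, "name": name}


def build_run_event(event_type, job_namespace, job_name, inputs, outputs):
    """Assemble a trimmed-down run event as a plain dict."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Hypothetical producer URI identifying the reporting client.
        "producer": "https://example.com/hypothetical-client",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": inputs,
        "outputs": outputs,
    }


# Hypothetical endpoint and table names, as the client would see them.
event = build_run_event(
    "COMPLETE",
    "spark-jobs",
    "daily_orders_rollup",
    inputs=[dataset("https://polaris.example.com", "polaris.sales.orders")],
    outputs=[dataset("https://polaris.example.com", "polaris.sales.orders_daily")],
)

# This JSON string is what a client would send to the lineage endpoint.
payload = json.dumps(event)
```

Note that the dataset entries carry only what the client can see: the endpoint it talked to and the table name under its own catalog alias.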

Describe the solution you'd like

Polaris is a good candidate for a proxying lineage endpoint because it has a canonical view of the datasets being processed and can augment the lineage payload with useful data. This is especially true when Polaris is used to access External catalogs, where the authoritative metadata lives somewhere else.

Spark or other OpenLineage clients can only report information about the datasets that can be gleaned from the client - e.g., the namespace of the data will be the Polaris endpoint that was used to access the data. The name of the catalog will be whatever name was assigned to the catalog in that particular application (e.g., a user might configure the catalog as either spark.sql.catalog.polaris or spark.sql.catalog.iceberg). A table might have been renamed or moved from another catalog.

Polaris, however, knows exactly where the dataset originated and can use the table metadata's UUID field to uniquely identify the dataset. It also knows the snapshot information (datasets can be versioned in OpenLineage) as well as the schema, table properties, and other information that could be reported as an OpenLineage facet.
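As a rough illustration of that augmentation step, the sketch below enriches each dataset in an incoming event with metadata from a lookup table. Everything here is hypothetical: the `polaris_table` facet name, its fields, the `CATALOG_METADATA` dict standing in for Polaris's own metadata store, and the sample UUID and snapshot ID. The `version` facet with a `datasetVersion` field reflects my understanding of the OpenLineage dataset version facet; real facets also carry bookkeeping fields that are left out here.

```python
import copy

# Hypothetical stand-in for what Polaris knows about each table,
# keyed by the (namespace, name) pair the client reported.
CATALOG_METADATA = {
    ("https://polaris.example.com", "polaris.sales.orders"): {
        "table_uuid": "9c12d441-03fe-4fea-a41c-11d08b128c07",
        "current_snapshot_id": 3051729675574597004,
        "origin_catalog": "upstream_glue",
    },
}


def augment_event(event, metadata_lookup):
    """Return a copy of the event with catalog-side facets added to each dataset."""
    enriched = copy.deepcopy(event)
    datasets = enriched.get("inputs", []) + enriched.get("outputs", [])
    for ds in datasets:
        meta = metadata_lookup.get((ds["namespace"], ds["name"]))
        if meta is None:
            continue  # Unknown to the catalog: pass the dataset through unchanged.
        facets = ds.setdefault("facets", {})
        # Hypothetical custom facet carrying the stable table identity.
        facets["polaris_table"] = {
            "tableUuid": meta["table_uuid"],
            "originCatalog": meta["origin_catalog"],
        }
        # Snapshot ID reported as the dataset version.
        facets["version"] = {"datasetVersion": str(meta["current_snapshot_id"])}
    return enriched


incoming = {
    "eventType": "COMPLETE",
    "inputs": [{"namespace": "https://polaris.example.com", "name": "polaris.sales.orders"}],
    "outputs": [{"namespace": "https://polaris.example.com", "name": "polaris.tmp.scratch"}],
}
augmented = augment_event(incoming, CATALOG_METADATA)
```

The function copies the event rather than mutating it, so the original payload stays available if the proxy needs to fall back to relaying it unmodified.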

Polaris doesn't make sense as a container for lineage information, as parsing and storing that information is not cheap. However, there is already precedent for an OpenLineage proxy, which can be used to augment the lineage information and pass it on to another service (Marquez is the reference implementation of the OpenLineage server endpoint).

Describe alternatives you've considered

No response

Additional context

No response

collado-mike added the enhancement New feature or request label Nov 5, 2024

github-project-automation bot added this to Basic Kanban Board Nov 5, 2024

collado-mike mentioned this issue Nov 5, 2024

[INTEGRATION] Support Apache Gravitino data catalog pluggability OpenLineage/OpenLineage#3160

Open

7 tasks
