[FEATURE REQUEST] Support for OpenLineage #425

Open
collado-mike opened this issue Nov 5, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@collado-mike
Contributor

Is your feature request related to a problem? Please describe.

OpenLineage is an open standard for reporting and collecting lineage information about processing jobs (i.e., which datasets were inputs to a processing job and which datasets were outputs). OpenLineage libraries are typically modeled as listeners or before/after hooks that are triggered by running jobs and use engine-specific code to collect information about the job and its datasets. That information is serialized as JSON and transmitted to a well-defined endpoint that either processes and stores it or modifies it and relays it to another endpoint.
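To make the payload concrete, here is a minimal sketch of what a client might serialize. The field names follow the general shape of an OpenLineage run event, but this is a simplification (for instance, it omits `schemaURL`), and the job name, producer URL, namespaces, and dataset names are all made up for illustration.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs, event_type="COMPLETE"):
    """Build a simplified OpenLineage-style run event as a plain dict."""

    def dataset(namespace, name):
        return {"namespace": namespace, "name": name, "facets": {}}

    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Hypothetical producer URL; a real client reports its own.
        "producer": "https://example.com/hypothetical-producer",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "example-jobs", "name": job_name},
        "inputs": [dataset(ns, n) for ns, n in inputs],
        "outputs": [dataset(ns, n) for ns, n in outputs],
    }


# Hypothetical job that reads one table and writes another.
event = build_run_event(
    "daily_orders_rollup",
    inputs=[("https://polaris.example.com", "polaris.sales.orders")],
    outputs=[("https://polaris.example.com", "polaris.sales.orders_daily")],
)
payload = json.dumps(event)  # this JSON string is what gets sent to the endpoint
```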

Describe the solution you'd like

Polaris is a good candidate for a proxying lineage endpoint because it has a canonical view of the datasets being processed and can augment the lineage payload with useful data. This is especially true when Polaris is used to access External catalogs, where the authoritative metadata lives somewhere else.

Spark and other OpenLineage clients can only report information about the datasets that can be gleaned from the client. For example, the namespace of the data will be the Polaris endpoint that was used to access it, and the name of the catalog will be whatever name was assigned to the catalog in that particular application (e.g., a user might configure the catalog as either spark.sql.catalog.polaris or spark.sql.catalog.iceberg). A table might also have been renamed or moved from another catalog, which the client has no way of knowing.

Polaris, however, knows exactly where the dataset originated and can use the table metadata's UUID field to uniquely identify the dataset. It also knows the snapshot information (datasets can be versioned in OpenLineage) as well as the schema, table properties, and other information that could be reported as an OpenLineage facet.
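As a rough sketch of that augmentation step, the function below attaches catalog-derived facets to a client-reported dataset. Everything here is an assumption for illustration: the in-memory `CATALOG` dict stands in for Polaris's own metadata, the `polaris_table` facet name and the `<alias>.<namespace>.<table>` naming scheme are hypothetical, and required facet fields such as `_producer` and `_schemaURL` are left out for brevity. The `version` facet with `datasetVersion` is modeled on OpenLineage's dataset version facet.

```python
import copy

# Hypothetical stand-in for Polaris's catalog state, keyed by (namespace, table).
CATALOG = {
    ("sales", "orders"): {
        "table_uuid": "hypothetical-uuid-0001",
        "current_snapshot_id": 42,
    }
}


def augment_dataset(dataset, catalog=CATALOG):
    """Return a copy of a client-reported dataset with catalog-derived facets."""
    # Assumes names look like "<catalog alias>.<namespace>.<table>".
    parts = dataset["name"].split(".")
    if len(parts) != 3:
        return dataset  # unexpected shape: pass through unchanged
    _alias, namespace, table = parts
    meta = catalog.get((namespace, table))
    if meta is None:
        return dataset  # unknown table: pass through unchanged
    enriched = copy.deepcopy(dataset)
    facets = enriched.setdefault("facets", {})
    # Hypothetical custom facet carrying the table metadata UUID.
    facets["polaris_table"] = {"tableUuid": meta["table_uuid"]}
    # Snapshot reported as the dataset version.
    facets["version"] = {"datasetVersion": str(meta["current_snapshot_id"])}
    return enriched


# Two clients that configured different catalog aliases for the same table.
a = augment_dataset({"namespace": "https://polaris.example.com", "name": "polaris.sales.orders"})
b = augment_dataset({"namespace": "https://polaris.example.com", "name": "iceberg.sales.orders"})
same_table = a["facets"]["polaris_table"] == b["facets"]["polaris_table"]
```

Although the two clients report different dataset names, both resolve to the same UUID facet, which is the property that would let a downstream consumer treat them as one dataset.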

Polaris doesn't make sense as a container for lineage information, as parsing and storing that information is not cheap. However, there is already precedent for an OpenLineage proxy, which can be used to augment the lineage information and pass it on to another service (Marquez is the reference implementation of the OpenLineage server endpoint).
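A proxy of this kind could be as small as the following sketch: parse the incoming event, run each dataset through an enrichment callback, and forward the result without storing anything. This is not an existing Polaris or OpenLineage API; `relay_event`, `post_json`, and the `enrich` callback are hypothetical names, and the downstream URL is whatever lineage endpoint the deployment uses.

```python
import json
import urllib.request


def post_json(url, body):
    """POST a JSON body and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


def relay_event(raw_body, enrich, downstream_url, send=post_json):
    """Enrich every input and output dataset in an event, then forward it."""
    event = json.loads(raw_body)
    for key in ("inputs", "outputs"):
        event[key] = [enrich(ds) for ds in event.get(key, [])]
    # Nothing is persisted here; the event is only modified and passed along.
    return send(downstream_url, json.dumps(event).encode("utf-8"))
```

Passing the sender in as a parameter keeps the relay logic testable without a network and leaves the choice of downstream service open.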

Describe alternatives you've considered

No response

Additional context

No response
