Hi team.
Apologies in advance if this is not the right place for this question.
I saw a presentation where you talked about Spark streaming job support. But now, if I understand correctly, streaming job support is not a high priority and there are some issues implementing it. Do you have any plans (and a rough timeline?) to implement Spark streaming job support? I saw this issue; has anything changed?
We currently use the agent to collect lineage data for batch jobs, but we also need to collect lineage data for streaming jobs. Could you share some details about the issues you faced implementing streaming job support, or perhaps some PoC source code where it is implemented? That might be enough for our use case.
Thank you.
Hi Oleks, thank you for your question.
The streaming support feature has indeed been abandoned, and there are currently no plans to implement it. Unfortunately, we ran into a number of conceptual issues with collecting and representing data lineage in the context of streaming jobs. There were a few thought experiments and a few PoCs, but all of them proved impractical and essentially useless within the current Spline model.

Spline tracks data lineage with a focus on data that, by definition, has some identity and boundaries: you should be able to say that this piece of data stored in that location was produced by job A, which read data from file F, which was in turn produced by job B at a given date/time, and so on.

With streaming, everything gets more complicated. Data sources (e.g. Kafka topics) can be written and read at the same time by different clients, and you cannot rely on the time axis when determining data dependencies: the fact that client X wrote to a topic before client Y read from it does not mean that Y consumed X's data. You would need to analyse offsets and consumer groups, but that information is not widely available. Also, since it is streaming, the data flows are unbounded, meaning there is no logical end of writing or reading. There are no deletions or overwrites; a Kafka topic or an SQS queue is just a continuous flow of data.

So representing streaming pipelines in terms of data sources and jobs does not work as well as it does for batch; it simply does not fit the paradigm. We would have to rework the whole of Spline in order to track streaming jobs properly and meaningfully. That would essentially be a completely new project. Never say never, but for now I do not see any realistic possibility of it happening. I hope this answers your question, at least partially :)
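To make the "you cannot rely on the time axis" point concrete, here is a minimal, hypothetical sketch (not Spline code; all names are made up for illustration). It models a write as a single record at a broker-assigned offset and a read as a committed offset range per (topic, partition), and shows that a dependency exists only when the offsets overlap, regardless of wall-clock ordering:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Write:
    topic: str
    partition: int
    offset: int       # offset assigned by the broker
    wrote_at: float   # wall-clock time of the write

@dataclass(frozen=True)
class Read:
    topic: str
    partition: int
    from_offset: int  # inclusive
    to_offset: int    # exclusive
    read_at: float    # wall-clock time of the read

def depends_on(read: Read, write: Write) -> bool:
    """A read consumed a write iff the write's offset falls inside the
    read's committed offset range -- the timestamps play no role."""
    return (read.topic == write.topic
            and read.partition == write.partition
            and read.from_offset <= write.offset < read.to_offset)

# Client X writes *before* client Y reads, yet Y's offset range starts
# after X's record, so there is no data dependency:
x = Write("orders", 0, offset=41, wrote_at=100.0)
y = Read("orders", 0, from_offset=42, to_offset=50, read_at=200.0)
assert not depends_on(y, x)

# A later write *is* consumed, because its offset is inside the range:
z = Write("orders", 0, offset=45, wrote_at=150.0)
assert depends_on(y, z)
```

The catch, as noted above, is that a lineage agent rarely sees the committed offset ranges and consumer-group state needed to evaluate a check like this, which is one of the reasons the batch-oriented model does not transfer.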
Hi,
thanks for the detailed answer.
You mentioned that you had a few working PoCs; could you share a branch we could have a look at?
It seems that some of the limitations you faced might not be blockers in our use cases.