Hi team.
Apologies in advance if this is not the right place for this question.
I saw a presentation where you talked about Spark streaming job support. But now, if I understand correctly, streaming job support is not a high priority and there are some issues implementing it. Do you have any plans (and a rough timeline?) to implement Spark streaming job support? I saw this issue; has anything changed?
We currently use the agent to collect lineage data for batch jobs, but we also need to collect lineage data for streaming jobs. Could you share some details about the issues you faced implementing streaming job support, or perhaps some PoC source code where it is implemented? That might be enough for our use case.
Thank you.
Hi Oleks, thank you for your question.
The streaming support feature has indeed been abandoned, and there are currently no plans to implement it. Unfortunately, we ran into a number of conceptual issues with collecting and representing data lineage in the context of streaming jobs. There were a few thought experiments and a few PoCs, but all of them proved impractical and essentially useless within the current Spline model.

Spline tracks data lineage with a focus on data that, by definition, has some identity and boundaries: you should be able to say that this piece of data stored in that location was produced by job A, which read data from file F, which was in turn produced by job B at a given date/time, and so on.

With streaming, everything gets more complicated. Data sources (e.g. Kafka topics) can be written and read at the same time by different clients, and you cannot rely on the time axis when determining data dependencies: the fact that client X wrote to a topic before client Y read from it does not mean that Y consumed X's data. You would need to analyse offsets and consumer groups, but that information is not widely available. Also, since it is streaming, the data flows are unbounded, meaning there is no logical end of writing or reading. There are no deletions or overwrites; a Kafka topic or an SQS queue is just a continuous flow of data.

So representing streaming pipelines in terms of data sources and jobs does not work as well as it does for batch; it simply does not fit the paradigm. We would have to rework the whole of Spline in order to track streaming jobs properly and meaningfully. That would essentially be a completely new project. Never say never, but for now I do not see any realistic possibility of it happening. I hope this answers your question, at least partially :)
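To make the "you cannot rely on the time axis" point concrete, here is a minimal, hypothetical sketch (not Spline code; all names are made up for illustration). It models a write as a single record at a broker-assigned offset and a read as a committed offset range per (topic, partition), and shows that a dependency exists only when the offsets overlap, regardless of wall-clock ordering:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Write:
    topic: str
    partition: int
    offset: int       # offset assigned by the broker
    wrote_at: float   # wall-clock time of the write

@dataclass(frozen=True)
class Read:
    topic: str
    partition: int
    from_offset: int  # inclusive
    to_offset: int    # exclusive
    read_at: float    # wall-clock time of the read

def depends_on(read: Read, write: Write) -> bool:
    """A read consumed a write iff the write's offset falls inside the
    read's committed offset range -- the timestamps play no role."""
    return (read.topic == write.topic
            and read.partition == write.partition
            and read.from_offset <= write.offset < read.to_offset)

# Client X writes *before* client Y reads, yet Y's offset range starts
# after X's record, so there is no data dependency:
x = Write("orders", 0, offset=41, wrote_at=100.0)
y = Read("orders", 0, from_offset=42, to_offset=50, read_at=200.0)
assert not depends_on(y, x)

# A later write *is* consumed, because its offset is inside the range:
z = Write("orders", 0, offset=45, wrote_at=150.0)
assert depends_on(y, z)
```

The catch, as noted above, is that a lineage agent rarely sees the committed offset ranges and consumer-group state needed to evaluate a check like this, which is one of the reasons the batch-oriented model does not transfer.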
Hi,
thanks for the detailed answer.
You mentioned that you had a few working PoCs; could you share a branch we could have a look at?
It seems that some of the limitations you faced might not be blockers in our use cases.