Lineage #94

shreyashankar · 2024-10-11T16:10:07Z

Reduce Operation Lineage

From discord:

One use case I'm really interested in is [pre]computing a set of "reports" / outputs from a large set of documents, and then being able to reuse that computation when I filter documents to the applicable reports that have only those documents as "Sources"

i.e

if full corpus is a, b, c, d, e, f -> generates reports 1 (a, b, c) + 2 (b, c, d) + 3 (c, d, e) + 4 (d, e, f)
and then I want to see the "reports" contributed by docs d, e, f = 2,3,4

My proposal is to support a lineage param in the output, e.g.,

name: opname
type: reduce
reduce_key: ...
prompt: ...
output:
  schema: ...
  lineage:
    - keyname1
    - keyname2

Then, for every document in the output, there should be a key opname_lineage holding a list of key-value pairs: one entry per document in the group that the output document was derived from, covering every key listed in lineage.
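The proposed behavior can be sketched in a few lines of Python. This is only an illustration of the idea, not DocETL's implementation: reduce_with_lineage and reduce_fn are hypothetical names, and reduce_fn stands in for the LLM call that a real reduce operation would make.

```python
from collections import defaultdict


def reduce_with_lineage(items, op_name, reduce_key, lineage_keys, reduce_fn):
    """Group items by reduce_key, reduce each group, and attach lineage.

    reduce_fn is a hypothetical stand-in for the real (LLM-backed) reduce step.
    """
    groups = defaultdict(list)
    for item in items:
        groups[item[reduce_key]].append(item)

    outputs = []
    for key_value, group in groups.items():
        result = {reduce_key: key_value, **reduce_fn(group)}
        # One dict of lineage key/value pairs per input item in the group.
        result[f"{op_name}_lineage"] = [
            {key: item[key] for key in lineage_keys} for item in group
        ]
        outputs.append(result)
    return outputs


# Made-up input items, purely for illustration.
items = [
    {"doc_id": "a", "source": "s1", "topic": "x"},
    {"doc_id": "b", "source": "s2", "topic": "x"},
    {"doc_id": "c", "source": "s3", "topic": "y"},
]

outputs = reduce_with_lineage(
    items,
    op_name="opname",
    reduce_key="topic",
    lineage_keys=["doc_id", "source"],
    reduce_fn=lambda group: {"count": len(group)},
)
print(outputs[0]["opname_lineage"])
```

Each output carries the lineage inline, so a downstream consumer can see which inputs fed it without any extra lookup.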

Querying Pipeline Lineage

It would be nice to log all the pipeline lineage to sqlite & have users be able to query it (e.g., find all the reports contributed by certain upstream/input docs). We'd have to think of a good data model & query patterns.

garuna-m6 · 2024-10-18T18:44:54Z

@shreyashankar it took more time than I thought to get the OpenAI keys :( . Trying to understand the issue here: we need tracing in the logs for lineage in reduce operations (don't want the sql setup anywhere in the pipeline). With the existing verbose functionality, should we have logging like reduce : lineage keys [if used] : reduce operation output 👀 ? Would need some guidance.

shreyashankar · 2024-10-19T20:50:41Z

No worries!

I think the logging can be set at a pipeline level; in the top level of the config someone can specify the path to store a sqlite db of the logs; then, we can add ids to each document in the input and pass them through each operation in the pipeline.

For each operation, we could create a table of the outputs, with an additional "id" column. We could also create a dependency table for each operation to link the operation's outputs with the id(s) of its inputs:

CREATE TABLE {operation}_dependencies (
    dependent_id INTEGER REFERENCES dependent_table(id),
    main_id INTEGER REFERENCES main_table(id),
    PRIMARY KEY (dependent_id, main_id)
);

So, each operation has its own output table, as well as a dependencies table. This can enable both forward and backwards tracing.
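A minimal sketch of how such a dependency table could support tracing in both directions, using the report example from the top of the issue. The table names (input_docs, reports, reports_dependencies) and the helper functions are assumptions made for illustration; dependent_id is read here as the operation's output and main_id as its input.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE input_docs (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE reports (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE reports_dependencies (
        dependent_id INTEGER REFERENCES reports(id),
        main_id INTEGER REFERENCES input_docs(id),
        PRIMARY KEY (dependent_id, main_id)
    );
    """
)

# Corpus a..f and the four reports from the example in the issue.
docs = ["a", "b", "c", "d", "e", "f"]
doc_id = {name: i for i, name in enumerate(docs, start=1)}
conn.executemany(
    "INSERT INTO input_docs (id, name) VALUES (?, ?)",
    [(i, name) for name, i in doc_id.items()],
)
report_sources = {1: "abc", 2: "bcd", 3: "cde", 4: "def"}
for report_id, sources in report_sources.items():
    conn.execute(
        "INSERT INTO reports (id, title) VALUES (?, ?)",
        (report_id, f"report {report_id}"),
    )
    conn.executemany(
        "INSERT INTO reports_dependencies (dependent_id, main_id) VALUES (?, ?)",
        [(report_id, doc_id[name]) for name in sources],
    )


def reports_for_docs(conn, names):
    """Forward trace: reports that any of the named docs contributed to."""
    placeholders = ", ".join("?" for _ in names)
    rows = conn.execute(
        f"""
        SELECT DISTINCT dep.dependent_id
        FROM reports_dependencies AS dep
        JOIN input_docs AS d ON d.id = dep.main_id
        WHERE d.name IN ({placeholders})
        ORDER BY dep.dependent_id
        """,
        list(names),
    ).fetchall()
    return [row[0] for row in rows]


def docs_for_report(conn, report_id):
    """Backward trace: names of the docs a report was derived from."""
    rows = conn.execute(
        """
        SELECT d.name
        FROM reports_dependencies AS dep
        JOIN input_docs AS d ON d.id = dep.main_id
        WHERE dep.dependent_id = ?
        ORDER BY d.name
        """,
        (report_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(reports_for_docs(conn, ["d", "e", "f"]))  # [2, 3, 4]
print(docs_for_report(conn, 2))  # ['b', 'c', 'd']
```

Both directions are a single join on the dependency table, which is the main argument for keeping lineage in a database.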

garuna-m6 · 2024-10-20T15:19:08Z

Sorry for asking for explanations like a 5-year-old, but the docetl pipeline runs on demand. Is the expectation here to start a local sqlite server if it is set in the config, put all the logs in the db, then shut the server down when the pipeline finishes :/ ? Or to dump the logs for a sql server to read? Or are we expecting the server connection files to be present?

shreyashankar · 2024-10-21T01:14:56Z

No worries, sorry for the confusion! Sqlite doesn't require a separate server process: https://docs.python.org/3/library/sqlite3.html

So if the user specifies a path for the sqlite db in the config, we can create a db and populate relevant tables as the outputs are created.
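As a rough sketch of what that could look like with Python's built-in sqlite3 module: connecting to a file path creates the database file if it does not exist, with no server involved. The config key lineage_db_path, the items table, and open_lineage_db are hypothetical names chosen for this example, not existing DocETL settings.

```python
import os
import sqlite3
import tempfile


def open_lineage_db(config):
    """Open (and create if needed) the lineage db named in the config.

    "lineage_db_path" is a hypothetical config key used for illustration.
    Returns None when the user did not ask for lineage logging.
    """
    path = config.get("lineage_db_path")
    if path is None:
        return None
    # sqlite3.connect creates the file on disk; no server process is needed.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "id INTEGER PRIMARY KEY, op_name TEXT, payload TEXT)"
    )
    conn.commit()
    return conn


with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "lineage.sqlite")
    conn = open_lineage_db({"lineage_db_path": db_path})
    # Rows can be written as each operation produces its outputs.
    conn.execute(
        "INSERT INTO items (op_name, payload) VALUES (?, ?)",
        ("opname", '{"summary": "..."}'),
    )
    conn.commit()
    row_count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    conn.close()
    created = os.path.exists(db_path)
```

When the pipeline finishes, closing the connection is all the cleanup required; the file stays on disk for later queries.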

redhog · 2024-10-21T11:09:25Z

Is there a big reason to keep the lineage data out-of-band?

I'd rather save lineage info inside the items, so that an outside system that gets the final output dump has access to it directly (without a join). What's the drawback of doing that?

I think sqlite output is interesting in the context of #104 btw :)

Also, potentially for storing the intermediate data more efficiently?

shreyashankar · 2024-10-21T14:55:34Z

I think saving it to a database makes it significantly more queryable... otherwise, constructing forward traces will involve a bunch of for loops to go through the outputs and see which ids contain the source id. Similarly, constructing backward traces will require lots of wrangling.
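For comparison, here is roughly what a forward trace over inline lineage looks like for a single operation. The data is the report example from the top of the issue; forward_trace and the doc_id field are hypothetical names used only for this sketch.

```python
def forward_trace(outputs, lineage_field, key, source_values):
    """Return the outputs whose inline lineage mentions any of source_values."""
    wanted = set(source_values)
    hits = []
    for output in outputs:
        for entry in output.get(lineage_field, []):
            if entry.get(key) in wanted:
                hits.append(output)
                break  # one match is enough for this output
    return hits


# Reports 1-4 with their source docs stored inline.
reports = [
    {"report": 1, "opname_lineage": [{"doc_id": d} for d in "abc"]},
    {"report": 2, "opname_lineage": [{"doc_id": d} for d in "bcd"]},
    {"report": 3, "opname_lineage": [{"doc_id": d} for d in "cde"]},
    {"report": 4, "opname_lineage": [{"doc_id": d} for d in "def"]},
]

matching = [
    r["report"]
    for r in forward_trace(reports, "opname_lineage", "doc_id", ["d", "e", "f"])
]
print(matching)  # [2, 3, 4]
```

This scans every output on each query, and a multi-step pipeline would need the same loop repeated for each operation's lineage field.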

redhog · 2024-10-21T15:35:25Z

Well, that depends on what happens with the output. If it's just a json, yes. But if you insert it into something like elastic-search, then having the metadata / lineage inline is super useful. So maybe both?

If we had output plugins, and could write multiple outputs with different plugins, then this could be handled at the output stage:

pipeline:
  steps: ...
  output:
    - json:
      path: my-pipeline-output.json
    - sqlite:
      path: metadata.sqlite
      keys:
        - source-file
        - page
    - elasticsearch:
      url: http://localhost:9200/

shreyashankar · 2024-11-18T06:38:53Z

whoops, sorry I missed this. I like your operator spec, but I think supporting an elastic search integration as a plugin can be done later down the line. most people use DocETL locally, and I think the sqlite interface is a great start for them

shreyashankar added the enhancement (New feature or request) and request labels on Oct 11, 2024

This was referenced Oct 12, 2024

feat: add reduce operation lineage #101

Merged

#91 document > item renaming #103

Merged
