Streaming support #53

kellyjonbrazil · 2023-01-14T21:44:14Z

Enhancement to add streaming support so the entire JSON document doesn't need to be loaded to start processing.

Looks like the ijson library might handle a lot of this. I think I might be able to create a jello -S (for streaming) option that uses the ijson library to parse STDIN and return _ as a generator/iterator of JSON objects, whether the input is an array or a top-level series of JSON objects.


stevenpelley · 2024-12-03T15:38:23Z

Hi @kellyjonbrazil
Thank you for writing this (and jellex, jc)! I use jello as a near-total replacement for jq and need this feature. I know I didn't discuss this ahead of time but I took a stab at writing it. See the 3 most recent commits at https://github.com/stevenpelley/jello/tree/streaming

Please let me know if you'd be willing to discuss a PR and I'll open one. I'm also open to recommendations for organization, design choices, and style.

This also works as a workaround for #67, or at least leaves only a small amount of work to implement it.

The core changes:

- compile the user python query to an ast. Place the resulting module's body in a new function definition. If the last statement is a stand-alone expression, wrap it in a return statement. Compile, exec, and call the created function instead of exec'ing all but the last statement and then eval'ing the last statement. This allows return, yield, and yield from statements in the user python query. If the resulting function is a generator or returns an iterator, place the contents of the iterator into a list before serializing json.
- allow flattened, streamed ndjson (newline-delimited json, effectively equivalent to jsonl, just a different word/spec) output. New option -F. The result of the user python query must be a list, iterator, or generator; anything else is an error. Each item is json-serialized and formatted as it becomes available. The entire output set will not be held in memory.
- allow streaming input of ndjson. New option -S. Inputs must be ndjson. "_" is an iterator of the json objects. This ended up being a somewhat invasive change, as the "data" and "_" variables are now file streams/references and iterators instead of the actual json contents. Input is not read until the cli calls for pyquery or to format data, so exception handling has to change to track what work is actually being done when exceptions arise. This reorders some checks and data processing, so it's possible even for non-streaming commands to get a different error than previously.
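For readers who want to see the first change in miniature, here is a rough sketch of the wrap-in-a-function idea. It is not the code from the branch: the names `compile_query`, `run_query`, and `_user_query` are made up for illustration, and it assumes the query only needs `_` in scope.

```python
import ast
import json
from collections.abc import Iterator


def compile_query(query):
    """Wrap the statements of `query` in a function that takes `_` (hypothetical helper)."""
    body = ast.parse(query).body
    # Parsing a template yields a FunctionDef with every field the running
    # Python version expects; only its body is swapped out.
    template = ast.parse("def _user_query(_):\n    pass")
    func = template.body[0]
    if body:
        last = body[-1]
        # A trailing stand-alone expression becomes the function's return value.
        if isinstance(last, ast.Expr):
            body[-1] = ast.copy_location(ast.Return(value=last.value), last)
        func.body = body
    ast.fix_missing_locations(template)
    namespace = {}
    exec(compile(template, "<query>", "exec"), namespace)
    return namespace["_user_query"]


def run_query(query, data):
    """Call the wrapped query and serialize the result as json."""
    result = compile_query(query)(data)
    # Generators and other iterators are drained into a list before serializing.
    if isinstance(result, Iterator):
        result = list(result)
    return json.dumps(result)
```

Because the query body now lives inside a function, `run_query("for i in range(3):\n    yield i", None)` is valid even though `yield` would be rejected at module level.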

Relative to your proposal this only supports ndjson/jsonl. It does not stream a single large json doc as ijson might. This fits all of my use cases although others may have different needs.
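As an illustration of what streaming ndjson input amounts to, a lazy reader can be sketched in a few lines. This is an assumption-laden sketch rather than the implementation in the branch: `iter_ndjson` is a hypothetical name, blank lines are skipped by choice, and plain dicts are used instead of DotMap.

```python
import io
import json


def iter_ndjson(stream):
    """Yield one parsed json value per non-blank line, reading lazily (hypothetical helper)."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Usage: only one line is parsed and held at a time.
lines = io.StringIO('{"value":1}\n{"value":2}\n{"value":3}\n')
total = sum(item["value"] for item in iter_ndjson(lines))
```

Nothing is read from the stream until the generator is consumed, which is why errors surface later than they would with an eager load.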

Some examples:

flattening to generate data with little memory:

jello -e -c -F '({"value": i} for i in range(1000))'

the user python query, and therefore the resulting function, returns the generator expression which is streamed.
output:

{"value":0}
{"value":1}
{"value":2}
{"value":3}
{"value":4}
{"value":5}
{"value":6}
{"value":7}
{"value":8}
{"value":9}
...
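The flattened output behaviour can be approximated with a small writer that serializes each item as soon as the iterator produces it. This is only a sketch under assumptions: `write_ndjson` is a hypothetical name, and the type check mirrors the rule stated above (list, iterator, or generator) rather than the actual checks in the branch.

```python
import json
import sys
from collections.abc import Iterator


def write_ndjson(result, out=None):
    """Write each item of `result` as one compact json line (hypothetical helper)."""
    if out is None:
        out = sys.stdout
    if not isinstance(result, (list, Iterator)):
        raise TypeError("flattened output needs a list, iterator, or generator")
    for item in result:
        # Each item is written as soon as it is produced; nothing is accumulated.
        out.write(json.dumps(item, separators=(",", ":")) + "\n")
```

Passing a generator expression such as `({"value": i} for i in range(1000))` keeps memory use flat regardless of how many items are produced.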

reading the above input, streaming the input, and taking the sum of all the "value"s

jello -e -c -F '({"value": i} for i in range(1000))' | jello -S 'sum(item.value for item in _)'

DotMap works -- value may be accessed as item.value
output:

performing line-by-line transformation in a streaming manner. In this case add an attribute "is_even":

jello -e -c -F '({"value": i} for i in range(1000))' | jello -c -S -F '(item | {"is_even": item.value % 2 == 0} for item in _)'

output:

{"value":0,"is_even":true}
{"value":1,"is_even":false}
{"value":2,"is_even":true}
{"value":3,"is_even":false}
{"value":4,"is_even":true}
{"value":5,"is_even":false}
{"value":6,"is_even":true}
{"value":7,"is_even":false}
{"value":8,"is_even":true}
{"value":9,"is_even":false}
...

in general the "lines" behavior requested in #67 looks like

jello -F -S '(f(i) for i in _)'

where "f" is the transformation to be performed. This is also how I'd implement that feature (using python ast package), but at this point I think it's a marginal improvement.
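Put together, the per-line pattern is roughly equivalent to the following stand-alone sketch. The names `transform_lines` and `add_is_even` are invented for illustration, and plain dicts stand in for DotMap objects.

```python
import io
import json


def transform_lines(f, source, sink):
    """Apply `f` to each ndjson line of `source` and write the result to `sink`."""
    for line in source:
        if line.strip():
            item = f(json.loads(line))
            sink.write(json.dumps(item, separators=(",", ":")) + "\n")


def add_is_even(item):
    # Same transformation as the "is_even" example above.
    return {**item, "is_even": item["value"] % 2 == 0}


source = io.StringIO('{"value":0}\n{"value":1}\n')
sink = io.StringIO()
transform_lines(add_is_even, source, sink)
```

Each line is read, transformed, and written before the next one is touched, so neither the input nor the output is ever held in memory as a whole.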

kellyjonbrazil · 2024-12-04T02:31:26Z

Hi there - thank you so much for offering this contribution! I think this looks fantastic. Please do create a PR... I'd like to play around with this a bit. I think we should also include some updated documentation in the PR. Thanks for including the tests, too!

One thing to note - I'll do my best to troubleshoot/fix if any bugs are reported after release of this feature, but could I ping you for help if I get stuck?

kellyjonbrazil · 2024-12-04T20:41:51Z

Also, could you make sure to branch from dev?

stevenpelley · 2024-12-05T01:17:23Z

> Also, could you make sure to branch from dev?

Had already realized my mistake and just rebased. Streaming input doesn't support raw input, which I missed because I was working off of master.

PR is up at #69

> One thing to note - I'll do my best to troubleshoot/fix if any bugs are reported after release of this feature, but could I ping you for help if I get stuck?

Of course

kellyjonbrazil added the enhancement New feature or request label Jan 14, 2023

smammy mentioned this issue May 26, 2023

Enhancement: process each JSON Line separately #67

Open

stevenpelley mentioned this issue Dec 5, 2024

streaming ndjson input and streaming flattened ndjson output (#53) #69

Open
