Streaming support #53

Open
kellyjonbrazil opened this issue Jan 14, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@kellyjonbrazil
Owner

Enhancement to add streaming support so the entire JSON document doesn't need to be loaded to start processing.

Looks like the ijson library might handle a lot of this. I think I might be able to create a jello -S (for streaming) option that uses the ijson library to parse STDIN and return _ as a generator/iterator of JSON objects - whether it's an array or a top-level of JSON objects.

@stevenpelley

Hi @kellyjonbrazil
Thank you for writing this (and jellex, jc)! I use jello as a near-total replacement for jq and need this feature. I know I didn't discuss this ahead of time, but I took a stab at writing it. See the 3 most recent commits at https://github.com/stevenpelley/jello/tree/streaming

Please let me know if you'd be willing to discuss a PR and I'll open one. I'm also open to recommendations for organization, design choices, and style.

This also serves as a workaround for #67, or at least leaves only a small amount of work to implement it.

The core changes:

  1. Compile the user's Python query to an AST and place the resulting module's body in a new function definition. If the last statement is a stand-alone expression, wrap it in a return statement. Then compile, exec, and call the created function, instead of exec'ing all but the last statement and eval'ing the last one. This allows return, yield, and yield from statements in the user's query. If the resulting function is a generator or returns an iterator, its contents are collected into a list before being serialized to JSON.
  2. Allow flattened, streamed ndjson output (newline-delimited JSON, effectively equivalent to jsonl, just a different name/spec). New option -F. The result of the user's Python query must be a list, iterator, or generator; anything else is an error. Each item is JSON-serialized and formatted as soon as it is available, so the entire output set is never held in memory.
  3. Allow streaming input of ndjson. New option -S. Input must be ndjson, and "_" is an iterator over the JSON objects. This ended up being a somewhat invasive change, as the "data" and "_" variables are now file streams/references and iterators instead of the actual JSON contents. Input is not read until the cli calls pyquery or formats the data, so exception handling has to change to track what work is actually being done when an exception arises. This reorders some checks and data processing, so even non-streaming commands may raise a different error than before.
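
For readers who want to see the shape of change 1, here is a minimal sketch of the AST wrapping using only the standard library. It is not the code from the branch: the names compile_query, run_query, and _user_query are made up for illustration, and it assumes the query only needs "_" as input.

```python
import ast
from collections.abc import Iterator


def compile_query(query, func_name="_user_query"):
    """Wrap the query source in a function that takes `_` and return it."""
    body = ast.parse(query).body
    if body and isinstance(body[-1], ast.Expr):
        # A trailing stand-alone expression becomes the return value.
        last = body[-1]
        body[-1] = ast.copy_location(ast.Return(value=last.value), last)

    # Parse a template function and swap in the user's statements, which
    # avoids building the FunctionDef node by hand.
    wrapper = ast.parse(f"def {func_name}(_):\n    pass")
    if body:
        wrapper.body[0].body = body
    ast.fix_missing_locations(wrapper)

    namespace = {}
    exec(compile(wrapper, "<query>", "exec"), namespace)
    return namespace[func_name]


def run_query(query, data):
    """Call the wrapped query; materialize generators and iterators."""
    result = compile_query(query)(data)
    if isinstance(result, Iterator):  # generators are iterators too
        result = list(result)
    return result
```

With this shape, a query such as "for v in _:\n    yield v * 2" turns the wrapper into a generator function, while "return sum(_)" and a bare trailing expression both produce an ordinary return value.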

Relative to your proposal this only supports ndjson/jsonl. It does not stream a single large json doc as ijson might. This fits all of my use cases although others may have different needs.

Some examples:

flattening to generate data with little memory:

jello -e -c -F '({"value": i} for i in range(1000))'

the user python query, and therefore the resulting function, returns the generator expression which is streamed.
output:

{"value":0}
{"value":1}
{"value":2}
{"value":3}
{"value":4}
{"value":5}
{"value":6}
{"value":7}
{"value":8}
{"value":9}
...
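
As a rough illustration of what -F does with the query result, the sketch below serializes each item as soon as the iterator produces it. The function name write_ndjson is hypothetical and this is an assumption about the general approach, not the code in the branch.

```python
import io
import json
import sys
from collections.abc import Iterator


def write_ndjson(items, out=sys.stdout):
    """Write one compact JSON document per line, one item at a time."""
    # Per the description above: only lists, iterators, and generators
    # are accepted; anything else is an error.
    if not isinstance(items, (list, Iterator)):
        raise TypeError("flattened output needs a list, iterator, or generator")
    count = 0
    for item in items:
        out.write(json.dumps(item, separators=(",", ":")) + "\n")
        count += 1
    return count


# Small demo against an in-memory buffer instead of stdout.
demo_buf = io.StringIO()
demo_count = write_ndjson(({"value": i} for i in range(3)), demo_buf)
```

Because each item is written before the next one is requested, memory use stays bounded by the size of a single item rather than the whole result.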

reading the above input, streaming the input, and taking the sum of all the "value"s

jello -e -c -F '({"value": i} for i in range(1000))' | jello -S 'sum(item.value for item in _)'

DotMap works -- value may be accessed as item.value
output:

499500
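
The input side of -S can be pictured the same way. This is a sketch under assumptions, not the branch's implementation: iter_ndjson is a made-up name, blank lines are skipped as a convenience, and items are plain dicts, so the value is read as item["value"] rather than through DotMap attribute access.

```python
import io
import json


def iter_ndjson(stream):
    """Lazily yield one parsed JSON value per non-blank input line."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: tolerate blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_number}: {exc}") from exc


# Stand-in for STDIN: the same 1000 lines the first command produces.
demo_input = io.StringIO("".join(f'{{"value":{i}}}\n' for i in range(1000)))
demo_total = sum(item["value"] for item in iter_ndjson(demo_input))
```

Nothing is parsed until the consumer asks for the next item, which is also why errors surface later than they would with an up-front load.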

performing line-by-line transformation in a streaming manner. In this case add an attribute "is_even":

jello -e -c -F '({"value": i} for i in range(1000))' | jello -c -S -F '(item | {"is_even": item.value % 2 == 0} for item in _)'

output:

{"value":0,"is_even":true}
{"value":1,"is_even":false}
{"value":2,"is_even":true}
{"value":3,"is_even":false}
{"value":4,"is_even":true}
{"value":5,"is_even":false}
{"value":6,"is_even":true}
{"value":7,"is_even":false}
{"value":8,"is_even":true}
{"value":9,"is_even":false}
...
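
To show why the combination of -S and -F runs in constant memory, here is a small sketch that chains a lazy source into a lazy transformation. The function names are hypothetical, the source is deliberately infinite, and the dict union operator requires Python 3.9 or later.

```python
import itertools
import json


def number_lines():
    """An endless ndjson source: {"value":0}, {"value":1}, ..."""
    for i in itertools.count():
        yield json.dumps({"value": i}, separators=(",", ":"))


def add_is_even(lines):
    """Transform one line at a time, never holding the whole input."""
    for line in lines:
        item = json.loads(line)
        item = item | {"is_even": item["value"] % 2 == 0}
        yield json.dumps(item, separators=(",", ":"))


# Only three items are ever produced, even though the source never ends.
first_three = list(itertools.islice(add_is_even(number_lines()), 3))
```

If either stage collected its input into a list first, the call above would never return.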

in general the "lines" behavior requested in #67 looks like

jello -F -S '(f(i) for i in _)'

where "f" is the transformation to be performed. This is also how I'd implement that feature (using python ast package), but at this point I think it's a marginal improvement.

@kellyjonbrazil
Owner Author

Hi there - thank you so much for offering this contribution! I think this looks fantastic. Please do create a PR... I'd like to play around with this a bit. I think we should also include some updated documentation in the PR. Thanks for including the tests, too!

One thing to note - I'll do my best to troubleshoot/fix if any bugs are reported after release of this feature, but could I ping you for help if I get stuck?

@kellyjonbrazil
Owner Author

Also, could you make sure to branch from dev?

@stevenpelley

stevenpelley commented Dec 5, 2024

> Also, could you make sure to branch from dev?

Had already realized my mistake and just rebased. Streaming input doesn't support raw input, which I missed because I was working off of master.

PR is up at #69

> One thing to note - I'll do my best to troubleshoot/fix if any bugs are reported after release of this feature, but could I ping you for help if I get stuck?

Of course

2 participants