Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iterate over more than one prefix? #109

Open
hotaru355 opened this issue May 6, 2024 · 5 comments
Open

Iterate over more than one prefix? #109

hotaru355 opened this issue May 6, 2024 · 5 comments
Labels
question Further information is requested

Comments

@hotaru355
Copy link

hotaru355 commented May 6, 2024

Description
Using higher level interfaces, is there a way to iterate over more than one prefix?

Detailed description
First of all, thank you very much for this great library!

I am using ijson to transform a long JSON response from a graphQL server into python objects. Unfortunately, if the server encounters an error, the response status code is still 200 with the response body containing some JSON error message. So, the response can be in one of two formats:

# good response
{"data": [...]}

# bad response
{"errors": [...]}
  1. Is there a way to transform the data for a good response and raise an error in case there is a bad response?

  2. I know I can achieve the desired behaviour using the low-level parse function. The issues I have with this approach is that

    1. It feels like I have to reverse-engineer the higher level interface. Since I would like to process dicts, just like the high-level interfaces provide, iterating the parse results would need to build dictionaries out of token events
    2. I am concerned that I loose the performance benefit from running C instead of python code

    How bad would you say is the performance loss when building dictionaries from token events compared to the C implementation? Is it even worth using ijson in this case?

  3. If not already requested, I would like to follow up this question with a feature request that enhances the high-level interfaces to something like:

for prefix, item in ijson.items(f, {'earth.europe.item', 'earth.america.item'}):
    if prefix == 'earth.europe.item':
      do_something_with_european_country(item)
    elif prefix == 'earth.america.item':
      raise ValueError(f"Did not expect American country: {item}")

Why is this not clear from the documentation
The use case is not mentioned

@hotaru355 hotaru355 added the question Further information is requested label May 6, 2024
@rtobar
Copy link

rtobar commented May 7, 2024

@hotaru355 I'm currently just messaging to acknowledge that I've seen this message and I'm aware of it. I'm however unable to properly reply for a couple of weeks, so don't expect any action on this immediately. Thanks!

@rtobar
Copy link

rtobar commented Jun 7, 2024

An update, a month later, an update: Yes, the idea of having a high-level function allowing users to obtain items from more than one prefix is something that has long been asked for, one way or another. Yesterday I released ijson 3.3.0 with the latest master, and I have a few janitorial tasks I'd like to undertake before tackling this feature, which is something I think would be slightly popular indeed.

I'll keep updating this issue whenever I have something to share, but once again don't hold your breath.

@wodny
Copy link

wodny commented Nov 30, 2024

Hi, I've looked at the code to understand basic concepts of the library and done some experiments. Thought that maybe I could help a bit with the implementation. Before I commit to producing a pull request I would like to ask what is the preferred way to add the feature.

I've focused on the yajl2_c backend.

First change I made was to allow both str and set as the argument to items(). I used items() because it was the easiest to experiment with. Inside items_basecoro_send_impl() I changed PyObject_RichCompareBool() to PySet_Contains(). I've also started yielding a tuple of (prefix, value), not just the value. An actual change would have to keep the original form for backwards compatibility.

This brings me to the actual question - do you think that the pipeline for items() that currently looks like:

	pipeline_node coro_pipeline[] = {
		{&ItemsBasecoro_Type, items_args, NULL},
		{&ParseBasecoro_Type, NULL, NULL},
		{&BasicParseBasecoro_Type, NULL, kwargs},
		{NULL}
	};
  1. should become:
	pipeline_node coro_pipeline[] = {
		{&StripPath_Type, NULL, NULL}, // leave just the value, strip the path
		{&ItemsBasecoro_Type, items_args, NULL},
		{&ParseBasecoro_Type, NULL, NULL},
		{&BasicParseBasecoro_Type, NULL, kwargs},
		{NULL}
	};

with the downside of items() creating tuples it won't finally use (the new multiple_items() would use the old pipeline without stripping)

or

  1. should items_basecoro.c get a sibling like multiple_items_basecoro.c that would mostly copy the code with the exception of getting a set/Iterable as input and (prefix, value) tuples as output.

Both solutions would require items.c to get a sibling like multiple_items.c or items() to accept a param to select type of output. The former seems cleaner.

Another variant would be to return sub-generators that allow something like:

for prefix, items in ijson.multiple_items(f, ["foo.item", "bar.item"]):
    for item in items:
        ...

kvitems would get an analogous solution.

@rtobar
Copy link

rtobar commented Dec 1, 2024

@wodny first of all, thanks a lot for your efforts! I seldom see people delving deep into the library, let alone the C backend, so I'm happy to see someone giving it a shot.

Personally, whenever I add a feature to the library, I do it first for the pure-python backends (basically in ijson/common.py), as it's easier to try concepts there and iterate. But maybe you already did that.

First change I made was to allow both str and set as the argument to items()

Maybe you can be more generous and accept any iterable of strings, then put it internally in a set (and warn if repeated values were given)

the actual question ... the pipeline for items()

The answer is "it depends", as usual, althogh my first reaction would be to go for an option 3, where items gets a multi_items sibling, but that items can internally call directly to avoid a tuple being created only to deconstruct it and extract its second value.

I went to great lengths though the yajl2_c backend to avoid this kind of wasteful tuple creation + unpacking combo when passing the values from one level to the next. I can't recall the actual difference from the benchmarks (do run those! See the README), but it was big enough for me to care about this detail.

goes find a link

Ok, in the penultimate paragraph if https://github.com/ICRAR/ijson/blob/master/notes/design_notes.rst it says that the C backend got 25% faster (median across different dimensions) when avoiding those tuples plus other things. So yes, definitely worth trying to avoid those I'd say.

Having said that, I also realise that option 1 is way easier though. And also thinking more about it, it might be a better strategy to go with it actually. If you write it, we could see how performance decreases for items(), and then we make the decision on what to do for the longer-term.

Once again, thanks a lot for the work you're doing. Let me know if you need more guidance in other parts of the codebase. I have myself a few changes committed locally that I haven't polished enough to push yet, but hopefully won't be clashing with yours. Also, today I'm answering, but I might not be able to actually review code for at least two weeks, and then it's Xmas, NY and all of that... Busy time of the year.

@wodny
Copy link

wodny commented Dec 3, 2024

Personally, whenever I add a feature to the library, I do it first for the pure-python backends (basically in ijson/common.py), as it's easier to try concepts there and iterate. But maybe you already did that.

I thought that I would be able to write Python code one way or another but I have never really used the Python C API for anything serious yet so I wanted to check if I was able to write anything working before I offered to prepare some proposals.

I created the #127 pull request including a description of what I did.

Also, today I'm answering, but I might not be able to actually review code for at least two weeks, and then it's Xmas, NY and all of that... Busy time of the year.

I struggle to find time for fun stuff like this myself. From my point of view there is no hurry. Probably batching tasks is more important than the overall time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants