Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: iterate over more than one prefix (#109) #127

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

wodny
Copy link

@wodny wodny commented Dec 3, 2024

I prepared some proposals to check if the way I tackle the problem from #109 is acceptable.

I implemented the changes in Python and yajl2_c backends.

The commit stack contains two variants:

  • the "revert using a set for prefix in items" commit preserves the performance of the original code but does not allow multiple prefixes in items(),
  • one commit earlier, "use set of strings instead of a single string for items prefix", uses sets for both items() and the new prefixed_items(), but has a drawback of performing 10% worse than the original.

I benchmarked the code both using defaults and some input I parse in production which is a map of very long lists of maps for which using items() is essential.

Both versions use items_common.h which contains static inline functions used by items() and the new prefixed_items(). An analogy to the builder.h file. The code is all that I was able to move from items_basecoro.c. Because the code is shared, in the non-set-items() variant a macro is used to define the comparison method (PyObject_RichCompareBool vs PySet_Contains). The shared function is divided into top and bottom halves so the calling side can do a specialized CORO_SEND in between and avoid mangling a generic tuple. Division into halves required one additional state information - pending_builder_reset.

Before I did the items_common.h I tried extending pipelines but the built-in tests changed my mind. All elements of the pipelines have input and output formats already defined and eg. the items basecoro is tested directly with events as inputs.

I think I managed to avoid additional tuple creation/destruction with the exception of the pure Python code where I assumed it won't matter. I can always write two specialized functions.

I extended the list of ifs in parse_basecoro.c to detect the new PrefixedItemsBasecoro:

else if (PrefixedItemsBasecoro_Check(gen->target_send)) {
        prefixed_items_basecoro_send_impl(gen->target_send, prefix, event, value);
}

Everything else is just copying items*.{c,h} as a template and adding the prefixed prefix ;) I made it in a separate commit ("add prefixed_items") because it's the biggest one and I didn't want to mix it with the actual changes.

All tests (including memleaks) pass without modification.

I did not touch the kvitems() part yet nor did I write any tests for prefixed_items() in case you decide the changes should be done in a completely different manner.

Summary by Sourcery

Add support for iterating over multiple prefixes in the ijson library, enhancing data parsing capabilities. This includes new coroutine and generator implementations for prefixed_items, with shared code to maintain efficiency.

New Features:

  • Introduce the ability to iterate over multiple prefixes in the ijson library, allowing for more flexible data parsing.

Enhancements:

  • Refactor the coroutine and generator implementations to support the new prefixed_items functionality, sharing common code between items and prefixed_items.

Copy link

sourcery-ai bot commented Dec 3, 2024

Reviewer's Guide by Sourcery

This PR adds support for iterating over multiple prefixes in ijson by introducing a new prefixed_items functionality. The implementation includes both Python and C backend changes, with shared code between items and prefixed_items functionality to maintain code reusability.

Class diagram for the new prefixed_items functionality

classDiagram
    class ItemsCommonBasecoro {
        +builder_t builder
        +int pending_builder_reset
        +PyObject *target_send
        +PyObject *prefix
        +int object_depth
        +yajl2_state *module_state
    }
    class PrefixedItemsBasecoro {
        +PyObject *target_send
        +PyObject *prefix
        +int object_depth
        +int pending_builder_reset
        +builder_t builder
        +yajl2_state *module_state
        +PyObject* prefixed_items_basecoro_send_impl(PyObject *self, PyObject *path, PyObject *event, PyObject *value)
    }
    class ItemsBasecoro {
        +PyObject *target_send
        +PyObject *prefix
        +int object_depth
        +int pending_builder_reset
        +builder_t builder
        +yajl2_state *module_state
    }
    ItemsCommonBasecoro <|-- PrefixedItemsBasecoro
    ItemsCommonBasecoro <|-- ItemsBasecoro
    PrefixedItemsBasecoro <|-- PrefixedItemsGen
    PrefixedItemsBasecoro <|-- PrefixedItemsAsync
    class PrefixedItemsGen {
        +reading_generator_t reading_gen
    }
    class PrefixedItemsAsync {
        +async_reading_generator *reading_generator
    }
Loading

File-Level Changes

Change Details Files
Added new prefixed_items functionality to support multiple prefix iteration
  • Created new prefixed_items_basecoro implementation for handling multiple prefixes
  • Added support for both string and set-based prefix matching
  • Implemented prefixed_items generator and async iterator classes
  • Extended the backend capabilities to include prefixed_items support
src/ijson/__init__.py
src/ijson/common.py
src/ijson/utils35.py
src/ijson/backends/yajl2_c.py
Refactored common code between items and prefixed_items implementations
  • Created items_common.h to share code between items and prefixed_items
  • Split item processing logic into top and bottom halves for better code reuse
  • Introduced pending_builder_reset flag to handle builder state management
src/ijson/backends/ext/_yajl2/items_common.h
src/ijson/backends/ext/_yajl2/items_basecoro.c
src/ijson/backends/ext/_yajl2/items_basecoro.h
Added C backend implementation for prefixed_items
  • Created new C files for prefixed_items functionality
  • Implemented prefix matching using PySet_Contains for multiple prefixes
  • Added support for async operations in prefixed_items
  • Updated module initialization to include new prefixed_items types
src/ijson/backends/ext/_yajl2/prefixed_items.c
src/ijson/backends/ext/_yajl2/prefixed_items.h
src/ijson/backends/ext/_yajl2/prefixed_items_async.c
src/ijson/backends/ext/_yajl2/prefixed_items_async.h
src/ijson/backends/ext/_yajl2/prefixed_items_basecoro.c
src/ijson/backends/ext/_yajl2/prefixed_items_basecoro.h
src/ijson/backends/ext/_yajl2/module.c

Possibly linked issues


Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time. You can also use
    this command to specify where the summary should be inserted.

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @wodny - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

N_N(retval = items_common_send_top(self, path, event, value));
if (retval == Py_None)
Py_RETURN_NONE;
PyObject *tuple = PyTuple_Pack(2, path, retval);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): Add NULL check after PyTuple_Pack

PyTuple_Pack can return NULL on error. Should check the return value and handle the error case appropriately.

Suggested change
PyObject *tuple = PyTuple_Pack(2, path, retval);
PyObject *tuple = PyTuple_Pack(2, path, retval);
if (tuple == NULL) {
return NULL;
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, but it is not the only place with lacking checks (kvitems_basecoro.c, parse_basecoro.c). Probably a separate commit correcting all of them at once would be beneficial.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created a GitHub issue for your comment: #128

@@ -244,6 +252,14 @@ def _items_pipeline(backend, prefix, map_type, config):
)


def _prefixed_items_pipeline(backend, prefix, map_type, config):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (complexity): Consider consolidating the parallel implementations into a single core prefixed pipeline with wrappers for non-prefixed functionality

The parallel implementations for prefixed and non-prefixed items create unnecessary complexity. We can simplify by making the prefixed version the core implementation and removing duplicate pipeline definitions. Here's how:

def _prefixed_items_pipeline(backend, prefix, map_type, config):
    return (
        (backend['prefixed_items_basecoro'], (prefix,), {'map_type': map_type}),
        (backend['parse_basecoro'], [], {}),
        (backend['basic_parse_basecoro'], [], config)
    )

def _items_pipeline(backend, prefix, map_type, config):
    # Reuse prefixed pipeline with wrapper to strip prefix
    @utils.coroutine
    def strip_prefix(target):
        while True:
            prefix, value = (yield)
            target.send(value)

    return _prefixed_items_pipeline(backend, prefix, map_type, config)[:-1] + \
           ((strip_prefix, [], {}),)

This approach:

  1. Makes prefixed_items_pipeline the core implementation
  2. Removes duplicate pipeline logic
  3. Maintains all functionality while reducing code complexity
  4. Keeps the separation between prefixed and non-prefixed interfaces

The same pattern can be applied to the generator and async implementations.

Copy link

@rtobar rtobar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wodny as promised in #109, I took some time to come back to this. Once again, thanks a lot! I was now able to read your description, and have a cursory look at the changes. I'll kick CI now to run, which might flag any potential obvious issues. In the meantime I have only some minor comments for things I saw on passing. From what I saw, I think the solution looks good, although I do wonder whether there's more code we can unify -- that can be done later though.

I missed seeing some tests. The test_items.py test suite should be a good starting point, and you'll probably want to add some tests that are specific to the new functionality. There's also a test that should be added to test_misc.py for the new API entry point.

You said you benchmarked this, so presumably you have some changes for benchmark.py too? And hopefully for dump.py as well -- both of which also have respective tests to be added. All of these should be fairly straightforward though. If you don't feel like adding them, I can do it after the fact. The tests are a must though, otherwise we can't really be sure things are working as they should.

@@ -49,11 +49,13 @@ def _default_backend():
parse_coro = backend.parse_coro
items = backend.items
items_coro = backend.items_coro
prefixed_items = backend.prefixed_items
prefixed_items_coro = backend.prefixed_items_coro
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the async variant is missing

PyObject *retval;
N_N(retval = items_common_send_top(self, path, event, value));
if (retval == Py_None)
Py_RETURN_NONE;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is incrementing a reference to None unnecessarily, you could just return retval

@@ -49,11 +49,13 @@ def _default_backend():
parse_coro = backend.parse_coro
items = backend.items
items_coro = backend.items_coro
prefixed_items = backend.prefixed_items
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good place as any to discuss the name. I'm not convinced by "prefixed items" doesn't convey the fact that now you can specify multiple prefixes. I originally thought of "multi_items" or something along those lines, to make it clearer that there was no restriction on the number of prefixes.

@rtobar
Copy link

rtobar commented Dec 24, 2024

@wodny while you're here, could you rebase on top of the latest master? I updated the CI infra in the meanwhile and macos runners should be running correctly now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants