Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: iterate over more than one prefix (#109) #127

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

wodny
Copy link

@wodny wodny commented Dec 3, 2024

I prepared some proposals to check if the way I tackle the problem from #109 is acceptable.

I implemented the changes in Python and yajl2_c backends.

The commit stack contains two variants:

  • the "revert using a set for prefix in items" commit preserves the performance of the original code but does not allow multiple prefixes in items(),
  • one commit earlier, "use set of strings instead of a single string for items prefix", uses sets for both items() and the new prefixed_items(), but has a drawback of performing 10% worse than the original.

I benchmarked the code both using defaults and some input I parse in production which is a map of very long lists of maps for which using items() is essential.

Both versions use items_common.h which contains static inline functions used by items() and the new prefixed_items(). An analogy to the builder.h file. The code is all that I was able to move from items_basecoro.c. Because the code is shared, in the non-set-items() variant a macro is used to define the comparison method (PyObject_RichCompareBool vs PySet_Contains). The shared function is divided into top and bottom halves so the calling side can do a specialized CORO_SEND in between and avoid mangling a generic tuple. Division into halves required one additional state information - pending_builder_reset.

Before I did the items_common.h I tried extending pipelines but the built-in tests changed my mind. All elements of the pipelines have input and output formats already defined and eg. the items basecoro is tested directly with events as inputs.

I think I managed to avoid additional tuple creation/destruction with the exception of the pure Python code where I assumed it won't matter. I can always write two specialized functions.

I extended the list of ifs in parse_basecoro.c to detect the new PrefixedItemsBasecoro:

else if (PrefixedItemsBasecoro_Check(gen->target_send)) {
        prefixed_items_basecoro_send_impl(gen->target_send, prefix, event, value);
}

Everything else is just copying items*.{c,h} as a template and adding the prefixed prefix ;) I made it in a separate commit ("add prefixed_items") because it's the biggest one and I didn't want to mix it with the actual changes.

All tests (including memleaks) pass without modification.

I did not touch the kvitems() part yet nor did I write any tests for prefixed_items() in case you decide the changes should be done in a completely different manner.

Summary by Sourcery

Add support for iterating over multiple prefixes in the ijson library, enhancing data parsing capabilities. This includes new coroutine and generator implementations for prefixed_items, with shared code to maintain efficiency.

New Features:

  • Introduce the ability to iterate over multiple prefixes in the ijson library, allowing for more flexible data parsing.

Enhancements:

  • Refactor the coroutine and generator implementations to support the new prefixed_items functionality, sharing common code between items and prefixed_items.

Copy link

sourcery-ai bot commented Dec 3, 2024

Reviewer's Guide by Sourcery

This PR adds support for iterating over multiple prefixes in ijson by introducing a new prefixed_items functionality. The implementation includes both Python and C backend changes, with shared code between items and prefixed_items functionality to maintain code reusability.

Class diagram for the new prefixed_items functionality

classDiagram
    class ItemsCommonBasecoro {
        +builder_t builder
        +int pending_builder_reset
        +PyObject *target_send
        +PyObject *prefix
        +int object_depth
        +yajl2_state *module_state
    }
    class PrefixedItemsBasecoro {
        +PyObject *target_send
        +PyObject *prefix
        +int object_depth
        +int pending_builder_reset
        +builder_t builder
        +yajl2_state *module_state
        +PyObject* prefixed_items_basecoro_send_impl(PyObject *self, PyObject *path, PyObject *event, PyObject *value)
    }
    class ItemsBasecoro {
        +PyObject *target_send
        +PyObject *prefix
        +int object_depth
        +int pending_builder_reset
        +builder_t builder
        +yajl2_state *module_state
    }
    ItemsCommonBasecoro <|-- PrefixedItemsBasecoro
    ItemsCommonBasecoro <|-- ItemsBasecoro
    PrefixedItemsBasecoro <|-- PrefixedItemsGen
    PrefixedItemsBasecoro <|-- PrefixedItemsAsync
    class PrefixedItemsGen {
        +reading_generator_t reading_gen
    }
    class PrefixedItemsAsync {
        +async_reading_generator *reading_generator
    }
Loading

File-Level Changes

Change Details Files
Added new prefixed_items functionality to support multiple prefix iteration
  • Created new prefixed_items_basecoro implementation for handling multiple prefixes
  • Added support for both string and set-based prefix matching
  • Implemented prefixed_items generator and async iterator classes
  • Extended the backend capabilities to include prefixed_items support
src/ijson/__init__.py
src/ijson/common.py
src/ijson/utils35.py
src/ijson/backends/yajl2_c.py
Refactored common code between items and prefixed_items implementations
  • Created items_common.h to share code between items and prefixed_items
  • Split item processing logic into top and bottom halves for better code reuse
  • Introduced pending_builder_reset flag to handle builder state management
src/ijson/backends/ext/_yajl2/items_common.h
src/ijson/backends/ext/_yajl2/items_basecoro.c
src/ijson/backends/ext/_yajl2/items_basecoro.h
Added C backend implementation for prefixed_items
  • Created new C files for prefixed_items functionality
  • Implemented prefix matching using PySet_Contains for multiple prefixes
  • Added support for async operations in prefixed_items
  • Updated module initialization to include new prefixed_items types
src/ijson/backends/ext/_yajl2/prefixed_items.c
src/ijson/backends/ext/_yajl2/prefixed_items.h
src/ijson/backends/ext/_yajl2/prefixed_items_async.c
src/ijson/backends/ext/_yajl2/prefixed_items_async.h
src/ijson/backends/ext/_yajl2/prefixed_items_basecoro.c
src/ijson/backends/ext/_yajl2/prefixed_items_basecoro.h
src/ijson/backends/ext/_yajl2/module.c

Possibly linked issues


Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time. You can also use
    this command to specify where the summary should be inserted.

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @wodny - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

N_N(retval = items_common_send_top(self, path, event, value));
if (retval == Py_None)
Py_RETURN_NONE;
PyObject *tuple = PyTuple_Pack(2, path, retval);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): Add NULL check after PyTuple_Pack

PyTuple_Pack can return NULL on error. Should check the return value and handle the error case appropriately.

Suggested change
PyObject *tuple = PyTuple_Pack(2, path, retval);
PyObject *tuple = PyTuple_Pack(2, path, retval);
if (tuple == NULL) {
return NULL;
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, but it is not the only place with lacking checks (kvitems_basecoro.c, parse_basecoro.c). Probably a separate commit correcting all of them at once would be beneficial.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created a GitHub issue for your comment: #128

@@ -244,6 +252,14 @@ def _items_pipeline(backend, prefix, map_type, config):
)


def _prefixed_items_pipeline(backend, prefix, map_type, config):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (complexity): Consider consolidating the parallel implementations into a single core prefixed pipeline with wrappers for non-prefixed functionality

The parallel implementations for prefixed and non-prefixed items create unnecessary complexity. We can simplify by making the prefixed version the core implementation and removing duplicate pipeline definitions. Here's how:

def _prefixed_items_pipeline(backend, prefix, map_type, config):
    return (
        (backend['prefixed_items_basecoro'], (prefix,), {'map_type': map_type}),
        (backend['parse_basecoro'], [], {}),
        (backend['basic_parse_basecoro'], [], config)
    )

def _items_pipeline(backend, prefix, map_type, config):
    # Reuse prefixed pipeline with wrapper to strip prefix
    @utils.coroutine
    def strip_prefix(target):
        while True:
            prefix, value = (yield)
            target.send(value)

    return _prefixed_items_pipeline(backend, prefix, map_type, config)[:-1] + \
           ((strip_prefix, [], {}),)

This approach:

  1. Makes prefixed_items_pipeline the core implementation
  2. Removes duplicate pipeline logic
  3. Maintains all functionality while reducing code complexity
  4. Keeps the separation between prefixed and non-prefixed interfaces

The same pattern can be applied to the generator and async implementations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant