feat: iterate over more than one prefix (#109) #127

wodny · 2024-12-03T21:53:07Z

I prepared some proposals to check if the way I tackle the problem from #109 is acceptable.

I implemented the changes in Python and yajl2_c backends.

The commit stack contains two variants:

the "revert using a set for prefix in items" commit preserves the performance of the original code but does not allow multiple prefixes in items(),
one commit earlier, "use set of strings instead of a single string for items prefix", uses sets for both items() and the new prefixed_items(), but has a drawback of performing 10% worse than the original.

I benchmarked the code both using defaults and some input I parse in production which is a map of very long lists of maps for which using items() is essential.

Both versions use items_common.h which contains static inline functions used by items() and the new prefixed_items(). An analogy to the builder.h file. The code is all that I was able to move from items_basecoro.c. Because the code is shared, in the non-set-items() variant a macro is used to define the comparison method (PyObject_RichCompareBool vs PySet_Contains). The shared function is divided into top and bottom halves so the calling side can do a specialized CORO_SEND in between and avoid mangling a generic tuple. Division into halves required one additional state information - pending_builder_reset.

Before I did the items_common.h I tried extending pipelines but the built-in tests changed my mind. All elements of the pipelines have input and output formats already defined and eg. the items basecoro is tested directly with events as inputs.

I think I managed to avoid additional tuple creation/destruction with the exception of the pure Python code where I assumed it won't matter. I can always write two specialized functions.

I extended the list of ifs in parse_basecoro.c to detect the new PrefixedItemsBasecoro:

else if (PrefixedItemsBasecoro_Check(gen->target_send)) {
        prefixed_items_basecoro_send_impl(gen->target_send, prefix, event, value);
}

Everything else is just copying items*.{c,h} as a template and adding the prefixed prefix ;) I made it in a separate commit ("add prefixed_items") because it's the biggest one and I didn't want to mix it with the actual changes.

All tests (including memleaks) pass without modification.

I did not touch the kvitems() part yet nor did I write any tests for prefixed_items() in case you decide the changes should be done in a completely different manner.

Summary by Sourcery

Add support for iterating over multiple prefixes in the ijson library, enhancing data parsing capabilities. This includes new coroutine and generator implementations for prefixed_items, with shared code to maintain efficiency.

New Features:

Introduce the ability to iterate over multiple prefixes in the ijson library, allowing for more flexible data parsing.

Enhancements:

Refactor the coroutine and generator implementations to support the new prefixed_items functionality, sharing common code between items and prefixed_items.

prepare to create prefixed_items_basecoro using the same code

keep set for prefixed_items

sourcery-ai · 2024-12-03T21:53:11Z

Reviewer's Guide by Sourcery

This PR adds support for iterating over multiple prefixes in ijson by introducing a new prefixed_items functionality. The implementation includes both Python and C backend changes, with shared code between items and prefixed_items functionality to maintain code reusability.

Class diagram for the new prefixed_items functionality

classDiagram
    class ItemsCommonBasecoro {
        +builder_t builder
        +int pending_builder_reset
        +PyObject *target_send
        +PyObject *prefix
        +int object_depth
        +yajl2_state *module_state
    }
    class PrefixedItemsBasecoro {
        +PyObject *target_send
        +PyObject *prefix
        +int object_depth
        +int pending_builder_reset
        +builder_t builder
        +yajl2_state *module_state
        +PyObject* prefixed_items_basecoro_send_impl(PyObject *self, PyObject *path, PyObject *event, PyObject *value)
    }
    class ItemsBasecoro {
        +PyObject *target_send
        +PyObject *prefix
        +int object_depth
        +int pending_builder_reset
        +builder_t builder
        +yajl2_state *module_state
    }
    ItemsCommonBasecoro <|-- PrefixedItemsBasecoro
    ItemsCommonBasecoro <|-- ItemsBasecoro
    PrefixedItemsBasecoro <|-- PrefixedItemsGen
    PrefixedItemsBasecoro <|-- PrefixedItemsAsync
    class PrefixedItemsGen {
        +reading_generator_t reading_gen
    }
    class PrefixedItemsAsync {
        +async_reading_generator *reading_generator
    }

File-Level Changes

Change	Details	Files
Added new prefixed_items functionality to support multiple prefix iteration	Created new prefixed_items_basecoro implementation for handling multiple prefixes Added support for both string and set-based prefix matching Implemented prefixed_items generator and async iterator classes Extended the backend capabilities to include prefixed_items support	`src/ijson/__init__.py` `src/ijson/common.py` `src/ijson/utils35.py` `src/ijson/backends/yajl2_c.py`
Refactored common code between items and prefixed_items implementations	Created items_common.h to share code between items and prefixed_items Split item processing logic into top and bottom halves for better code reuse Introduced pending_builder_reset flag to handle builder state management	`src/ijson/backends/ext/_yajl2/items_common.h` `src/ijson/backends/ext/_yajl2/items_basecoro.c` `src/ijson/backends/ext/_yajl2/items_basecoro.h`
Added C backend implementation for prefixed_items	Created new C files for prefixed_items functionality Implemented prefix matching using PySet_Contains for multiple prefixes Added support for async operations in prefixed_items Updated module initialization to include new prefixed_items types	`src/ijson/backends/ext/_yajl2/prefixed_items.c` `src/ijson/backends/ext/_yajl2/prefixed_items.h` `src/ijson/backends/ext/_yajl2/prefixed_items_async.c` `src/ijson/backends/ext/_yajl2/prefixed_items_async.h` `src/ijson/backends/ext/_yajl2/prefixed_items_basecoro.c` `src/ijson/backends/ext/_yajl2/prefixed_items_basecoro.h` `src/ijson/backends/ext/_yajl2/module.c`

Possibly linked issues

Iterate over more than one prefix? #109: The PR implements a solution to iterate over multiple prefixes, addressing the issue's request.

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time. You can also use
this command to specify where the summary should be inserted.

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey @wodny - I've reviewed your changes and they look great!

Here's what I looked at during the review

🟡 General issues: 1 issue found
🟢 Security: all looks good
🟢 Testing: all looks good
🟡 Complexity: 1 issue found
🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2024-12-03T21:54:07Z

src/ijson/backends/ext/_yajl2/prefixed_items_basecoro.c

+	N_N(retval = items_common_send_top(self, path, event, value));
+	if (retval == Py_None)
+		Py_RETURN_NONE;
+	PyObject *tuple = PyTuple_Pack(2, path, retval);


suggestion (bug_risk): Add NULL check after PyTuple_Pack

PyTuple_Pack can return NULL on error. Should check the return value and handle the error case appropriately.

Suggested change

PyObject *tuple = PyTuple_Pack(2, path, retval);

PyObject *tuple = PyTuple_Pack(2, path, retval);

if (tuple == NULL) {

return NULL;

}

This is correct, but it is not the only place with lacking checks (kvitems_basecoro.c, parse_basecoro.c). Probably a separate commit correcting all of them at once would be beneficial.

I have created a GitHub issue for your comment: #128

sourcery-ai · 2024-12-03T21:54:07Z

src/ijson/common.py

@@ -244,6 +252,14 @@ def _items_pipeline(backend, prefix, map_type, config):
    )


+def _prefixed_items_pipeline(backend, prefix, map_type, config):


issue (complexity): Consider consolidating the parallel implementations into a single core prefixed pipeline with wrappers for non-prefixed functionality

The parallel implementations for prefixed and non-prefixed items create unnecessary complexity. We can simplify by making the prefixed version the core implementation and removing duplicate pipeline definitions. Here's how:

def _prefixed_items_pipeline(backend, prefix, map_type, config): return ( (backend['prefixed_items_basecoro'], (prefix,), {'map_type': map_type}), (backend['parse_basecoro'], [], {}), (backend['basic_parse_basecoro'], [], config) ) def _items_pipeline(backend, prefix, map_type, config): # Reuse prefixed pipeline with wrapper to strip prefix @utils.coroutine def strip_prefix(target): while True: prefix, value = (yield) target.send(value) return _prefixed_items_pipeline(backend, prefix, map_type, config)[:-1] + \ ((strip_prefix, [], {}),)

This approach:

Makes prefixed_items_pipeline the core implementation

Removes duplicate pipeline logic

Maintains all functionality while reducing code complexity

Keeps the separation between prefixed and non-prefixed interfaces

The same pattern can be applied to the generator and async implementations.

rtobar

@wodny as promised in #109, I took some time to come back to this. Once again, thanks a lot! I was now able to read your description, and have a cursory look at the changes. I'll kick CI now to run, which might flag any potential obvious issues. In the meantime I have only some minor comments for things I saw on passing. From what I saw, I think the solution looks good, although I do wonder whether there's more code we can unify -- that can be done later though.

I missed seeing some tests. The test_items.py test suite should be a good starting point, and you'll probably want to add some tests that are specific to the new functionality. There's also a test that should be added to test_misc.py for the new API entry point.

You said you benchmarked this, so presumably you have some changes for benchmark.py too? And hopefully for dump.py as well -- both of which also have respective tests to be added. All of these should be fairly straightforward though. If you don't feel like adding them, I can do it after the fact. The tests are a must though, otherwise we can't really be sure things are working as they should.

rtobar · 2024-12-23T12:37:32Z

src/ijson/__init__.py

@@ -49,11 +49,13 @@ def _default_backend():
 parse_coro = backend.parse_coro
 items = backend.items
 items_coro = backend.items_coro
+prefixed_items = backend.prefixed_items
+prefixed_items_coro = backend.prefixed_items_coro


the async variant is missing

rtobar · 2024-12-23T12:39:31Z

src/ijson/backends/ext/_yajl2/items_basecoro.c

+	PyObject *retval;
+	N_N(retval = items_common_send_top(self, path, event, value));
+	if (retval == Py_None)
+		Py_RETURN_NONE;


I think this is incrementing a reference to None unnecessarily, you could just return retval

rtobar · 2024-12-23T16:42:26Z

src/ijson/__init__.py

@@ -49,11 +49,13 @@ def _default_backend():
 parse_coro = backend.parse_coro
 items = backend.items
 items_coro = backend.items_coro
+prefixed_items = backend.prefixed_items


This is a good place as any to discuss the name. I'm not convinced by "prefixed items" doesn't convey the fact that now you can specify multiple prefixes. I originally thought of "multi_items" or something along those lines, to make it clearer that there was no restriction on the number of prefixes.

rtobar · 2024-12-24T00:46:16Z

@wodny while you're here, could you rebase on top of the latest master? I updated the CI infra in the meanwhile and macos runners should be running correctly now

wodny added 6 commits December 1, 2024 22:56

move items_basecoro send implementation into a common unit

9b67055

prepare to create prefixed_items_basecoro using the same code

add vim swap files to .gitignore

2417743

add prefixed_items

0f07f17

deduplicate python implementation of (prefixed_)items_basecoro

e48ac48

use set of strings instead of a single string for items prefix

78688c8

revert using a set for prefix in items

c8b6b78

keep set for prefixed_items

sourcery-ai bot reviewed Dec 3, 2024

View reviewed changes

wodny mentioned this pull request Dec 3, 2024

Iterate over more than one prefix? #109

Open

sourcery-ai bot mentioned this pull request Dec 3, 2024

Add NULL checks after PyTuple_Pack calls to prevent potential errors #128

Closed

rtobar reviewed Dec 23, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: iterate over more than one prefix (#109) #127

feat: iterate over more than one prefix (#109) #127

wodny commented Dec 3, 2024 •

edited by sourcery-ai bot

Loading

sourcery-ai bot commented Dec 3, 2024 •

edited

Loading

Interacting with Sourcery

Customizing Your Experience

Getting Help

sourcery-ai bot left a comment

sourcery-ai bot Dec 3, 2024

wodny Dec 3, 2024

sourcery-ai bot Dec 3, 2024

sourcery-ai bot Dec 3, 2024

rtobar left a comment

rtobar Dec 23, 2024

rtobar Dec 23, 2024

rtobar Dec 23, 2024

rtobar commented Dec 24, 2024

		@@ -244,6 +252,14 @@ def _items_pipeline(backend, prefix, map_type, config):
		)


		def _prefixed_items_pipeline(backend, prefix, map_type, config):

feat: iterate over more than one prefix (#109) #127

Are you sure you want to change the base?

feat: iterate over more than one prefix (#109) #127

Conversation

wodny commented Dec 3, 2024 • edited by sourcery-ai bot Loading

Summary by Sourcery

sourcery-ai bot commented Dec 3, 2024 • edited Loading

Reviewer's Guide by Sourcery

Class diagram for the new prefixed_items functionality

File-Level Changes

Possibly linked issues

Interacting with Sourcery

Customizing Your Experience

Getting Help

sourcery-ai bot left a comment

Choose a reason for hiding this comment

sourcery-ai bot Dec 3, 2024

Choose a reason for hiding this comment

wodny Dec 3, 2024

Choose a reason for hiding this comment

sourcery-ai bot Dec 3, 2024

Choose a reason for hiding this comment

sourcery-ai bot Dec 3, 2024

Choose a reason for hiding this comment

rtobar left a comment

Choose a reason for hiding this comment

rtobar Dec 23, 2024

Choose a reason for hiding this comment

rtobar Dec 23, 2024

Choose a reason for hiding this comment

rtobar Dec 23, 2024

Choose a reason for hiding this comment

rtobar commented Dec 24, 2024

wodny commented Dec 3, 2024 •

edited by sourcery-ai bot

Loading

sourcery-ai bot commented Dec 3, 2024 •

edited

Loading