feat: make stl.h `list|set|map_caster` more user friendly. #4686
Conversation
…rc.ptr()`). Note for completeness: This is a more conservative change than google/pybind11clif#30042
include/pybind11/stl.h (Outdated)

```cpp
if (!convert) {
    return false;
}
if (PyGen_Check(src.ptr())) {
```
Shouldn't we really be supporting a `py::iterable` object instead? I think a generator would count here and simplify the code.
I definitely like the general thinking.
Please give me a little more time. I'm in the middle of what I am thinking of as "data-driven cleanup".
PyCLIF is very far on the permissive side of the spectrum; that's what's giving me the data: what is being used, and how much.
pybind11 is very far on the strict side.
A glimpse of where I am right now: PyCLIF supports passing any iterable for a C++ set. I think that's going too far to the permissive side of the spectrum. So I'm going through cleaning that up, even though it's a lot of legwork. But I believe it makes the code safer and more readable.
I just pushed the equivalent of the behavior I settled on for PyCLIF. I'm still working on rolling out the PyCLIF behavior changes internally. Only after that can I submit internally and then export the changed PyCLIF to github.
High-level: I'm converging the behaviors
- by changing PyCLIF to be stricter & safer,
- while this PR makes pybind11 more user friendly without crossing into unsafe territory.
> Shouldn't we really be supporting a `py::iterable` object instead?

That's what PyCLIF used to do, but it has pitfalls that led to lots of unclean test code: Python `{}` (an empty dict) passed as input for `std::vector` and `std::set`, or `[]` passed as input for `std::set` and `std::map`.
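The pitfall can be reproduced with plain Python builtins (no pybind11 involved): iterating a dict yields its keys, and any empty iterable yields nothing, so under the old permissive rule inputs of the "wrong" type converted silently instead of raising TypeError. A small illustration:

```python
# Iterating a dict yields its keys; an empty dict yields nothing.
# Under the old permissive rule, any iterable was accepted, so these
# "wrong type" inputs converted silently instead of failing:
assert list({}) == []            # empty dict -> empty "vector", no error
assert set([]) == set()          # empty list -> empty "set", no error
assert list({"a": 1}) == ["a"]   # non-empty dict -> "vector" of keys, likely a bug
```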
The originally very permissive PyCLIF behavior gave me a fairly unique set of data: I could see what's being used a lot, what makes sense (IMO), and what doesn't make sense or seems unsafe (again IMO). The new code block near the top of stl.h, transferred here from PyCLIF, encodes my conclusions. With that the internal cleanup turned out to be relatively small, only around 50 changes I had to mail out. This is part of the change description for those (slightly edited):
Motivations for the behavior change:

- Improves safety / prevents accidents:
  - For C++ `std::set` (and `std::unordered_set`): The potentially reducing conversion from a Python sequence (e.g. Python `list` or `tuple`) to a C++ set must be explicit.
  - For C++ `std::vector`: In very rare cases (`itertools.chain`) the conversion from a Python iterable must be explicit.
- Improves readability: Python literals to be passed as C++ sets must be set literals.
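The "potentially reducing" hazard mentioned above can be demonstrated with builtins alone: converting a sequence to a set can silently drop elements, which is why the conversion is now required to be spelled out at the call site. A minimal sketch:

```python
data = [3, 15, 3]           # a sequence containing a duplicate
assert len(data) == 3
assert len(set(data)) == 2  # sequence -> set silently drops one element
# Requiring the caller to write set(data) makes the reduction visible.
```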
I'll finish the rollout of the PyCLIF change, then come back here for polishing maybe and further expanding the tests. Please let me know any high-level suggestions.
The PyCLIF behavior change (with prior cleanup around the codebase) is now complete: google/clif@30c5488
Internally I'm testing (globally) PyCLIF-pybind11 with this PR. Unfortunately some unrelated problem slipped in that I need to troubleshoot first, before I can convince myself that this PR actually fixes all pybind11/stl.h vs PyCLIF behavior differences.
…into `PyObjectTypeIsConvertibleToStdVector()`.
…`, `std::map` conversions.

**Please request a rollback only as a last resort**, to buy time for working out a forward fix. Forward fixes are almost certainly trivial: Based on the Python traceback, insert an explicit conversion to `tuple()`, `set()`, or `dict()` (e.g. cl/538602196), or change a Python `set` or `list` literal to the correct type (e.g. cl/538563104).

Motivations for the behavior changes:

* Improves safety / prevents accidents:
  * For C++ `std::set` (and `std::unordered_set`): The potentially reducing conversion from a Python sequence (e.g. Python `list` or `tuple`) to a C++ set must be explicit.
  * For C++ `std::vector` (and `std::array`): In very rare cases (see b/284333274#comment11) the conversion from a Python iterable must be explicit.
  * For C++ `std::map` (and `std::unordered_map`): The conditions added in this CL (`clif::PyObjectTypeIsConvertibleToStdMap()`) did not break any existing unit tests.
* Improves readability: Python literals to be passed as C++ sets must be set literals.
* Note for completeness: A Python set can still be passed to a C++ `std::vector`. The rationale is that this conversion is not reducing. Implicit conversions of this kind are also fairly commonly used in google3, therefore enforcing explicit conversions has an unfavorable cost : benefit ratio.
* Original trigger for this work: Converge pybind11 & PyCLIF behavior (go/pyclif_pybind11_fusion).
  * Note that pybind11 is very far on the strict side of the spectrum. See also pybind/pybind11#4686.
  * PyCLIF was originally extremely far on the permissive side of the spectrum. This CL is already the 2nd cleanup step: prior to cl/525187106 a conversion from any zero-length iterator (e.g. the `{}` empty dict literal) to a C++ `std::vector` or `std::set` was possible.

This CL was started from cl/535028028, but the changes under google3/third_party/absl/python were backed out, and the change for `std::array` in stltypes.h was added.

OSS build tested interactively with copybara to local directory & cmake commands in Ubuntu 20.04 / Python 3.9 docker container.

PiperOrigin-RevId: 538823846
Close-reopen to trigger GHA, after adding python dev label.
…ertibleTo*()` implementations.
Original PR title and description (outdated)

Title: WIP: Change list_caster (stl.h) to also accept generator objects.

Description: WIP — It is still undecided if this is the best compromise. An easy way to understand the pybind11 behavior change is to look at the small test change under commit 7d30378:

```diff
  fn = m.pass_std_vector_int
- with pytest.raises(TypeError):
-     fn(i for i in range(3))
+ assert fn(i for i in range(3)) == 3
```

How does this help? This PR resolves a very large number of global test failures in connection with the PyCLIF-pybind11 integration work. The exact number is still to be determined, but estimated to exceed 200k. The number of root causes is difficult to estimate; order of magnitude 100, probably.

The large number of existing use cases is also a signal that the feature is definitely useful in aggregate. For full context: The same test passes with this PR.
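For reference, the generator semantics adopted here mirror plain `list()`: a generator object is consumed exactly once and is left exhausted afterwards (the test `test_list_caster_fully_consumes_generator_object` checks the same property on the binding side). A pure-Python sketch:

```python
gen = (i for i in range(3))
assert list(gen) == [0, 1, 2]  # first pass fully consumes the generator
assert list(gen) == []         # a second pass yields nothing
```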
Title changed: "WIP: Change `list_caster` (stl.h) to also accept generator objects." → "feat: make stl.h `list|set|map_caster` more user friendly."
Demonstration of behavior changes

New tests in test_stl.py adjusted to work with current master (@ 0620d71):

```diff
diff --git a/tests/test_stl.py b/tests/test_stl.py
index 43615843..4bc4ce6e 100644
--- a/tests/test_stl.py
+++ b/tests/test_stl.py
@@ -386,11 +386,11 @@ def test_return_vector_bool_raw_ptr():
 )
 def test_pass_std_vector_int(fn, offset):
     assert fn([7, 13]) == 140 + offset
-    assert fn({6, 2}) == 116 + offset
-    assert fn({"x": 8, "y": 11}.values()) == 138 + offset
-    assert fn({3: None, 9: None}.keys()) == 124 + offset
-    assert fn(i for i in [4, 17]) == 142 + offset
-    assert fn(map(lambda i: i * 3, [8, 7])) == 190 + offset  # noqa: C417
+    assert fn(tuple({6, 2})) == 116 + offset
+    assert fn(tuple({"x": 8, "y": 11}.values())) == 138 + offset
+    assert fn(tuple({3: None, 9: None}.keys())) == 124 + offset
+    assert fn(tuple(i for i in [4, 17])) == 142 + offset
+    assert fn(tuple(map(lambda i: i * 3, [8, 7]))) == 190 + offset  # noqa: C417
     with pytest.raises(TypeError):
         fn({"x": 0, "y": 1})
     with pytest.raises(TypeError):
@@ -399,8 +399,8 @@ def test_pass_std_vector_int(fn, offset):
 def test_pass_std_vector_pair_int():
     fn = m.pass_std_vector_pair_int
-    assert fn({1: 2, 3: 4}.items()) == 406
-    assert fn(zip([5, 17], [13, 9])) == 2222
+    assert fn(tuple({1: 2, 3: 4}.items())) == 406
+    assert fn(tuple(zip([5, 17], [13, 9]))) == 2222

 def test_list_caster_fully_consumes_generator_object():
@@ -416,7 +416,7 @@ def test_list_caster_fully_consumes_generator_object():
 def test_pass_std_set_int():
     fn = m.pass_std_set_int
     assert fn({3, 15}) == 254
-    assert fn({5: None, 12: None}.keys()) == 251
+    assert fn(set({5: None, 12: None}.keys())) == 251
     with pytest.raises(TypeError):
         fn([])
     with pytest.raises(TypeError):
@@ -466,7 +466,7 @@ class FakePyMappingGenObj(FakePyMappingMissingItems):
 def test_pass_std_map_int():
     fn = m.pass_std_map_int
     assert fn({1: 2, 3: 4}) == 4506
-    assert fn(FakePyMappingWithItems()) == 3507
+    assert fn(dict(FakePyMappingWithItems().items())) == 3507
     with pytest.raises(TypeError):
         fn(FakePyMappingMissingItems())
     with pytest.raises(TypeError):
```
…nally consistent with `PyMapping_Check()`
…pybind11 into list_caster_pass_generator
```cpp
}
return (PyGen_Check(obj) != 0) || (PyAnySet_Check(obj) != 0)
       || PyObjectIsInstanceWithOneOfTpNames(
              obj, {"dict_keys", "dict_values", "dict_items", "map", "zip"});
```
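For reference, the `tp_name` strings compared against here correspond to the following built-in types, which can be verified from Python (pure illustration, not part of the PR):

```python
import types

# The five tp_names the gate compares against:
assert type({}.keys()).__name__ == "dict_keys"
assert type({}.values()).__name__ == "dict_values"
assert type({}.items()).__name__ == "dict_items"
assert type(map(str, [])).__name__ == "map"
assert type(zip()).__name__ == "zip"

# The other branches of the C++ condition:
assert isinstance((i for i in range(3)), types.GeneratorType)  # PyGen_Check
assert isinstance(frozenset(), frozenset)  # PyAnySet_Check also matches frozenset
```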
This requires explicit subclasses, rather than any object that conforms to `collections.abc.Sequence`? (Likewise for `collections.abc.Mapping` and `collections.abc.Set` below.) The Pythonic way to do this would be to ask if the object has the methods required by these protocols, rather than checking explicit inheritance.
Specifically, from https://docs.python.org/3/library/collections.abc.html:

- `Sequence`: `__getitem__`, `__len__`, `__contains__`, `__iter__`, `__reversed__`, `index`, and `count`.
- `Set`: `__contains__`, `__iter__`, `__len__`, `__le__`, `__lt__`, `__eq__`, `__ne__`, `__gt__`, `__ge__`, `__and__`, `__or__`, `__sub__`, `__xor__`, and `isdisjoint`.
- `Mapping`: `__getitem__`, `__iter__`, `__len__`, `__contains__`, `keys`, `items`, `values`, `get`, `__eq__`, and `__ne__`.
In addition to what I wrote yesterday, there was actually another concern in the back of my mind: the current implementation is super fast, with direct `strcmp` calls for five (`vector`) or one (`set`) very short string(s).

In contrast, `hasattr` is far more complex and probably orders of magnitude slower, although I wouldn't really worry about that in the usual bigger context (it'll not even be a blip on the radar in most cases). The extra code complexity is the bigger worry.

So there is a code complexity and runtime overhead cost.
What is the benefit?
That brings me back to my "large-scale field experiment" experience hinted at in the PR description. In a nutshell:

- Current pybind11 is super restrictive.
- Original PyCLIF was super permissive.
I had this huge pile of data from the wild contrasting the two. Based on that I could see very clearly that there is very little to gain from being more permissive than what this PR implements. IOW I don't think paying the code complexity and runtime overhead cost will help in any significant way.
Could you please let me know if you have a different opinion? — It will probably cost me a whole day worth of effort to implement and fully test (global testing) attribute inspection. We will also have to agree on what subsets of the attributes we actually want to test against. I expect that there will be some devils in the details.
I think having a short path (check inheritance first) could alleviate much of the speed problem: the slower "correct" check could be done only if this is not a subclass of X, for common X.

Also, this list is somewhat arbitrary: you probably included things (like `dict_keys`? `dict_items`?) based on usage (in PyCLIF), rather than making a comprehensive list of everything iterable in Python?

Also, I believe checking the attributes on a class (you don't need to check them on the instance) is pretty fast if you avoid the descriptor protocol (`typing.runtime_checkable` switched to this in 3.12, which made it much faster).
From a usage standpoint, implementing the Protocol specified by these ABC's is supposed to make your class work wherever the standard library classes work, so it seems ideal if this could follow that. Not required (and we could start with this if you really want to), but it seems like the correct thing to do.
> We will also have to agree on what subsets of the attributes we actually want to test against.

This is concretely specified in `collections.abc`. No arguments needed.
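The two-tier check discussed here (cheap inheritance test first, structural attribute test on the class as fallback) could be sketched as follows; `looks_like_mapping` and the chosen attribute subset are illustrative assumptions, not pybind11 or PyCLIF API:

```python
# Attribute subset chosen for illustration; collections.abc.Mapping lists more.
_MAPPING_ATTRS = ("__getitem__", "__iter__", "__len__",
                  "keys", "items", "values", "get")

def looks_like_mapping(obj) -> bool:
    # Fast path: exact or inherited dict covers the overwhelmingly common case.
    if isinstance(obj, dict):
        return True
    # Slow path: look up the required attributes on the class, not the
    # instance, similar in spirit to what typing.runtime_checkable does.
    cls = type(obj)
    return all(hasattr(cls, name) for name in _MAPPING_ATTRS)

assert looks_like_mapping({"a": 1})
assert not looks_like_mapping([1, 2, 3])  # lists lack keys/items/values/get
```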
> I think having a short path (check inheritance first) could alleviate much of the speed problem
I agree. Which basically means: this PR plus follow-on work to make the check more permissive.
Do you think it's reasonable then to merge this PR, and do the fallback attribute inspection in a separate PR? To be completely honest, I wouldn't want to prioritize this, for the reasons explained before. It's a significant amount of work and extra code complexity, and I don't think very many people will ever notice the difference in practice. But if someone else wants to invest the effort, I'd be on board.
> This is concretely specified in `collections.abc`. No arguments needed.
I see, sounds good to me, but from a purely practical viewpoint, assuming that most duck-typing only implements a subset of the attributes, insisting on the full set of attributes here probably means nobody will ever see a difference in practice, unless they are specifically looking to pass the checks here.
Thanks for the feedback @henryiii! To explain briefly: I was completely focused on making pybind11 more permissive "just enough" to avoid large-scale changes around the Google codebase, changes that would probably be viewed as annoying regressions by users. I agree that aligning with
…t this minute) pybind#4686
See pybind#4686 for background.
@nevedaren (new owner of Python/C++ bindings at Google) FYI

Hi @henryiii, @Skylion007, @EthanSteinberg, this is the second of two PRs that I could not easily remove from the Google codebase (the other one was PR #4597). I'm coming back to this PR at this time because I'm leaving Google soon. I'm working intensely on making the hand-over as smooth as possible, and on leaving pybind11 at Google in an easily maintainable state.

A few days ago (Aug 7), I already switched Google back from the github.com/google fork to the new (#5257) smart_holder branch, but I had to carry this PR as a patch. Merging it now will remove that last patch.

Please let me know if you think this could cause a problem. I believe problems are very unlikely, because this PR has been in production use at Google for over a year, and it was reviewed by Google engineers before it was deployed.

In addition to what I wrote in the PR description above, in the meantime new client code (manually written pybind11 bindings) became dependent on this PR. It wouldn't be too difficult to mail out workarounds (explicitly inserting

I don't really want to do extra work and take that risk, especially because it seems that the feature change is actually considered useful: Looking at the comments here again (in particular this), I believe the feedback implies that eliminating the need for explicit
I recommend making the mentioned followup at some point, but otherwise I think it's fine.
Thanks Henry!
Description
(The equivalent of this PR was merged as google/pybind11clif#30045 on July 17, 2023.)
With this PR, it is no longer necessary to explicitly convert Python iterables to `tuple()`, `set()`, or `dict()` in several common situations. See #4686 (comment) below for a more complete demonstration of the behavior differences. (Note that the diff in that comment is reversed: what is `+` here is `-` there, and vice versa.)

This PR strikes a carefully crafted compromise between being user friendly and being safe. This compromise was informed by tens of thousands of use cases in the wild (the Google codebase, which includes a very large number of open-source third-party packages). Concretely, in connection with the PyCLIF-pybind11 integration work, the behaviors of PyCLIF (http://github.com/google/clif) and pybind11 needed to be converged.

Originally PyCLIF was extremely far on the permissive side of the spectrum, accepting any Python iterable as input for a C++ `vector`/`set`/`map` argument, as long as the elements were convertible. The obvious (in hindsight) problem was that any empty Python iterable could be passed to any of these C++ types, e.g. `{}` was accepted for C++ `vector`/`set` arguments, or `[]` for C++ `map` arguments. To fix this, the code base was cleaned up, and gating functions were added to enforce the desired behavior.

The exact same gating functions are adopted here. Without this behavior change in pybind11, it would be necessary to insert a very large number of explicit `tuple()`, `set()`, `dict()` conversions in the Google codebase to complete the PyCLIF-pybind11 integration work. While it is very easy to tell new users to pass e.g. `tuple(iterable)` instead of `iterable`, it is not easy to convince hundreds of happy existing PyCLIF users to accept retroactively inserting the explicit conversions. The obvious question that will come back: Why do we have to clutter our code like this? There is no good reason. While the exact implementation of the gating functions may seem a little arbitrary at first glance, it is in fact informed by what amounts to a large-scale field experiment with the originally very permissive behavior of PyCLIF.

Suggested changelog entry: