[Early stage prototype] Optimize calls using vectorcall, 35% reduction in call overhead #4250

EthanSteinberg · 2022-10-18T15:34:32Z

This PR is an early stage prototype of using Python 3.7's vectorcall support to reduce function call overhead in pybind11.

This is very early stage, it doesn't support all of pybind11's features (kwargs don't quite work) and is in really rough shape code / documentation wise.

I am using a simple benchmark, included in this PR under the folder pybench, to measure performance.

Time to call a C++ function 100000000 times:

Raw CPython API: 5.2974734380841255 seconds
Pybind11 master Branch: 17.06675560772419 seconds
Pybind11 this PR: 11.004007823765278 seconds

I believe I can get this down even further (to around 8 seconds or so) by optimizing the loader_life_support.

The main downsides of this approach:

We are storing temporary data in function_record objects, so we can't support multiple threads calling the same function_record at a time. This shouldn't matter though as that's always the case due to the GIL
For older versions of Python (<= 3.6), I will have to implement a fallback that will have a tiny drop in performance due to an extra python tuple and python dict allocation and construction.

What are people's thoughts? Should I fully flush out this PR and add full support for the remaining pybind11 functionality? Do people think this is a useful optimization to add to pybind11?

CC @rwgk

rwgk · 2022-10-18T16:26:42Z

For older versions of Python (<= 3.6)

3.6 is already ~10 months past EOL (https://endoflife.date/python). This PR is a good motivation to drop it. You could simply remove the corresponding CI jobs in .github/workflows/ci.yml

I only glanced through the code. At first sight it looks like a relatively small change, given what it does. If you want to drive this to completion, I'll look more, run sanitizers & global testing when you get the CI (minus 3.6) healthy here.

Under the smart_holder branch I already introduced a ubench (micro benchmark) directory. Could you please use the same name here, too? Actually, I think it would make sense to copy over the code and strip out the very few lines for smart_holder, to be added back in under the smart_holder branch (I could do that part). You could also remove the python/number_bucket.clif file.

FWIW ubench results from Feb 2021 are here

include/pybind11/pybind11.h

Skylion007 · 2022-10-18T16:48:48Z

@lalaland If you want to optimize function dispatch, you will want to take a look at this PR too: https://github.com/pybind/pybind11/pull/2013/files/0c8a13d27b7f36dbcaf91833252a5788c0ed429d. It's pretty stale, but it would be great if we could also somehow reduce the number of allocations by using std::array, std::bitset, etc as done in the above PR and get a further cumulative speedup. If that PR was up to date with master, I would gladly merge it along with this one.

One of the reasons @wjakob created https://github.com/wjakob/nanobind/ was to have full supported vectorcall dispatch even if that required the newest versions of Python.

I would definitely onboard with this PR even if we have to drop 3.6 support or if it causes minor binary size increase. We do need some benchmarks that show it's faster (which the linked PR also presented).

Skylion007 · 2022-10-18T17:10:44Z

Lalaland, to try shorten the commit log a bit for this PR, please do the following:

pip install pre-commit
# cd to git project root
pre-commit install

That will run pre-commit locally before every commit.

EthanSteinberg · 2022-10-18T17:13:09Z

Thanks for the feedback. I'll push this PR to completion. I highly recommend both of you turn off notifications for this PR as I'll probably have to spam the CI quite a bit to get this in a working shape on all platforms.

@Skylion007 I purposely don't run pre-commit to trigger a CI run. I don't have CI permissions, but the precommit bot does so if I can trigger the precommit bot, I get a nice CI run.

Skylion007 · 2022-10-18T17:32:56Z

@Skylion007 I purposely don't run pre-commit to trigger a CI run. I don't have CI permissions, but the precommit bot does so if I can trigger the precommit bot, I get a nice CI run.

@lalaland, Ah, once your other PR is merged that shouldn't be an issue anymore. (You should get permission if you have a merged commit).

Skylion007 · 2022-10-19T00:42:52Z

@lalaland You should now have CI permissions.

henryiii · 2022-10-19T23:18:14Z

3.8 added vectorcall. That’s why nanobind requires 3.8+. Totally okay to drop 3.6 any time, but I think we’d need to drop 3.7 too to get vectorcall, which I'm not okay with yet.

include/pybind11/detail/function_record.h

Skylion007

A lot of nits.

As I mentioned before, a lot of the slowdown in master is also from unnecessary memory allocations (std::vector, std::string, etc...) be wary of that as you build out this PR and take a look at the one I linked.

Also be wary of unused template args / such as we are a header only lib so we do want to be careful about slowdown/bloat from unnecessary template instationation. Of course it's fine if those template args are actually used (like for converting an std::vector into an std::array for n_args.

Also you only need the MSVC warning override around stuff that should be constexpr. Unfortunately, there is no easy way to override it in MSVC as the recommended solution is to use constexpr if etc, but that only works if you are > C++14 of course.

Skylion007 · 2022-10-24T14:44:37Z

include/pybind11/detail/common.h

+
+#else
+
+constexpr auto align = std::align;


Suggested change

constexpr auto align = std::align;

using align = std::align;

This doesn't seem to work? Are you sure this is the correct syntax?

Oh sorry, yeah that's the wrong syntax.

Suggested change

constexpr auto align = std::align;

using std::align;

See

pybind11/include/pybind11/detail/common.h

Line 637 in 09db644

using std::conditional_t;

for example

include/pybind11/detail/function_record.h

Skylion007 · 2022-10-24T14:50:07Z

include/pybind11/detail/function_record.h

+        module = std::string(get_child(0).module);
+        name = std::string(get_child(0).name);
+        current_scope = get_child(0).current_scope;
+        has_operator = get_child(0).has_operator;
+        is_constructor = get_child(0).is_constructor;


get_child(0) should be cached. It's weird to cache all these members from the child anyway. If you are going to do this instead of caching the child itself, then maybe have a setter that takes in a child object.

include/pybind11/detail/function_record.h

Skylion007 · 2022-10-24T14:53:26Z

include/pybind11/detail/function_record.h

+template <typename T>
+constexpr T constexpr_min(T a, T b) {
+    return a > b ? b : a;
+}


You could do a C++ version guard with this and do a using alias std::min if C++ >= 14. Some of these details might better level in detail/common.h .

include/pybind11/detail/function_record.h

Skylion007 · 2022-10-24T15:01:22Z

include/pybind11/detail/function_record.h

+    std::string name;
+    std::string desc;
+    object value;     ///< Associated Python object
+    bool convert : 1; ///< True if the argument is allowed to convert when loading
+    bool none : 1;    ///< True if None is allowed when loading
+
+    argument_record() : name(""), desc(""), value(), convert(true), none(true) {}


Suggested change

std::string name;

std::string desc;

object value; ///< Associated Python object

bool convert : 1; ///< True if the argument is allowed to convert when loading

bool none : 1; ///< True if None is allowed when loading

argument_record() : name(""), desc(""), value(), convert(true), none(true) {}

std::string name = "";

std::string desc = "";

object value(); ///< Associated Python object

bool convert : 1 = true; ///< True if the argument is allowed to convert when loading

bool none : 1 = true; ///< True if None is allowed when loading

argument_record() = default;

The bitfields mean that this doesn't work.

warning: default member initializers for bit-fields only available with '-std=c++2a' or '-std=gnu++2a'

Skylion007 · 2022-10-24T15:02:41Z

include/pybind11/attr.h

 };

 template <>
 struct process_attribute<buffer_protocol> : process_attribute_default<buffer_protocol> {
-    static void init(const buffer_protocol &, type_record *r) { r->buffer_protocol = true; }
+    template <typename RecordType>


Wow, these are a lot of template types that are just ignored. Is there a more efficient way to do this that doesn't cause a lot of template initialization? Like making the 2nd arg a void pointer or something?

Skylion007 · 2022-10-24T15:06:06Z

include/pybind11/detail/common.h

+    if (space < size + padding)
+        return nullptr;
+    space -= padding;
+    return ptr = reinterpret_cast<void *>(aligned);


Suggested change

return ptr = reinterpret_cast<void *>(aligned);

return ptr = static_cast<void *>(aligned);

This isn't support on centos7 for some reason.

include/pybind11/detail/python_compat.h

jblespiau · 2022-12-01T00:11:12Z

(FYI, if it's useful, I also extracted and polished https://github.com/jblespiau/pybind11_benchmarks from the nanobind repository, in hope to bring within pybind11 a more extensive benchmarking suite.)

commit 61c7699 Author: Ethan Steinberg <[email protected]> Date: Mon Nov 21 04:30:15 2022 +0000 test it commit 1e6b024 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 07:25:13 2022 +0000 Shoot, forgot to disable test commit e97060a Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 07:14:19 2022 +0000 Final run before I break out the VMs commit 1c0fb6d Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 06:46:54 2022 +0000 Fix a bunch commit 738fc7c Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 06:14:27 2022 +0000 Closer commit 3c3b476 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 05:54:25 2022 +0000 Silence commit 794bcf4 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 05:35:41 2022 +0000 Fix regex commit 759e6a6 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 05:26:33 2022 +0000 OOps commit e9d7f8a Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 05:20:31 2022 +0000 Next commit 0cef232 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 04:58:47 2022 +0000 Fix more warnings commit 1d20876 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 04:48:28 2022 +0000 Fix warnings commit 81a6f82 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 04:38:56 2022 +0000 Fix issues commit 6d44ec9 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 01:23:42 2022 +0000 Force it to work commit 359742c Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 01:18:09 2022 +0000 Fix scope issue commit 958441c Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 01:10:48 2022 +0000 Oops, more commit 42000d9 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 01:01:55 2022 +0000 Fix more warnings commit 5357d35 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 00:40:04 2022 +0000 Continue fixing ci commit e38960d Author: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Mon Oct 24 00:22:37 2022 +0000 style: pre-commit fixes commit e5e84e5 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 00:21:59 2022 +0000 Switch to other optional commit 553fd85 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 00:03:48 2022 +0000 Temp commit 1ee063f Author: Ethan Steinberg <[email protected]> Date: Sun Oct 23 22:47:08 2022 +0000 Next commit c62200c Author: Ethan Steinberg <[email protected]> Date: Sun Oct 23 17:28:29 2022 +0000 Temp commit 7cd12aa Author: Ethan Steinberg <[email protected]> Date: Sun Oct 23 13:04:28 2022 +0000 Store commit e65c573 Author: Ethan Steinberg <[email protected]> Date: Sun Oct 23 12:27:28 2022 +0000 Temp commit b2fe2f0 Author: Ethan Steinberg <[email protected]> Date: Sun Oct 23 01:25:03 2022 +0000 Reset commit 86e6b88 Author: Ethan Steinberg <[email protected]> Date: Sat Oct 22 23:58:44 2022 +0000 Next commit ee8010b Author: Ethan Steinberg <[email protected]> Date: Sat Oct 22 21:11:09 2022 +0000 Temp commit a2caf9f Author: Ethan Steinberg <[email protected]> Date: Sat Oct 22 04:25:54 2022 +0000 Next commit feeecc7 Author: Ethan Steinberg <[email protected]> Date: Sat Oct 22 03:06:48 2022 +0000 First commit Add benchmark Temp Next What CSontinue Next Temp style: pre-commit fixes

EthanSteinberg · 2022-12-11T05:32:21Z

Looks like this is very close to v1 being ready. Passes all the tests on regular Python and is quite fast to boot.

The main issue is that I need to do a lot more space optimization

Using @jblespiau's test harness

Time:
python: 0.18168139830231667
nanobind: 0.1628
old pybind11: 0.3840818479657173
this PR: 0.22408869490027428

The problem is space
old pybind11: 1.5625
this PR: 6.1168060302734375

rwgk · 2022-12-11T06:20:45Z

old pybind11: 1.5625
this PR: 6.1168060302734375

What exactly is 4 times bigger?

Sorry I don't have a lot of experience with micro benchmarks and binary size comparisons.

EthanSteinberg · 2022-12-11T12:38:52Z

What exactly is 4 times bigger?

The .so files generated when the extensions are complied.

Fixing this should be quite easy. I just need to create non inlined helper functions in several places.

rwgk · 2022-12-11T22:07:39Z

What exactly is 4 times bigger?

The .so files generated when the extensions are complied.

Wow ... I don't think I've ever seen such an explosion of binary sizes.

I just need to create non inlined helper functions in several places.

Hm ... I did measurements back in Aug 2021:

https://docs.google.com/spreadsheets/d/1bkg1HtmU1AjotPTXCCLSbTBGARRet8W91noh1DnKOHU/edit#gid=0

noinline hardly made a difference, but maybe your situation is somehow special?

EthanSteinberg · 2022-12-11T23:13:37Z

noinline hardly made a difference, but maybe your situation is somehow special?

The situation here is special because of the templates involved. So pulling an even small amount of code out of the template (and forcing it out of the template with non-inline) can have huge savings because of the hundreds of copies.

Wow ... I don't think I've ever seen such an explosion of binary sizes.

Keep in mind that these microbenchmarks are sorta contrived. In order to make the sizes / compile times more significant the same code is copied 719 times.

Skylion007 · 2022-12-17T17:35:26Z

noinline hardly made a difference, but maybe your situation is somehow special?

The situation here is special because of the templates involved. So pulling an even small amount of code out of the template (and forcing it out of the template with non-inline) can have huge savings because of the hundreds of copies.

Wow ... I don't think I've ever seen such an explosion of binary sizes.

Keep in mind that these microbenchmarks are sorta contrived. In order to make the sizes / compile times more significant the same code is copied 719 times.

This was also the case in the PR I linked here: #2013 they were eventually able to resolve it.

henryiii · 2024-06-23T04:49:43Z

Python 3.6 now dropped! Planning on dropping 3.7 immediately after the next release, too. That should make this much easier.

EthanSteinberg marked this pull request as draft October 18, 2022 16:04

Skylion007 reviewed Oct 18, 2022

View reviewed changes

include/pybind11/pybind11.h Outdated Show resolved Hide resolved

EthanSteinberg force-pushed the vectorcall branch 3 times, most recently from 717595f to e5e84e5 Compare October 24, 2022 00:22

Skylion007 reviewed Oct 24, 2022

View reviewed changes

include/pybind11/detail/function_record.h Outdated Show resolved Hide resolved

Skylion007 requested changes Oct 24, 2022

View reviewed changes

Skylion007 reviewed Oct 24, 2022

View reviewed changes

include/pybind11/detail/python_compat.h Show resolved Hide resolved

EthanSteinberg mentioned this pull request Oct 25, 2022

Introduce a new style of warning suppression based on push/pop #4285

Merged

EthanSteinberg force-pushed the vectorcall branch from 61c7699 to 3d1d32e Compare November 21, 2022 04:59

EthanSteinberg added 7 commits December 11, 2022 00:59

Minor fixes

d9d8ba6

Compat

3564e72

Temp

1be1483

Tmp

0c5fd3f

Next

041280d

Run through tests

de8202e

EthanSteinberg force-pushed the vectorcall branch from d930755 to de8202e Compare December 11, 2022 01:10

EthanSteinberg added 4 commits December 11, 2022 01:22

Fix tests

7fa12c0

More fixes

84b01f0

Next

981a1d1

More test fixes

15c9fd6

EthanSteinberg added 5 commits December 11, 2022 03:12

Memory leak fix

87636c6

Fix types

067e38e

Fix typing

0e71705

Revert

59817c3

Missed one

b1683d7

EthanSteinberg added 2 commits December 11, 2022 13:24

Try to fix windows test

2d5ba92

Other way

f2ba042

rwgk mentioned this pull request Feb 11, 2023

FWD pybind11 google/pybind11clif#4250

Closed

EthanSteinberg closed this Jun 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Early stage prototype] Optimize calls using vectorcall, 35% reduction in call overhead #4250

[Early stage prototype] Optimize calls using vectorcall, 35% reduction in call overhead #4250

EthanSteinberg commented Oct 18, 2022 •

edited

Loading

rwgk commented Oct 18, 2022

Skylion007 commented Oct 18, 2022 •

edited

Loading

Skylion007 commented Oct 18, 2022 •

edited

Loading

EthanSteinberg commented Oct 18, 2022 •

edited

Loading

Skylion007 commented Oct 18, 2022 •

edited

Loading

Skylion007 commented Oct 19, 2022

henryiii commented Oct 19, 2022 •

edited

Loading

Skylion007 left a comment

Skylion007 Oct 24, 2022

EthanSteinberg Dec 11, 2022

Skylion007 Dec 17, 2022

Skylion007 Oct 24, 2022

Skylion007 Oct 24, 2022

Skylion007 Oct 24, 2022

EthanSteinberg Dec 11, 2022 •

edited

Loading

Skylion007 Oct 24, 2022

Skylion007 Oct 24, 2022

EthanSteinberg Dec 11, 2022

jblespiau commented Dec 1, 2022

EthanSteinberg commented Dec 11, 2022

rwgk commented Dec 11, 2022 •

edited

Loading

EthanSteinberg commented Dec 11, 2022

rwgk commented Dec 11, 2022

EthanSteinberg commented Dec 11, 2022 •

edited

Loading

Skylion007 commented Dec 17, 2022

henryiii commented Jun 23, 2024

	return ptr = reinterpret_cast<void *>(aligned);
	return ptr = static_cast<void *>(aligned);

[Early stage prototype] Optimize calls using vectorcall, 35% reduction in call overhead #4250

[Early stage prototype] Optimize calls using vectorcall, 35% reduction in call overhead #4250

Conversation

EthanSteinberg commented Oct 18, 2022 • edited Loading

rwgk commented Oct 18, 2022

Skylion007 commented Oct 18, 2022 • edited Loading

Skylion007 commented Oct 18, 2022 • edited Loading

EthanSteinberg commented Oct 18, 2022 • edited Loading

Skylion007 commented Oct 18, 2022 • edited Loading

Skylion007 commented Oct 19, 2022

henryiii commented Oct 19, 2022 • edited Loading

Skylion007 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

EthanSteinberg Dec 11, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jblespiau commented Dec 1, 2022

EthanSteinberg commented Dec 11, 2022

rwgk commented Dec 11, 2022 • edited Loading

EthanSteinberg commented Dec 11, 2022

rwgk commented Dec 11, 2022

EthanSteinberg commented Dec 11, 2022 • edited Loading

Skylion007 commented Dec 17, 2022

henryiii commented Jun 23, 2024

EthanSteinberg commented Oct 18, 2022 •

edited

Loading

Skylion007 commented Oct 18, 2022 •

edited

Loading

Skylion007 commented Oct 18, 2022 •

edited

Loading

EthanSteinberg commented Oct 18, 2022 •

edited

Loading

Skylion007 commented Oct 18, 2022 •

edited

Loading

henryiii commented Oct 19, 2022 •

edited

Loading

EthanSteinberg Dec 11, 2022 •

edited

Loading

rwgk commented Dec 11, 2022 •

edited

Loading

EthanSteinberg commented Dec 11, 2022 •

edited

Loading