Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Early stage prototype] Optimize calls using vectorcall, 35% reduction in call overhead #4250

Closed
wants to merge 18 commits into from

Conversation

EthanSteinberg
Copy link
Collaborator

@EthanSteinberg EthanSteinberg commented Oct 18, 2022

This PR is an early stage prototype of using Python 3.7's vectorcall support to reduce function call overhead in pybind11.

This is very early stage, it doesn't support all of pybind11's features (kwargs don't quite work) and is in really rough shape code / documentation wise.

I am using a simple benchmark, included in this PR under the folder pybench, to measure performance.

Time to call a C++ function 100000000 times:

Raw CPython API: 5.2974734380841255 seconds
Pybind11 master Branch: 17.06675560772419 seconds
Pybind11 this PR: 11.004007823765278 seconds

I believe I can get this down even further (to around 8 seconds or so) by optimizing the loader_life_support.

The main downsides of this approach:

  • We are storing temporary data in function_record objects, so we can't support multiple threads calling the same function_record at a time. This shouldn't matter though as that's always the case due to the GIL

  • For older versions of Python (<= 3.6), I will have to implement a fallback that will have a tiny drop in performance due to an extra python tuple and python dict allocation and construction.

What are people's thoughts? Should I fully flush out this PR and add full support for the remaining pybind11 functionality? Do people think this is a useful optimization to add to pybind11?

CC @rwgk

@EthanSteinberg EthanSteinberg marked this pull request as draft October 18, 2022 16:04
@rwgk
Copy link
Collaborator

rwgk commented Oct 18, 2022

For older versions of Python (<= 3.6)

3.6 is already ~10 months past EOL (https://endoflife.date/python). This PR is a good motivation to drop it. You could simply remove the corresponding CI jobs in .github/workflows/ci.yml

I only glanced through the code. At first sight it looks like a relatively small change, given what it does. If you want to drive this to completion, I'll look more, run sanitizers & global testing when you get the CI (minus 3.6) healthy here.

Under the smart_holder branch I already introduced a ubench (micro benchmark) directory. Could you please use the same name here, too? Actually, I think it would make sense to copy over the code and strip out the very few lines for smart_holder, to be added back in under the smart_holder branch (I could do that part). You could also remove the python/number_bucket.clif file.

FWIW ubench results from Feb 2021 are here

@Skylion007
Copy link
Collaborator

Skylion007 commented Oct 18, 2022

@lalaland If you want to optimize function dispatch, you will want to take a look at this PR too: https://github.com/pybind/pybind11/pull/2013/files/0c8a13d27b7f36dbcaf91833252a5788c0ed429d. It's pretty stale, but it would be great if we could also somehow reduce the number of allocations by using std::array, std::bitset, etc as done in the above PR and get a further cumulative speedup. If that PR was up to date with master, I would gladly merge it along with this one.

One of the reasons @wjakob created https://github.com/wjakob/nanobind/ was to have full supported vectorcall dispatch even if that required the newest versions of Python.

I would definitely onboard with this PR even if we have to drop 3.6 support or if it causes minor binary size increase. We do need some benchmarks that show it's faster (which the linked PR also presented).

@Skylion007
Copy link
Collaborator

Skylion007 commented Oct 18, 2022

Lalaland, to try shorten the commit log a bit for this PR, please do the following:

pip install pre-commit
# cd to git project root
pre-commit install

That will run pre-commit locally before every commit.

@EthanSteinberg
Copy link
Collaborator Author

EthanSteinberg commented Oct 18, 2022

Thanks for the feedback. I'll push this PR to completion. I highly recommend both of you turn off notifications for this PR as I'll probably have to spam the CI quite a bit to get this in a working shape on all platforms.

@Skylion007 I purposely don't run pre-commit to trigger a CI run. I don't have CI permissions, but the precommit bot does so if I can trigger the precommit bot, I get a nice CI run.

@Skylion007
Copy link
Collaborator

Skylion007 commented Oct 18, 2022

@Skylion007 I purposely don't run pre-commit to trigger a CI run. I don't have CI permissions, but the precommit bot does so if I can trigger the precommit bot, I get a nice CI run.

@lalaland, Ah, once your other PR is merged that shouldn't be an issue anymore. (You should get permission if you have a merged commit).

@Skylion007
Copy link
Collaborator

@lalaland You should now have CI permissions.

@henryiii
Copy link
Collaborator

henryiii commented Oct 19, 2022

3.8 added vectorcall. That’s why nanobind requires 3.8+. Totally okay to drop 3.6 any time, but I think we’d need to drop 3.7 too to get vectorcall, which I'm not okay with yet.

@EthanSteinberg EthanSteinberg force-pushed the vectorcall branch 3 times, most recently from 717595f to e5e84e5 Compare October 24, 2022 00:22
Copy link
Collaborator

@Skylion007 Skylion007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of nits.

As I mentioned before, a lot of the slowdown in master is also from unnecessary memory allocations (std::vector, std::string, etc...) be wary of that as you build out this PR and take a look at the one I linked.

Also be wary of unused template args / such as we are a header only lib so we do want to be careful about slowdown/bloat from unnecessary template instationation. Of course it's fine if those template args are actually used (like for converting an std::vector into an std::array for n_args.

Also you only need the MSVC warning override around stuff that should be constexpr. Unfortunately, there is no easy way to override it in MSVC as the recommended solution is to use constexpr if etc, but that only works if you are > C++14 of course.


#else

constexpr auto align = std::align;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
constexpr auto align = std::align;
using align = std::align;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to work? Are you sure this is the correct syntax?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, yeah that's the wrong syntax.

Suggested change
constexpr auto align = std::align;
using std::align;

See

using std::conditional_t;
for example

include/pybind11/detail/function_record.h Outdated Show resolved Hide resolved
Comment on lines +480 to +483
module = std::string(get_child(0).module);
name = std::string(get_child(0).name);
current_scope = get_child(0).current_scope;
has_operator = get_child(0).has_operator;
is_constructor = get_child(0).is_constructor;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_child(0) should be cached. It's weird to cache all these members from the child anyway. If you are going to do this instead of caching the child itself, then maybe have a setter that takes in a child object.

include/pybind11/detail/function_record.h Show resolved Hide resolved
include/pybind11/detail/function_record.h Outdated Show resolved Hide resolved
Comment on lines +23 to +26
template <typename T>
constexpr T constexpr_min(T a, T b) {
return a > b ? b : a;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could do a C++ version guard with this and do a using alias std::min if C++ >= 14. Some of these details might better level in detail/common.h .

include/pybind11/detail/function_record.h Outdated Show resolved Hide resolved
Comment on lines +39 to +45
std::string name;
std::string desc;
object value; ///< Associated Python object
bool convert : 1; ///< True if the argument is allowed to convert when loading
bool none : 1; ///< True if None is allowed when loading

argument_record() : name(""), desc(""), value(), convert(true), none(true) {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::string name;
std::string desc;
object value; ///< Associated Python object
bool convert : 1; ///< True if the argument is allowed to convert when loading
bool none : 1; ///< True if None is allowed when loading
argument_record() : name(""), desc(""), value(), convert(true), none(true) {}
std::string name = "";
std::string desc = "";
object value(); ///< Associated Python object
bool convert : 1 = true; ///< True if the argument is allowed to convert when loading
bool none : 1 = true; ///< True if None is allowed when loading
argument_record() = default;

Copy link
Collaborator Author

@EthanSteinberg EthanSteinberg Dec 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bitfields mean that this doesn't work.

warning: default member initializers for bit-fields only available with '-std=c++2a' or '-std=gnu++2a'

};

template <>
struct process_attribute<buffer_protocol> : process_attribute_default<buffer_protocol> {
static void init(const buffer_protocol &, type_record *r) { r->buffer_protocol = true; }
template <typename RecordType>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, these are a lot of template types that are just ignored. Is there a more efficient way to do this that doesn't cause a lot of template initialization? Like making the 2nd arg a void pointer or something?

if (space < size + padding)
return nullptr;
space -= padding;
return ptr = reinterpret_cast<void *>(aligned);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return ptr = reinterpret_cast<void *>(aligned);
return ptr = static_cast<void *>(aligned);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't support on centos7 for some reason.

@jblespiau
Copy link
Contributor

(FYI, if it's useful, I also extracted and polished https://github.com/jblespiau/pybind11_benchmarks from the nanobind repository, in hope to bring within pybind11 a more extensive benchmarking suite.)

commit 61c7699
Author: Ethan Steinberg <[email protected]>
Date:   Mon Nov 21 04:30:15 2022 +0000

    test it

commit 1e6b024
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 07:25:13 2022 +0000

    Shoot, forgot to disable test

commit e97060a
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 07:14:19 2022 +0000

    Final run before I break out the VMs

commit 1c0fb6d
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 06:46:54 2022 +0000

    Fix a bunch

commit 738fc7c
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 06:14:27 2022 +0000

    Closer

commit 3c3b476
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 05:54:25 2022 +0000

    Silence

commit 794bcf4
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 05:35:41 2022 +0000

    Fix regex

commit 759e6a6
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 05:26:33 2022 +0000

    OOps

commit e9d7f8a
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 05:20:31 2022 +0000

    Next

commit 0cef232
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 04:58:47 2022 +0000

    Fix more warnings

commit 1d20876
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 04:48:28 2022 +0000

    Fix warnings

commit 81a6f82
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 04:38:56 2022 +0000

    Fix issues

commit 6d44ec9
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 01:23:42 2022 +0000

    Force it to work

commit 359742c
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 01:18:09 2022 +0000

    Fix scope issue

commit 958441c
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 01:10:48 2022 +0000

    Oops, more

commit 42000d9
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 01:01:55 2022 +0000

    Fix more warnings

commit 5357d35
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 00:40:04 2022 +0000

    Continue fixing ci

commit e38960d
Author: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Date:   Mon Oct 24 00:22:37 2022 +0000

    style: pre-commit fixes

commit e5e84e5
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 00:21:59 2022 +0000

    Switch to other optional

commit 553fd85
Author: Ethan Steinberg <[email protected]>
Date:   Mon Oct 24 00:03:48 2022 +0000

    Temp

commit 1ee063f
Author: Ethan Steinberg <[email protected]>
Date:   Sun Oct 23 22:47:08 2022 +0000

    Next

commit c62200c
Author: Ethan Steinberg <[email protected]>
Date:   Sun Oct 23 17:28:29 2022 +0000

    Temp

commit 7cd12aa
Author: Ethan Steinberg <[email protected]>
Date:   Sun Oct 23 13:04:28 2022 +0000

    Store

commit e65c573
Author: Ethan Steinberg <[email protected]>
Date:   Sun Oct 23 12:27:28 2022 +0000

    Temp

commit b2fe2f0
Author: Ethan Steinberg <[email protected]>
Date:   Sun Oct 23 01:25:03 2022 +0000

    Reset

commit 86e6b88
Author: Ethan Steinberg <[email protected]>
Date:   Sat Oct 22 23:58:44 2022 +0000

    Next

commit ee8010b
Author: Ethan Steinberg <[email protected]>
Date:   Sat Oct 22 21:11:09 2022 +0000

    Temp

commit a2caf9f
Author: Ethan Steinberg <[email protected]>
Date:   Sat Oct 22 04:25:54 2022 +0000

    Next

commit feeecc7
Author: Ethan Steinberg <[email protected]>
Date:   Sat Oct 22 03:06:48 2022 +0000

    First commit

    Add benchmark

    Temp

    Next

    What

    CSontinue

    Next

    Temp

    style: pre-commit fixes
@EthanSteinberg
Copy link
Collaborator Author

Looks like this is very close to v1 being ready. Passes all the tests on regular Python and is quite fast to boot.

The main issue is that I need to do a lot more space optimization

Using @jblespiau's test harness

Time:
python: 0.18168139830231667
nanobind: 0.1628
old pybind11: 0.3840818479657173
this PR: 0.22408869490027428

The problem is space
old pybind11: 1.5625
this PR: 6.1168060302734375

@rwgk
Copy link
Collaborator

rwgk commented Dec 11, 2022

old pybind11: 1.5625
this PR: 6.1168060302734375

What exactly is 4 times bigger?

Sorry I don't have a lot of experience with micro benchmarks and binary size comparisons.

@EthanSteinberg
Copy link
Collaborator Author

What exactly is 4 times bigger?

The .so files generated when the extensions are complied.

Fixing this should be quite easy. I just need to create non inlined helper functions in several places.

@rwgk
Copy link
Collaborator

rwgk commented Dec 11, 2022

What exactly is 4 times bigger?

The .so files generated when the extensions are complied.

Wow ... I don't think I've ever seen such an explosion of binary sizes.

I just need to create non inlined helper functions in several places.

Hm ... I did measurements back in Aug 2021:

https://docs.google.com/spreadsheets/d/1bkg1HtmU1AjotPTXCCLSbTBGARRet8W91noh1DnKOHU/edit#gid=0

noinline hardly made a difference, but maybe your situation is somehow special?

@EthanSteinberg
Copy link
Collaborator Author

EthanSteinberg commented Dec 11, 2022

noinline hardly made a difference, but maybe your situation is somehow special?

The situation here is special because of the templates involved. So pulling an even small amount of code out of the template (and forcing it out of the template with non-inline) can have huge savings because of the hundreds of copies.

Wow ... I don't think I've ever seen such an explosion of binary sizes.

Keep in mind that these microbenchmarks are sorta contrived. In order to make the sizes / compile times more significant the same code is copied 719 times.

@Skylion007
Copy link
Collaborator

noinline hardly made a difference, but maybe your situation is somehow special?

The situation here is special because of the templates involved. So pulling an even small amount of code out of the template (and forcing it out of the template with non-inline) can have huge savings because of the hundreds of copies.

Wow ... I don't think I've ever seen such an explosion of binary sizes.

Keep in mind that these microbenchmarks are sorta contrived. In order to make the sizes / compile times more significant the same code is copied 719 times.

This was also the case in the PR I linked here: #2013 they were eventually able to resolve it.

@henryiii
Copy link
Collaborator

Python 3.6 now dropped! Planning on dropping 3.7 immediately after the next release, too. That should make this much easier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants