-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Early stage prototype] Optimize calls using vectorcall, 35% reduction in call overhead #4250
Conversation
3.6 is already ~10 months past EOL (https://endoflife.date/python). This PR is a good motivation to drop it. You could simply remove the corresponding CI jobs in .github/workflows/ci.yml I only glanced through the code. At first sight it looks like a relatively small change, given what it does. If you want to drive this to completion, I'll look more, run sanitizers & global testing when you get the CI (minus 3.6) healthy here. Under the smart_holder branch I already introduced a ubench (micro benchmark) directory. Could you please use the same name here, too? Actually, I think it would make sense to copy over the code and strip out the very few lines for smart_holder, to be added back in under the smart_holder branch (I could do that part). You could also remove the python/number_bucket.clif file. |
@lalaland If you want to optimize function dispatch, you will want to take a look at this PR too: https://github.com/pybind/pybind11/pull/2013/files/0c8a13d27b7f36dbcaf91833252a5788c0ed429d. It's pretty stale, but it would be great if we could also somehow reduce the number of allocations by using std::array, std::bitset, etc as done in the above PR and get a further cumulative speedup. If that PR was up to date with master, I would gladly merge it along with this one. One of the reasons @wjakob created https://github.com/wjakob/nanobind/ was to have full supported vectorcall dispatch even if that required the newest versions of Python. I would definitely onboard with this PR even if we have to drop 3.6 support or if it causes minor binary size increase. We do need some benchmarks that show it's faster (which the linked PR also presented). |
Lalaland, to try shorten the commit log a bit for this PR, please do the following: pip install pre-commit
# cd to git project root
pre-commit install That will run pre-commit locally before every commit. |
Thanks for the feedback. I'll push this PR to completion. I highly recommend both of you turn off notifications for this PR as I'll probably have to spam the CI quite a bit to get this in a working shape on all platforms. @Skylion007 I purposely don't run pre-commit to trigger a CI run. I don't have CI permissions, but the precommit bot does so if I can trigger the precommit bot, I get a nice CI run. |
@lalaland, Ah, once your other PR is merged that shouldn't be an issue anymore. (You should get permission if you have a merged commit). |
@lalaland You should now have CI permissions. |
3.8 added vectorcall. That’s why nanobind requires 3.8+. Totally okay to drop 3.6 any time, but I think we’d need to drop 3.7 too to get vectorcall, which I'm not okay with yet. |
717595f
to
e5e84e5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of nits.
As I mentioned before, a lot of the slowdown in master is also from unnecessary memory allocations (std::vector, std::string, etc...) be wary of that as you build out this PR and take a look at the one I linked.
Also be wary of unused template args / such as we are a header only lib so we do want to be careful about slowdown/bloat from unnecessary template instationation. Of course it's fine if those template args are actually used (like for converting an std::vector into an std::array for n_args.
Also you only need the MSVC warning override around stuff that should be constexpr. Unfortunately, there is no easy way to override it in MSVC as the recommended solution is to use constexpr if etc, but that only works if you are > C++14 of course.
|
||
#else | ||
|
||
constexpr auto align = std::align; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
constexpr auto align = std::align; | |
using align = std::align; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to work? Are you sure this is the correct syntax?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sorry, yeah that's the wrong syntax.
constexpr auto align = std::align; | |
using std::align; |
See
pybind11/include/pybind11/detail/common.h
Line 637 in 09db644
using std::conditional_t; |
module = std::string(get_child(0).module); | ||
name = std::string(get_child(0).name); | ||
current_scope = get_child(0).current_scope; | ||
has_operator = get_child(0).has_operator; | ||
is_constructor = get_child(0).is_constructor; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_child(0) should be cached. It's weird to cache all these members from the child anyway. If you are going to do this instead of caching the child itself, then maybe have a setter that takes in a child object.
template <typename T> | ||
constexpr T constexpr_min(T a, T b) { | ||
return a > b ? b : a; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do a C++ version guard with this and do a using alias
std::min if C++ >= 14. Some of these details might better level in detail/common.h .
std::string name; | ||
std::string desc; | ||
object value; ///< Associated Python object | ||
bool convert : 1; ///< True if the argument is allowed to convert when loading | ||
bool none : 1; ///< True if None is allowed when loading | ||
|
||
argument_record() : name(""), desc(""), value(), convert(true), none(true) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::string name; | |
std::string desc; | |
object value; ///< Associated Python object | |
bool convert : 1; ///< True if the argument is allowed to convert when loading | |
bool none : 1; ///< True if None is allowed when loading | |
argument_record() : name(""), desc(""), value(), convert(true), none(true) {} | |
std::string name = ""; | |
std::string desc = ""; | |
object value(); ///< Associated Python object | |
bool convert : 1 = true; ///< True if the argument is allowed to convert when loading | |
bool none : 1 = true; ///< True if None is allowed when loading | |
argument_record() = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bitfields mean that this doesn't work.
warning: default member initializers for bit-fields only available with '-std=c++2a' or '-std=gnu++2a'
}; | ||
|
||
template <> | ||
struct process_attribute<buffer_protocol> : process_attribute_default<buffer_protocol> { | ||
static void init(const buffer_protocol &, type_record *r) { r->buffer_protocol = true; } | ||
template <typename RecordType> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, these are a lot of template types that are just ignored. Is there a more efficient way to do this that doesn't cause a lot of template initialization? Like making the 2nd arg a void pointer or something?
if (space < size + padding) | ||
return nullptr; | ||
space -= padding; | ||
return ptr = reinterpret_cast<void *>(aligned); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return ptr = reinterpret_cast<void *>(aligned); | |
return ptr = static_cast<void *>(aligned); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't support on centos7 for some reason.
61c7699
to
3d1d32e
Compare
(FYI, if it's useful, I also extracted and polished https://github.com/jblespiau/pybind11_benchmarks from the nanobind repository, in hope to bring within pybind11 a more extensive benchmarking suite.) |
commit 61c7699 Author: Ethan Steinberg <[email protected]> Date: Mon Nov 21 04:30:15 2022 +0000 test it commit 1e6b024 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 07:25:13 2022 +0000 Shoot, forgot to disable test commit e97060a Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 07:14:19 2022 +0000 Final run before I break out the VMs commit 1c0fb6d Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 06:46:54 2022 +0000 Fix a bunch commit 738fc7c Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 06:14:27 2022 +0000 Closer commit 3c3b476 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 05:54:25 2022 +0000 Silence commit 794bcf4 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 05:35:41 2022 +0000 Fix regex commit 759e6a6 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 05:26:33 2022 +0000 OOps commit e9d7f8a Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 05:20:31 2022 +0000 Next commit 0cef232 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 04:58:47 2022 +0000 Fix more warnings commit 1d20876 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 04:48:28 2022 +0000 Fix warnings commit 81a6f82 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 04:38:56 2022 +0000 Fix issues commit 6d44ec9 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 01:23:42 2022 +0000 Force it to work commit 359742c Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 01:18:09 2022 +0000 Fix scope issue commit 958441c Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 01:10:48 2022 +0000 Oops, more commit 42000d9 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 01:01:55 2022 +0000 Fix more warnings commit 5357d35 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 00:40:04 2022 +0000 Continue fixing ci commit e38960d Author: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Mon Oct 24 00:22:37 2022 +0000 style: pre-commit fixes commit e5e84e5 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 00:21:59 2022 +0000 Switch to other optional commit 553fd85 Author: Ethan Steinberg <[email protected]> Date: Mon Oct 24 00:03:48 2022 +0000 Temp commit 1ee063f Author: Ethan Steinberg <[email protected]> Date: Sun Oct 23 22:47:08 2022 +0000 Next commit c62200c Author: Ethan Steinberg <[email protected]> Date: Sun Oct 23 17:28:29 2022 +0000 Temp commit 7cd12aa Author: Ethan Steinberg <[email protected]> Date: Sun Oct 23 13:04:28 2022 +0000 Store commit e65c573 Author: Ethan Steinberg <[email protected]> Date: Sun Oct 23 12:27:28 2022 +0000 Temp commit b2fe2f0 Author: Ethan Steinberg <[email protected]> Date: Sun Oct 23 01:25:03 2022 +0000 Reset commit 86e6b88 Author: Ethan Steinberg <[email protected]> Date: Sat Oct 22 23:58:44 2022 +0000 Next commit ee8010b Author: Ethan Steinberg <[email protected]> Date: Sat Oct 22 21:11:09 2022 +0000 Temp commit a2caf9f Author: Ethan Steinberg <[email protected]> Date: Sat Oct 22 04:25:54 2022 +0000 Next commit feeecc7 Author: Ethan Steinberg <[email protected]> Date: Sat Oct 22 03:06:48 2022 +0000 First commit Add benchmark Temp Next What CSontinue Next Temp style: pre-commit fixes
d930755
to
de8202e
Compare
Looks like this is very close to v1 being ready. Passes all the tests on regular Python and is quite fast to boot. The main issue is that I need to do a lot more space optimization Using @jblespiau's test harness Time: The problem is space |
What exactly is 4 times bigger? Sorry I don't have a lot of experience with micro benchmarks and binary size comparisons. |
The .so files generated when the extensions are complied. Fixing this should be quite easy. I just need to create non inlined helper functions in several places. |
Wow ... I don't think I've ever seen such an explosion of binary sizes.
Hm ... I did measurements back in Aug 2021: https://docs.google.com/spreadsheets/d/1bkg1HtmU1AjotPTXCCLSbTBGARRet8W91noh1DnKOHU/edit#gid=0 noinline hardly made a difference, but maybe your situation is somehow special? |
The situation here is special because of the templates involved. So pulling an even small amount of code out of the template (and forcing it out of the template with non-inline) can have huge savings because of the hundreds of copies.
Keep in mind that these microbenchmarks are sorta contrived. In order to make the sizes / compile times more significant the same code is copied 719 times. |
This was also the case in the PR I linked here: #2013 they were eventually able to resolve it. |
Python 3.6 now dropped! Planning on dropping 3.7 immediately after the next release, too. That should make this much easier. |
This PR is an early stage prototype of using Python 3.7's vectorcall support to reduce function call overhead in pybind11.
This is very early stage, it doesn't support all of pybind11's features (kwargs don't quite work) and is in really rough shape code / documentation wise.
I am using a simple benchmark, included in this PR under the folder pybench, to measure performance.
Time to call a C++ function 100000000 times:
Raw CPython API: 5.2974734380841255 seconds
Pybind11 master Branch: 17.06675560772419 seconds
Pybind11 this PR: 11.004007823765278 seconds
I believe I can get this down even further (to around 8 seconds or so) by optimizing the loader_life_support.
The main downsides of this approach:
We are storing temporary data in function_record objects, so we can't support multiple threads calling the same function_record at a time. This shouldn't matter though as that's always the case due to the GIL
For older versions of Python (<= 3.6), I will have to implement a fallback that will have a tiny drop in performance due to an extra python tuple and python dict allocation and construction.
What are people's thoughts? Should I fully flush out this PR and add full support for the remaining pybind11 functionality? Do people think this is a useful optimization to add to pybind11?
CC @rwgk