Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: support free-threaded CPython with GIL disabled #5148

Merged
merged 10 commits into from
Jun 18, 2024

Conversation

colesbury
Copy link
Contributor

@colesbury colesbury commented May 29, 2024

Description

Accesses to the internals struct now goes through a with_internals([] { ... }); statement. In the free-threaded build, the internals mutex is locked for the duration of the access.

The registered_instances map has its own locking scheme because it may be frequently accesses by multiple thread. The map is shareded by address and a lock per shard is used to increase concurrency.

Other smaller changes:

  • Add gil_not_used() tag to indicate that a module supports running with the GIL disabled. See PyUnstable_Module_SetGIL.
  • Use PyMem_MALLOC for tp_doc in 3.13+
  • Add dict_getitemstringref to wrap new PyDict_GetItemStringRef API

Suggested changelog entry:

* Support free-threaded CPython (3.13t)
* Add ``py::mod_gil_not_used()`` tag to indicate if a module supports running with the GIL disabled.

@colesbury colesbury requested a review from henryiii as a code owner May 29, 2024 18:49
@henryiii henryiii requested review from rwgk and wjakob May 29, 2024 21:04
Some additional locking is added in the free-threaded build when
`Py_GIL_DISABLED` is defined:

- Most accesses to internals are protected by a single mutex
- The registered_instances uses a striped lock to improve concurrency

Pybind11 modules can indicate support for running with the GIL disabled
by calling `set_gil_not_used()`.
include/pybind11/detail/internals.h Outdated Show resolved Hide resolved
include/pybind11/detail/internals.h Outdated Show resolved Hide resolved
include/pybind11/detail/internals.h Show resolved Hide resolved
include/pybind11/detail/internals.h Outdated Show resolved Hide resolved
include/pybind11/pybind11.h Outdated Show resolved Hide resolved
include/pybind11/detail/internals.h Show resolved Hide resolved
include/pybind11/detail/internals.h Show resolved Hide resolved
@@ -542,6 +577,14 @@ PYBIND11_NOINLINE internals &get_internals() {
internals_ptr->static_property_type = make_static_property_type();
internals_ptr->default_metaclass = make_default_metaclass();
internals_ptr->instance_base = make_object_base_type(internals_ptr->default_metaclass);
#ifdef Py_GIL_DISABLED
size_t num_shards = (size_t) next_pow2(2 * std::thread::hardware_concurrency());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe

auto num_shards = static_cast<size_t>(...);

?

(The main motivation is to minimize C-style casts. Similarly in a few other places changed in this PR.)

Could it be worth adding a comment to explain why 2 * ...?

include/pybind11/detail/internals.h Show resolved Hide resolved
include/pybind11/detail/internals.h Show resolved Hide resolved
auto &internals = get_internals();

#ifdef Py_GIL_DISABLED
// Hash address to compute shard, but ignore low bits. We'd like allocations
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion to move this comment to the splitmix64 function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is specific to this section of code: it's mostly about why we are ignoring the low bits (addr >> 20) when hashing the address.

// other threads/cores to map to other shards. Using the high bits is a good
// heuristic because memory allocators often have a per-thread
// arena/superblock/segment from which smaller allocations are served.
auto addr = reinterpret_cast<uintptr_t>(ptr);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static_cast seems to work here (with Linux gcc at least). (I see we're already using reinterpret_cast in a bunch of similar situations, but maybe that isn't ideal?)

Would using std::uintptr_t and std::uint64_t be slightly better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static_cast here causes a compiler error for me: https://gcc.godbolt.org/z/bTnosGf61

I've updated the other casts and used the std namespace.

include/pybind11/detail/internals.h Show resolved Hide resolved
@@ -1199,6 +1206,8 @@ struct handle_type_name<cpp_function> {

PYBIND11_NAMESPACE_END(detail)

struct gil_not_used {};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about the alternative API here:
rwgk@a1c0f8e

?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how much simpler it is. I'm not sure about the mod; if we put anything, it should have the full module, we don't use mod elsewhere for Python, and it isn't clearly module. But it can't be misplaced, and we don't "namespace" anything else like it.

I'm not sure it would actually be any simpler if we added a second option later (like some sort of multiphase init option). But maybe it's better to simplify now rather than worry too much about a future (which would not be more complex by this style either, at least).

py::free_threaded() seems like it would match the naming of CPython better, though most of the CPython internals related to the feature seem to refer to the GIL.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how much simpler it is.

It's simpler only here, but my suggested alternative is simpler overall, because it avoids the overloads and the private implementation of the create_extension_module function. For the user it doesn't make a difference.

My suggestion includes the mod_ so that it comes out very similar to the existing name: Py_MOD_GIL_NOT_USED vs. py::mod_gil_not_used().

Another thought was that it's very easy to experiment, e.g. if

PYBIND11_MODULE(external_module, m, py::mod_gil_not_used())

has some issues that are difficult to resolve in the short term, it can be changed to

PYBIND11_MODULE(external_module, m, py::mod_gil_not_used(false))

and it'll be very obvious that free-threading support was attempted, and how to start trying again. It'll also be very easy to make the free-threading support platform-specific this way.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but my suggested alternative is simpler overall, because it avoids the overloads and the private implementation

That's what I meant by simpler.

Copy link
Collaborator

@henryiii henryiii Jun 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also called PyUnstable_Module_SetGIL, which does have the module in it. But I hadn't noticed how that mimics the slot value name. Personally still like the more readable, less cryptic names, but happy with whatever we go with.

And py::free_threaded(false) avoids the double negative. ;)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I meant by simpler.

Oh, got it. Thanks. (The way it appeared in the email message got me on the wrong track.)

And py::free_threaded(false) avoids the double negative.

That works for me, too. — Usually I strive to not introduce new names if there is one already, but the double negative bothered me, too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to have @colesbury have final say on what name he prefers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would lean towards matching the name from the Python C API because I think using the same names helps developers when they switch between the C API and pybind11.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks great to me, except the question about adopting the simplification under rwgk@a1c0f8e. @colesbury What's your opinion about that?

Copy link
Contributor Author

@colesbury colesbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @rwgk. I've added some comments and renamed a few functions.

include/pybind11/detail/internals.h Show resolved Hide resolved
include/pybind11/detail/internals.h Show resolved Hide resolved
auto &internals = get_internals();

#ifdef Py_GIL_DISABLED
// Hash address to compute shard, but ignore low bits. We'd like allocations
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is specific to this section of code: it's mostly about why we are ignoring the low bits (addr >> 20) when hashing the address.

// other threads/cores to map to other shards. Using the high bits is a good
// heuristic because memory allocators often have a per-thread
// arena/superblock/segment from which smaller allocations are served.
auto addr = reinterpret_cast<uintptr_t>(ptr);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static_cast here causes a compiler error for me: https://gcc.godbolt.org/z/bTnosGf61

I've updated the other casts and used the std namespace.

include/pybind11/detail/internals.h Show resolved Hide resolved
@@ -1199,6 +1206,8 @@ struct handle_type_name<cpp_function> {

PYBIND11_NAMESPACE_END(detail)

struct gil_not_used {};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would lean towards matching the name from the Python C API because I think using the same names helps developers when they switch between the C API and pybind11.

Copy link
Collaborator

@henryiii henryiii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@colesbury If you could let us know if you like @rwgk's variation (I don't mind applying it if you do), otherwise it looks good to me.

@henryiii
Copy link
Collaborator

Changelog in the description is a little out of date, too.

@colesbury
Copy link
Contributor Author

Yes, please apply it

Copy link
Collaborator

@rwgk rwgk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @colesbury for transporting pybind11 into the free-threading world :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants