Support for free-threaded Python #695
Conversation
(force-pushed from 7001e18 to 67f2752)
Dear @colesbury, following your work on the pybind11 free-threaded support, I was wondering if I could ask two quick questions about getting something similar into nanobind.
Thanks!
A few cents on the last question:
@wjakob, maybe you have already seen this page; otherwise, please check: https://py-free-threading.github.io/running-gil-disabled/. Here is a PR for numpy enabling free-threading tests: https://github.com/numpy/numpy/pull/26463/files. HTH!
Hi @wjakob,
One more thing: https://docs.python.org/3.13/howto/free-threading-extensions.html#memory-allocation-apis. One annoying thing is that … I think the other usages can use …
Thank you for the clarification @colesbury.
(force-pushed from 03cb139 to 93a39a7)
@colesbury: I have two more questions about functions that weren't mentioned in PEP 703.
(These pairs are commonly called in existing pybind11/nanobind extensions to mark boundaries between parallel C++ code and previously GIL-protected Python regions.)

And an unrelated question about PEP 703: I'm asking about critical sections because I would like to provide a builtin nanobind annotation to protect C++ instances from concurrent access (for unsafe/legacy code). Basically, I'm wondering about the right primitive to build on, and whether … Thanks!
Hi @wjakob -
I'll reply about |
The `Py_BEGIN_CRITICAL_SECTION` macros are used as follows:

```c
Py_BEGIN_CRITICAL_SECTION(obj);
obj->x++;          // first atomically
obj->y--;          // executed critical section
Py_DECREF(other);  /* may call arbitrary code via finalizer */
obj->z *= 2;       // second atomically executed critical section
Py_END_CRITICAL_SECTION();
```

In the above example: …
The implicit breakup into multiple smaller atomically executed critical sections occurs at the same places where the GIL would have been released -- …

If you are dealing mostly with, for example, the C++ STL, then plain mutexes are probably appropriate. If you're dealing mostly with the Python C API, then …
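For illustration, a minimal RAII sketch around CPython 3.13's public critical-section API (an assumption about how one might wrap it in C++; not code from this PR, whose wrappers may differ):

```cpp
#include <Python.h>

// Sketch only: wraps PyCriticalSection_Begin/End (CPython 3.13+).
// In default (GIL-protected) builds, these calls are no-ops.
struct critical_section_guard {
    PyCriticalSection cs;
    explicit critical_section_guard(PyObject *op) {
        PyCriticalSection_Begin(&cs, op);  // locks 'op' in free-threaded builds
    }
    ~critical_section_guard() { PyCriticalSection_End(&cs); }
    critical_section_guard(const critical_section_guard &) = delete;
    critical_section_guard &operator=(const critical_section_guard &) = delete;
};
```

As with the macros above, the lock may be temporarily suspended wherever the GIL would have been released, so invariants must be re-checked after any call that can run arbitrary Python code.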
Thank you for the clarification @colesbury. That makes a lot of sense. So the new critical sections are really how things have always been, with possible code execution through `Py_DECREF` and similar calls.

I think I'm done with all of the tweaks to nanobind and now wanted to test that it actually works as expected. This led me to a surprising result with v3.13rc1. I wrote a test that launches 8 threads that each construct 1M objects and destroy them afterwards. You can run it as follows with this PR:

```
$ python3.13 -m pytest tests/test_thread.py
```

The file contains just a single test:

```python
def test01_object_creation():
    from test_thread_ext import Counter

    def f():
        n = 1000000
        r = [None]*n
        for i in range(n):
            r[i] = Counter()
        del r

    parallelize(f, n_thread=8)
    gc.collect()
```

I would expect performance to stay mostly constant until the thread count exceeds the number of cores; instead, the test slows down dramatically with multiple threads. Interrupting the interpreter via Ctrl-C indicates that this bottleneck seems to be related to the construction of the `Counter` objects:

```
* frame #0: 0x0000000198dfff88 libsystem_kernel.dylib` swtch_pri + 8
  frame #1: 0x0000000198e3ce34 libsystem_pthread.dylib` cthread_yield + 32
  frame #2: 0x000000010026aaa0 python3.13` _PyMutex_LockTimed [inlined] _Py_yield + 208 at lock.c:46
  frame #3: 0x000000010026aa9c python3.13` _PyMutex_LockTimed(m=0x00000001005f8a88, timeout=-1, flags=_PY_LOCK_DETACH) + 204 at lock.c:88
  frame #4: 0x000000010026b4c0 python3.13` PyMutex_Lock(m=<unavailable>) + 8 at lock.c:582
  frame #5: 0x000000010022e980 python3.13` _PyCriticalSection_BeginSlow [inlined] _PyMutex_Lock(m=0x00000001005f8a88) + 80 at lock.h:49
  frame #6: 0x000000010022e96c python3.13` _PyCriticalSection_BeginSlow(c=0x0000000173e2aac8, m=0x00000001005f8a88) + 60 at critical_section.c:20
  frame #7: 0x000000010013fc5c python3.13` _PyType_LookupRef [inlined] _PyCriticalSection_BeginMutex(c=0x0000000173e2aac8, m=<unavailable>) + 524 at pycore_critical_section.h:119
  frame #8: 0x000000010013fc2c python3.13` _PyType_LookupRef(type=0x0000020000d70010, name=0x00000001005b96f8) + 476 at typeobject.c:5256
  frame #9: 0x0000000100152370 python3.13` lookup_maybe_method(self='0x2001f611b40', attr=<unavailable>, unbound=0x0000000173e2ab7c) + 32 at typeobject.c:2521
  frame #10: 0x000000010014d850 python3.13` slot_tp_init [inlined] lookup_method(self='0x2001f611b40', attr=<unavailable>, unbound=0x0000000173e2ab7c) + 28 at typeobject.c:2543
  frame #11: 0x000000010014d834 python3.13` slot_tp_init(self='0x2001f611b40', args=(), kwds='0x00000000') + 56 at typeobject.c:9773
  frame #12: 0x00000001001433b0 python3.13` type_call(self='0x20000d70010', args=(), kwds='0x00000000') + 464 at typeobject.c:1990
  frame #13: 0x0000000100096bcc python3.13` _PyObject_MakeTpCall(tstate=0x0000000101019210, callable='0x20000d70010', args='0x1009fc270', nargs=4305437280, keywords='0x00000000') + 356 at call.c:242
  frame #14: 0x0000000100096838 python3.13` _PyObject_VectorcallTstate(tstate=<unavailable>, callable=<unavailable>, args=<unavailable>, nargsf=<unavailable>, kwnames=<unavailable>) + 228 at pycore_call.h:166
  frame #15: 0x0000000100097704 python3.13` PyObject_Vectorcall(callable=<unavailable>, args=<unavailable>, nargsf=<unavailable>, kwnames=<unavailable>) + 48 at call.c:327
  frame #16: 0x00000001001f9b90 python3.13` _PyEval_EvalFrameDefault(tstate=<unavailable>, frame=<unavailable>, throwflag=0) + 13112 at generated_cases.c.h:813
  frame #17: 0x000000010009b1d4 python3.13` _PyObject_VectorcallTstate(tstate=0x0000000101019210, callable='0x20000756790', args='0x173e2aed8', nargsf=1, kwnames='0x00000000') + 116 at pycore_call.h:168
```

Looking at the CPython code, it seems to me that there is a caching scheme in place that should make it possible to avoid the critical section. However, for some reason, this isn't happening. Do you have any ideas what could be wrong? Thank you!
I think there is a combination of two performance issues: First, instances of nanobind's custom function type don't use deferred reference counting. The second problem is in the implementation of the MRO type cache in CPython. Basically, the fast-path lookup fails, so we fall back to the locking slow path, but we're also not enabling future lookups in the fast path to succeed. This case only happens with entries that don't use deferred reference counting (like these function objects).
Thank you for looking into this @colesbury! For context: nanobind uses a custom function object to implement an efficient function dispatch mechanism (i.e., faster and more flexible than the builtin C function wrappers). It is central to the design of the library, and it would be bad news if this design choice prevented multi-threaded speedups in free-threaded builds. Could you let me know if you see any way to force deferred reference counting for these function objects?
I tried doing the hack myself. Based on the CPython internals, I added the function

```cpp
void enable_deferred(PyObject *op) {
    assert(PyType_IS_GC(Py_TYPE(op)) && _Py_IsOwnedByCurrentThread(op) &&
           op->ob_ref_shared == 0);
    op->ob_gc_bits |= 64;             // set the deferred-refcount GC bit (hardcoded)
    op->ob_ref_local += 1;            // keep a local reference alive
    op->ob_ref_shared = _Py_REF_QUEUED;
}
```

(This function is called on a newly created … .) Calling this on the function objects seems to fix the issue with critical sections. However, I'm still not seeing multithreaded speedups. Examining the callstack shows that most of the time is spent dealing with shared refcount atomics. Does this mean that nanobind needs to bite the bullet and make immortal types and function objects? Is it correct to assume that this will break nanobind's leak tracking? For context: one feature of the library is that it complains loudly if any of its objects (classes, functions, instances) are still alive after Python has fully shut down and invokes the … .
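For illustration, a hypothetical usage sketch of the hack above (`make_function_object` is a stand-in name, not a real API); as the assertions require, the object must still be owned exclusively by the creating thread:

```cpp
// Hypothetical usage: enable deferred reference counting immediately after
// creation, before 'func' can become visible to any other thread.
PyObject *func = make_function_object(/* ... */);
enable_deferred(func);
```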
I tried the main branch (v3.14.0) as well, immortalizing the functions, and still observed significant performance regressions. Then I thought I'd try this in plain Python, and it turns out that the behavior there is similar. So perhaps it's better to understand the situation there. Here is a reproducer:

```python
import threading

def parallelize(func, n_threads):
    barrier = threading.Barrier(n_threads)

    def wrapper():
        barrier.wait()
        return func()

    workers = []
    for _ in range(n_threads):
        t = threading.Thread(target=wrapper)
        t.start()
        workers.append(t)

    for worker in workers:
        worker.join()

def f():
    def q():
        return 1
    for i in range(1000000):
        q()

parallelize(f, n_threads=8)
```

Performance on an M1 Max laptop (macOS) on AC power: …
When I move the function `q` out of `f`, so that a single callable is shared by all threads: …
Especially in the former case with a callable declared per thread, I would have expected the runtime cost to be relatively flat until 8 threads (matching the number of performance cores in this machine).

Backtrace: this is where time seems to be spent (for the former case with a local function):

```
* thread #2
* frame #0: 0x00000001000c4e78 python3.14t` _Py_DecRefShared [inlined] _Py_atomic_compare_exchange_ssize(obj=0x000003c380398970, expected=<unavailable>, desired=4611686018427387940) + 8 at pyatomic_gcc.h:130
   127
   128  static inline int
   129  _Py_atomic_compare_exchange_ssize(Py_ssize_t *obj, Py_ssize_t *expected, Py_ssize_t desired)
-> 130  { return __atomic_compare_exchange_n(obj, expected, desired, 0,
   131        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); }
   132
   133  static inline int
warning: python3.14t was compiled with optimization - stepping may behave oddly; variables may not be available.
(lldb) bt
* thread #2
* frame #0: 0x00000001000c4e78 python3.14t` _Py_DecRefShared [inlined] _Py_atomic_compare_exchange_ssize(obj=0x000003c380398970, expected=<unavailable>, desired=4611686018427387940) + 8 at pyatomic_gcc.h:130
  frame #1: 0x00000001000c4e70 python3.14t` _Py_DecRefShared [inlined] _Py_DecRefSharedDebug(o=0x000003c380398960, filename=0x0000000000000000, lineno=0) + 28 at object.c:353
  frame #2: 0x00000001000c4e54 python3.14t` _Py_DecRefShared(o=0x000003c380398960) + 20 at object.c:371
  frame #3: 0x00000001001afed0 python3.14t` _PyEval_FrameClearAndPop + 4 at ceval.c:1684
  frame #4: 0x00000001001afecc python3.14t` _PyEval_FrameClearAndPop(tstate=0x000000010081f200, frame=0x00000001006e0290) + 232 at ceval.c:1710
  frame #5: 0x00000001001ab9d4 python3.14t` _PyEval_EvalFrameDefault(tstate=<unavailable>, frame=0x00000001006e0218, throwflag=<unavailable>) + 58020 at generated_cases.c.h:6499
  frame #6: 0x000000010005e71c python3.14t` method_vectorcall [inlined] _PyObject_VectorcallTstate(tstate=0x000000010081f200, callable=0x000003c3805363c0, args=0x0000000171e12ee8, nargsf=1, kwnames=0x0000000000000000) + 44 at pycore_call.h:167
  frame #7: 0x000000010005e6f0 python3.14t` method_vectorcall(method=<unavailable>, args=0x0000000100492b90, nargsf=<unavailable>, kwnames=0x0000000000000000) + 284 at classobject.c:70
  frame #8: 0x000000010029a284 python3.14t` thread_run(boot_raw=0x0000600000c00e40) + 128
  frame #9: 0x000000010022f2bc python3.14t` pythread_wrapper(arg=<unavailable>) + 28
  frame #10: 0x0000000198e3ffa8 libsystem_pthread.dylib` _pthread_start + 148
```
Yes, deferred reference counting isn't fully implemented in 3.13 and is more of a placeholder, so functions and types need to be immortalized.
I'll take a look at this today or tomorrow.
Hi @wjakob - Here's a patch that demonstrates how to make functions and type objects immortal: https://gist.github.com/colesbury/99e15d904d84d239231f465a5aca6c36. If you are doing multi-threaded benchmarking, you won't want to use …

The other change was to move the … .
Thank you very much @colesbury, it fixes the performance regression. I am now seeing the following performance on the v3.14 (main) branch: …

The cell performance caveat you mention sounds tricky for users to figure out; that would indeed be good to fix.
Thank you for all your help so far @colesbury. I was wondering about one more thing: in your pybind11 implementation, you sharded the (very frequently accessed) instance map based on the top 44 bits of the pointer address. The rationale was that this region is related to the thread ID, and that way we can limit lock contention if every thread mostly works with its own objects. This sounded good to me, and I copied your approach in nanobind.

One potential issue is that nanobind extensions talk to each other through shared data structures, which means that such a sharding scheme effectively becomes part of the "ABI" (i.e., difficult to modify after the fact without breaking the ABI contract). The same applies to pybind11. Hence I want to be sure that this is a good long-term solution. Therefore, I was wondering where this came from, and what can be said about it? A few questions: …

Thank you!
I'm not sure that using the top 44 bits does much to help with lock contention compared to using all the bits. We are still applying a hash function (…) afterwards. To the extent it helps, it's because a given thread is likely to re-use the same shard for lookups (fewer cache misses). The improvement here is small compared to reducing lock contention by increasing the number of shards.

There's a good deal of flexibility in terms of how many low bits to discard -- I don't think it's particularly sensitive to the choice or to the allocator. The important things are just that: …

The first condition is more important. You definitely don't want to have multiple threads forced to use the same shard. But we have a good deal of slack here. Glibc's arenas are (mostly) per-thread and by default 64 MiB. Mimalloc's segments are per-thread and 32 MiB.

If you are concerned about the choice of shift, you can put it in the internals struct (and use that value instead of hardcoding the shift). I think that would give more flexibility about changing it without affecting the ABI. In other words, the creator of the … .

I think more likely paths for improvement are: …
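For illustration, a sketch of the shard-selection idea discussed above (the function name and constants are assumptions; the mixing step stands in for the hash function mentioned earlier, and nanobind's actual code may differ):

```cpp
#include <cstddef>
#include <cstdint>

// Map an instance pointer to a shard: discard low bits that vary within a
// thread's allocation region (so one thread tends to reuse one shard), then
// mix and mask into a power-of-two shard table.
inline size_t shard_index(void *ptr, unsigned shift, size_t n_shards) {
    uint64_t v = (uint64_t) (uintptr_t) ptr >> shift; // e.g., shift=20 keeps the top 44 bits
    v ^= v >> 33;                                     // fmix64-style finalizer (assumed)
    v *= 0xff51afd7ed558ccdULL;
    v ^= v >> 33;
    return (size_t) (v & (n_shards - 1));             // n_shards must be a power of two
}
```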
Hi @colesbury, thank you for the detailed explanation and for clarifying my misunderstanding. The PR is now almost done in the sense that all the internal data structures are protected for free-threading. I wanted to ask about your thoughts regarding one more thing:

Binding tools like nanobind and pybind11 are often used to bind existing code. However, that code is often not thread-safe and therefore unsafe to use in free-threaded Python. It's worse than pure Python code because of the possibility of memory-unsafe operations that crash the process. To ameliorate such issues, I would like to add a way for nanobind to lock the Python objects involved in a function call (without having to modify the function itself). Example: with this feature, the user could change the binding of a function with 3 named arguments into one guaranteeing that the … arguments are locked while the call runs (see the sketch following this message).

In general, this sounds like something that … At first, I thought this was simple: I'll just sort all of the objects by pointer address and then run … Do you think what I am trying to do is possible in some forward-compatible way? Thanks!
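To make the elided example above concrete, here is a hedged sketch based on the `arg::lock()` annotation discussed later in this thread (the function `f`, the module name, and the argument names are purely illustrative):

```cpp
#include <nanobind/nanobind.h>
namespace nb = nanobind;

int f(int a, int b, int c) { return a + b + c; }  // placeholder target function

NB_MODULE(my_ext, m) {
    // Unlocked binding: concurrent calls may race in non-thread-safe code
    m.def("f", &f, nb::arg("a"), nb::arg("b"), nb::arg("c"));

    // Locked binding: the objects passed as 'a' and 'b' are locked for the
    // duration of the call
    m.def("f_locked", &f,
          nb::arg("a").lock(), nb::arg("b").lock(), nb::arg("c"));
}
```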
Hi @wjakob - I don't think there's a way to do what you are proposing. You should avoid using …

If possible, I'd recommend limiting the implementation to locking one or two arguments so that you can use the critical section APIs. We do something similar in CPython with Argument Clinic: https://github.com/python/cpython/blob/782217f28f0d67916fc3ff82b03b88573686c0e7/Objects/setobject.c#L1998. In our cases: …
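For reference, a generic sketch of the critical-section macros this recommendation builds on (CPython 3.13+; not code from this PR):

```cpp
#include <Python.h>

void locked_op(PyObject *obj_a, PyObject *obj_b) {
    // Two-object variant: CPython acquires the two underlying locks in a
    // consistent internal order, so the caller need not sort by address.
    Py_BEGIN_CRITICAL_SECTION2(obj_a, obj_b);
    // ... operate on obj_a and obj_b while both are locked ...
    Py_END_CRITICAL_SECTION2();
}
```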
Adapting C++ code to handle parallelism under free-threaded Python can be tricky, especially when the original code is given as-is. This commit adds a tentative API to retrofit locking onto existing code by locking the arguments of function calls.
This commit documents free-threading in general and in the context of nanobind extensions.
Several parallel tests to check that locking works as expected
(force-pushed from fdaddeb to 5c5fdb1)
I've continued working on the free-threading PR and am now happy with it. Besides argument-level locking, the PR now also provides expanded documentation in … I will leave it up for a few days in case someone else wants to take a look (@oremanj, @colesbury, @vfdev-5?).

@colesbury, I followed the approach you suggested, which was to only allow <= 2 locks per call and leverage the Python critical section API.
(Sorry, the above sounded unnecessarily stressful :-) -- if anyone wants to review the change, of course you can have more than a few days. It is quite a big change.)
@wjakob thanks for pinging for a review of this PR!
The issue you referenced won't occur with nanobind, which uses a more fine-grained locking policy (as opposed to a global mutex for …).
```rst
To opt into free-threaded Python, pass the ``FREE_THREADED`` parameter to the
:cmake:command:`nanobind_add_module()` CMake target command. For other build
systems, refer to their respective documentation pages.

.. code-block:: cmake

   nanobind_add_module(
     my_ext          # Target name
     FREE_THREADED   # Opt into free-threading
     my_ext.h        # Source code files below
     my_ext.cpp)
```
I'm sorry, I'm not quite sure I understand the opt-in mechanism. Can we only opt into free-threading mode using CMake? For other build systems, do we need to explicitly define the NB_FREE_THREADED flag?
CMake is the official build system of this project. You need to opt in from there because this also changes how nanobind's add-on library is compiled. There are also separate Bazel and Meson build systems maintained by @nicholasjng and @WillAyd. The authors of these will need to update their code with a minor adaptation, but that is not part of this PR.

When free-threading is available and requested, the build system will automatically set NB_FREE_THREADED when compiling the C++ extension code. The user doesn't need to set this manually, but it could be useful in some cases.
fyi, I'm tracking this in nicholasjng/nanobind-bazel#37 for Bazel. Contributions welcome!
Meson >= 1.5.0 natively supports building free-threaded extensions from a free-threaded interpreter. In theory, the only thing we will have to do with that configuration is add the NB_FREE_THREADED define when that type of interpreter is detected, although if you wanted to experiment early, you can try updating your wrap configuration to point to head here and manually define the macro:

```
meson setup builddir -Dcpp_args="$CXXFLAGS -DNB_FREE_THREADED"
```

or, if using the pip front-end:

```
pip install <your_extension> -Csetup-args='-Dcpp_args="$CXXFLAGS -DNB_FREE_THREADED"'
```
I read through the PR and it looks good to me, although I don't have a lot of familiarity with nanobind.
Thank you @colesbury. Would you be open to adding an official API call for the immortalization step? I'm concerned that having this fragment (https://github.com/wjakob/nanobind/pull/695/files#diff-6de720f0a86306e2563f4f4606136d14fb439bad73da2fb7641029d0e37e6d95R1181) be a part of nanobind is not a good long-term solution.
@wjakob - yes, I would like it to be part of the C API. We will need to propose it to the C API Working Group: https://github.com/capi-workgroup/decisions/issues

I think we will want at least two functions: …
This is great!
```diff
@@ -184,6 +188,50 @@ char *strdup_check(const char *s) {
     return result;
 }

+#if defined(NB_FREE_THREADED)
+// Locked function wrapper for free-threaded Python
+struct locked_func {
```
I'm not sure how often locking is likely to be used in practice, but it seems like we don't strictly need to pay the cost of the additional indirection that's currently needed here because of type-erasure of the locked function. The alternative would look something like:

- add types `arg_locked` and `arg_v_locked` so that the information on whether an argument is [potentially] locked can be carried in the type rather than just the value; `arg::lock()` returns an `arg_locked`, etc.
- make `func_create` in `nb_func.h` notice when it receives some `arg_locked`, `arg_v_locked`, or `self_locked` arguments, and conditionally instantiate a `ft_object2_guard` in its `impl` lambda (the most natural choice would be to use `std::optional`, but a custom RAII guard would also work if you don't want to take that header dependency)

That would also let us statically assert that at most 2 arguments can be locked, rather than relying on a dynamic check; and I think it might allow removing `cast_flags::locked` as well, which would make it feasible to allow using the simple function dispatcher even with locked args (if the user passes only empty `nb::arg()` params with no changes to none/convert/name/default). Up to you whether you think it's worth the trouble.
> make `func_create` in `nb_func.h` notice when it receives some `arg_locked`, `arg_v_locked`, or `self_locked` arguments, and conditionally instantiate a `ft_object2_guard` in its `impl` lambda (the most natural choice would be to use `std::optional`, but a custom RAII guard would also work if you don't want to take that header dependency)

This sounds really intriguing. What role does `std::optional` play? How would you conditionally insert the `ft_object2_guard` with templating, and how would it know which arguments to fetch? I had not considered that this could be done in the compile-time part, and I don't see a reasonably simple solution.
I’m on vacation until Sunday but will try to prototype this when I’m back!
That would be amazing, thank you!
One thing I wanted to add: I fear the meta-templating needed to pull this off might push things above a complexity threshold that I would be willing to accept. Right now, the solution has a small runtime cost, but users only pay that cost when using the locking feature. It's not perfect, but an OK solution. Basically: don't spend a lot of time on it if you feel like the solution will turn into a template monstrosity.
Take a look at #720 which implements this. I think it came out pretty reasonable as far as complexity goes.
```cpp
type_data *nb_type_c2p(nb_internals *internals_,
                       const std::type_info *type) {
#if defined(NB_FREE_THREADED)
    thread_local nb_type_map_fast type_c2p_fast;
```
Other places use `static NB_THREAD_LOCAL` for thread-locals; should this one be normalized that way?

To check my understanding: this means each nanobind module and each thread will have its own copy of `type_c2p_fast`, right? I want to make sure there's no risk of inadvertent sharing between different nanobind modules (with potentially different ABI versions) that are running in the same thread. I think it's fine because this definition at local scope wouldn't have linkage, but sometimes dynamic linking semantics surprise me.
Indeed, the function-local declaration anonymizes this symbol. In free-threaded mode, this variable plays the role of a local TLS cache to accelerate lookups. Each module will have its own cache, potentially with separate entries depending on what types it looks up while dispatching function calls.

There is no inherent reason why this cache couldn't also be shared between all of the ABI-compatible extensions (still per-thread). Indeed, this is the role of the `internals` data structure, which enables such sharing while being cautious of ABI incompatibilities. In principle, it should be possible to place the TLS variable there. However, non-static `thread_local` members are not allowed in C++ class declarations, which is why I chose this alternative. If you have other ideas, I'm open to them.

Python exposes its own TLS wrappers that seem convenient and would enable putting TLS data into `internals` (see the `Py_tss_t *nb_static_property_disabled` member of `nb_internals` for an example). However, I did not see an easy way to guarantee that the data structure's destructor is called when the thread shuts down, which `thread_local` at least will do.
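For illustration, a sketch of the lookup pattern described above (the type and function names are stand-ins for nanobind's internals, not its actual code):

```cpp
#include <typeinfo>
#include <unordered_map>

struct type_data;                                // per-type record (stand-in)
type_data *lookup_slow(const std::type_info *);  // takes the global internals lock

type_data *lookup(const std::type_info *t) {
    // Function-local thread_local: one cache per thread *and* per module
    thread_local std::unordered_map<const std::type_info *, type_data *> cache;
    if (auto it = cache.find(t); it != cache.end())
        return it->second;          // fast path: no locking at all
    type_data *r = lookup_slow(t);  // slow path: globally locked map
    if (r)
        cache.emplace(t, r);        // remember the result for future lookups
    return r;
}
```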
I agree that the C++ thread-local seems to be the best choice here.
```cpp
#if defined(NB_FREE_THREADED)
    // In free-threaded mode, stale type information remains in the
    // 'type_c2p_fast' TLS. It is freed when the thread terminates.
```
Isn't this potentially a memory-safety hazard? If one drops all references to a nanobind type (such as by del'ing it from its module and not having any active instances or subclasses of it), it shouldn't be possible for later attempts to convert its C++ type to Python to rely on a now-dangling pointer. I think this could be fixed by moving the thread-local map into a separate function that both `nb_type_c2p` and `nb_type_unregister` can call.
The problem is that one thread can unregister a type, but many other threads could have that type in their caches. So you would need to delete it there as well, which requires synchronization. However, the point of this construction is to avoid synchronization.

The comment here is missing an important detail: in free-threaded mode, we immortalize type objects in `type_new()` and let them leak. This means that this function will only ever be called to delete the `type_data*` of C++ types that are subclassed in Python, and these don't have a unique entry in `type_c2p_fast`.
A related question is what to do if we stop immortalizing type objects at some point in the future (the immortalization is unfortunate because of the leaks). Is there a way of having fast unlocked lookups while still catching the case you mention?

The only thing I can think of is some kind of global version counter incremented in `nb_type_unregister`, which is also cached in the TLS. If the version differs, the cache must be cleared. This would work most of the time but is still prone to a race when one thread is unregistering a type while another is currently performing a lookup in the cache.
I agree this is safe now because we immortalize. It would be nice to mention that in the comment here.

As far as the potentially non-immortal future goes: one useful trick here is to increment the counter once before you remove the type from the c2p_slow map, and once again afterward. If a lookup finds the counter at an odd value, it must retry. I think this is safe as long as you use an atomic value for the counter, guard unregistration with the internals mutex so multiple threads can't unregister simultaneously, and use acquire ordering for the first increment and release for the second.
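A rough sketch of that counter trick (names are hypothetical; memory orderings are simplified to the defaults rather than the exact acquire/release split suggested above):

```cpp
#include <atomic>
#include <cstddef>

std::atomic<size_t> unregister_version{0};

void unregister_type_locked(/* caller holds the internals mutex */) {
    unregister_version.fetch_add(1);  // now odd: concurrent lookups must retry
    // ... remove the entry from the globally locked slow map ...
    unregister_version.fetch_add(1);  // even again: lookups may proceed
}

bool cached_lookup_valid(size_t v_before) {
    // A cached lookup is trustworthy only if the counter was even before the
    // lookup and is unchanged afterward (no unregistration raced with it).
    return (v_before & 1) == 0 && unregister_version.load() == v_before;
}
```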
I realized one more thing: the … So for now, the guidance is for the user to add a … .
Thank you very much for the thorough feedback @oremanj.
```cpp
{
    lock_internals guard(internals_);
    nb_ptr_map::iterator it;
    bool success;
    std::tie(it, success) = internals_->funcs.try_emplace(func, nullptr);
```
Since you no longer need the additional scope for the lock guard's lifetime, I think you could switch this back to the top-level `auto [it, success] = ...` form that it had before the FT changes.
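That is, the suggested form would be something like (C++17 structured bindings, reusing the names from the hunk above):

```cpp
lock_internals guard(internals_);
auto [it, success] = internals_->funcs.try_emplace(func, nullptr);
```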
The new C++20 …
This PR adapts nanobind to run on free-threaded Python builds. The changes involve:

- A CMake parameter `nanobind_add_module(... FREE_THREADED ...)` to opt into free-threaded behavior. It's a no-op in GIL-protected Python.
- Adapted internal data structures to protect them from concurrent modification:
  - Protected access to the high-bandwidth C++-to-Python hash table (`internals.inst_c2p`) and the keep-alive lists data structure (`internals.keep_alive`). This uses sharding based on pybind11 PR #5148 by @colesbury to reduce lock contention.
  - `internals.type_c2p_fast` turns into a fast thread-local cache around a slower, globally locked `internals.type_c2p_slow`. (These are hash tables that map from C++ to Python types, and they are accessed relatively often.)
  - Ordinary locking for the remaining fields that aren't on the critical path.

  The design avoids centralized locks as much as possible to improve scalability (for example, pybind11 locks the `internals` data structure several times per dispatched call, which is costly and seems somewhat counter to the design of the free-threaded Python implementation with its focus on local locks). A documentation string added to `src/nb_internals.h` provides a detailed explanation of the locking semantics of every individual field.
- Threading awareness for list accessors and sequence type casters.
- RAII wrappers for Python mutexes and critical sections.
- Documentation explaining free-threading in general, and in the context of nanobind extensions.
- Parallel tests that exercise various critical sections.

The plan is to merge this into the upcoming nanobind release v2.2.0. Three related points: …