
Support for free-threaded Python #695

Open · wjakob wants to merge 19 commits into master from the free-threaded branch

Conversation

wjakob
Owner

@wjakob wjakob commented Aug 21, 2024

This PR adapts nanobind to run on free-threaded Python builds. The changes involve:

  • A CMake parameter nanobind_add_module(... FREE_THREADED ...) to opt into free-threaded behavior. It's a no-op in GIL-protected Python.

  • Adapted internal data structures to protect them from concurrent modification.

    1. Protected access to the high-bandwidth CPP->Python hash table (internals.inst_c2p) and the keep-alive lists data structure (internals.keep_alive). This uses sharding based on pybind11 PR #5148 by @colesbury to reduce lock contention.

    2. internals.type_c2p_fast turns into a fast thread-local cache around a slower globally locked internals.type_c2p_slow. (These are hash tables that map from C++ to Python types, and they are accessed relatively often).

    3. Ordinary locking for the remaining fields that aren't on the critical path.

    The design avoids centralized locks as much as possible to improve scalability (for example, pybind11 locks to the internals data structure several times per dispatched call, which is costly and seems somewhat counter to the design of the free-threaded Python implementation with its focus on local locks). A documentation string added to src/nb_internals.h provides a detailed explanation of the locking semantics of every individual field.

  • Threading awareness for list accessors and sequence type casters.

  • RAII wrappers for Python mutexes and critical sections (see the sketch after this list).

  • Documentation explaining free-threading in general, and in the context of nanobind

  • Parallel tests that exercise various critical sections
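
For the RAII wrappers mentioned above, here is a minimal sketch (the names are assumptions, not the PR's actual wrappers) of guards built on the public PyMutex and critical-section APIs that CPython 3.13 exposes:

#include <Python.h>

struct mutex_guard {
    explicit mutex_guard(PyMutex &m) : m(m) { PyMutex_Lock(&m); }
    ~mutex_guard() { PyMutex_Unlock(&m); }
    mutex_guard(const mutex_guard &) = delete;
    PyMutex &m;
};

struct object_guard {
    explicit object_guard(PyObject *op) { PyCriticalSection_Begin(&cs, op); }
    ~object_guard() { PyCriticalSection_End(&cs); }
    object_guard(const object_guard &) = delete;
    PyCriticalSection cs;  // must live on the stack, like the Py_BEGIN_CRITICAL_SECTION macro
};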

The plan is to merge this into the upcoming nanobind release v2.2.0. Three related points:

  • Performance: although some internals were restructured, there is no impact on performance in non-free-threaded code. Optional features such as per-argument locking in free-threaded bindings are designed so that they don't add a runtime cost if not used.
  • API stability: no incompatibilities.
  • ABI stability: free-threading involves changes in internal data structures, requiring an ABI version bump. The upcoming version 2.2.0 already increments the ABI version.

@wjakob force-pushed the free-threaded branch 3 times, most recently from 7001e18 to 67f2752 on August 21, 2024 09:20
@wjakob
Owner Author

wjakob commented Aug 21, 2024

Dear @colesbury,

following your work on the pybind11 free-threaded support, I was wondering if I could ask a few quick questions about getting something similar into nanobind.

  1. Does the free-threaded version of Python provide a fast/portable way of getting a core and/or thread ID for sharding data structures? This is for situations where an existing pointer address isn't available that could be used to segregate a lookup.

  2. The failing test in this PR happens only in free-threaded builds, and I suspect that it is a regression in GC behavior. More details here: https://discuss.python.org/t/reference-leaks-in-free-threaded-3-13-version/61467/3. Do you have any ideas what might be going wrong?

  3. Is it necessary to set the PYTHON_GIL environment variable to 0 to get free-threaded behavior? Or is it also okay if the variable isn't set?

Thanks!

@vfdev-5

vfdev-5 commented Aug 21, 2024

A few cents on the last question:

Is it necessary to set the PYTHON_GIL environment variable to 0 to get free-threaded behavior? Or is it also okay if the variable isn't set.

@wjakob maybe you have already seen this page; otherwise, please check: https://py-free-threading.github.io/running-gil-disabled/

Here is a PR for numpy enabling free-threading tests: https://github.com/numpy/numpy/pull/26463/files

HTH

@colesbury

Hi @wjakob,

  1. There are a few possible APIs depending on the use case. PyThreadState_Get() returns a unique address per thread. _Py_ThreadId() is probably a bit faster, but it's not a public API. If you go that route, I'd suggest copy-pasting the implementation rather than calling _Py_ThreadId() directly, but that's up to you. If you need small consecutive integers, you are probably best off with a thread_local variable (see the sketch after this list).

  2. I replied to the forum post: https://discuss.python.org/t/reference-leaks-in-free-threaded-3-13-version/61467/7

  3. If every C extension calls PyUnstable_Module_SetGIL(m, Py_MOD_GIL_NOT_USED), then PYTHON_GIL=0 is not necessary. (The environment variable overrides what the extensions specify.) In addition to @vfdev-5's link, there are some more details in the free-threading howto guide. If some PyUnstable_Module_SetGIL calls are missing and PYTHON_GIL=0 is not set, then you'll see a warning about the GIL being enabled printed in the tests. I don't see that warning in the test output, so I don't think you need to set the environment variable.
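
Regarding point 1, a sketch of the thread_local approach (a hypothetical helper, not an existing CPython or nanobind API) that hands out small consecutive per-thread integers suitable as a sharding key:

#include <atomic>
#include <cstddef>

inline size_t this_thread_index() {
    static std::atomic<size_t> next_index{0};
    // assigned once per thread on first use, then reused for the thread's lifetime
    thread_local size_t index = next_index.fetch_add(1, std::memory_order_relaxed);
    return index;
}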

@colesbury

One more thing: PyObject_Malloc() should now only be used for allocating PyObject* (and not for PyObject** or other structs). I think some or all of the current PyObject_Malloc() usages should be PyMem_Malloc() (and PyObject_Free() -> PyMem_Free()).

https://docs.python.org/3.13/howto/free-threading-extensions.html#memory-allocation-apis

One annoying thing is that tp_doc needs to be PyObject_Malloc'd for 3.12, but PyMem_Malloc'd for 3.13+.

I think the other usages can use PyMem_Malloc unconditionally.
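
A sketch of the version-dependent tp_doc handling described above (an illustrative helper, not nanobind's actual code):

#include <Python.h>
#include <string.h>

static char *alloc_tp_doc(const char *doc) {
    size_t size = strlen(doc) + 1;
#if PY_VERSION_HEX >= 0x030D0000
    char *copy = (char *) PyMem_Malloc(size);     // 3.13+: tp_doc is released via PyMem_Free
#else
    char *copy = (char *) PyObject_Malloc(size);  // <= 3.12: tp_doc is released via PyObject_Free
#endif
    if (copy)
        memcpy(copy, doc, size);
    return copy;
}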

@wjakob
Owner Author

wjakob commented Aug 22, 2024

Thank you for the clarification @colesbury.

  1. nanobind uses the PyType_FromMetaclass API on v3.12+ and doesn't allocate tp_doc, so that should be okay.

  2. Do you know if there is any difference between PyMem_Malloc and PyObject_Malloc for the other uses pre-v3.13? (If there is no difference, I would switch all of them over instead of making a version-dependent change).

    In general, the reason for using the Python allocation API for non-Python objects was to leverage its slab allocator, which has generally been faster than a C-style malloc() or C++ new. That would be nice to preserve when making a change here.

  3. The issue with the leak seems more complex, I responded on the Python forum.

@wjakob force-pushed the free-threaded branch 2 times, most recently from 03cb139 to 93a39a7 on August 22, 2024 07:24
@wjakob
Owner Author

wjakob commented Aug 22, 2024

@colesbury: I have two more questions about functions that weren't mentioned in PEP 703.

  • are PyGILState_Ensure() and PyGILState_Release() no-ops in free-threaded builds? Although the GIL is gone, it seems to me that these at least still compile to function calls when looking at the CPython headers, so there is presumably some cost even if their implementations don't do anything. Or do they do something? Should I be careful to #ifdef calls to these functions when building free-threaded extensions?

  • same question, but about the opposite direction: what do PyEval_SaveThread() and PyEval_RestoreThread() do in free-threaded builds?

(These pairs are commonly called in existing pybind11/nanobind extensions to mark boundaries between parallel C++ code and previously GIL-protected Python regions.)

And an unrelated question about PEP 703: Py_BEGIN_CRITICAL_SECTION can release prior locks held by the thread to avoid deadlocks. I'm wondering about the semantics of a critical section in this case. A traditional critical section executes atomically, but this approach seems inherently non-atomic. What is the purpose of such a relaxed critical section, and what assumptions can still be made about it?

I'm asking about critical sections because I would like to provide a builtin nanobind annotation to protect C++ instances from concurrent access (for unsafe/legacy code). Basically I'm wondering about the right primitive to build on, and whether Py_BEGIN_CRITICAL_SECTION provides enough guarantees.

Thanks!

@colesbury

Hi @wjakob -

  • There's not a substantial difference between PyMem_Malloc and PyObject_Malloc pre-3.13. They have the same performance characteristics so you can switch all of them over instead of making a version-dependent change.
  • PyGILState_Ensure and related APIs, including PyEval_SaveThread, are still required in the same places. They're not no-ops even when they no longer acquire/release a global interpreter lock. At a high level, you can think of them as managing thread states (i.e., PyThreadState). Threads are still required to have a valid, active thread state when calling most Python C APIs. More concretely, one of the major purposes is to synchronize with the garbage collector, which can't run while threads are mutating Python objects.
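
A brief sketch of the patterns that therefore remain unchanged in free-threaded extensions (the function names other than the C API calls are placeholders):

#include <Python.h>

void long_running_cpp_work();  // placeholder for native work that doesn't touch Python

void worker_with_thread_state() {
    PyThreadState *ts = PyEval_SaveThread(); // detach the thread state; GC and other threads may proceed
    long_running_cpp_work();                 // must not touch Python objects in here
    PyEval_RestoreThread(ts);                // re-attach before using the C API again
}

void callback_from_native_thread(PyObject *callable) {
    PyGILState_STATE st = PyGILState_Ensure(); // ensure a valid, active thread state
    PyObject *res = PyObject_CallNoArgs(callable);
    Py_XDECREF(res);
    PyGILState_Release(st);
}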

I'll reply about Py_BEGIN_CRITICAL_SECTION in a follow up comment.

@colesbury

The Py_BEGIN/END_CRITICAL_SECTION() pairs may (implicitly) define multiple atomically executed critical sections. For example:

Py_BEGIN_CRITICAL_SECTION(obj);
obj->x++;         // \ first atomically executed
obj->y--;         // /  critical section
Py_DECREF(other); // may call arbitrary code via finalizer
obj->z *= 2;      // second atomically executed critical section
Py_END_CRITICAL_SECTION();

In the above example:

  • Py_BEGIN_CRITICAL_SECTION(obj) acquires the per-object lock on obj.
  • obj->x++ and obj->y-- together execute atomically like in a traditional critical section
  • Py_DECREF(other) may internally temporarily release the per-object lock because finalizers can usually call arbitrary code. This can happen if it calls PyEval_SaveThread() or begins a new internal critical section (which might block). Importantly, when Py_DECREF returns, the thread is guaranteed to hold the lock for obj. It's just not guaranteed to have held it continuously.
  • obj->z *= 2 then executes atomically
  • Py_END_CRITICAL_SECTION() releases the lock on obj.

The implicit breakup into multiple smaller atomically executed critical sections occurs at the same places where the GIL would have been released -- Py_DECREF() in this case.

Py_BEGIN_CRITICAL_SECTION is less straightforward than a plain mutex, but often easier to use when dealing with the Python C API because:

  1. Its semantics are similar to the GIL regarding the places where it may be implicitly, temporarily released.
  2. Many Python C API calls, including the ubiquitous Py_DECREF(), may call arbitrary code, so using plain mutexes often requires substantial refactoring or risks deadlocks.

If you are dealing mostly with, for example, the C++ STL then plain mutexes are probably appropriate. If you're dealing mostly with the Python C API, then Py_BEGIN_CRITICAL_SECTION is probably easier to use.
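
For instance, the refactoring/deadlock risk with a plain mutex shows up as soon as a Python C API call sits inside the locked region (an illustrative sketch, not code from this PR):

#include <Python.h>
#include <mutex>

std::mutex m;

void risky(PyObject *other) {
    std::lock_guard<std::mutex> guard(m);
    // Py_DECREF may run arbitrary finalizer code. If that code (directly, or via another
    // thread it waits on) tries to acquire 'm' again, the program deadlocks, because a
    // plain mutex is never released implicitly the way the GIL or a critical section is.
    Py_DECREF(other);
}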

@wjakob
Owner Author

wjakob commented Aug 23, 2024

Thank you for the clarification @colesbury. That makes a lot of sense. So the new critical sections really match how things have always been: code could already run through Py_DECREF(), and the thread could even be context-switched to another thread while holding the GIL.

I think I'm done with all of the tweaks to nanobind and now wanted to test that it actually works as expected. This led me to a surprising result with v3.13rc1. I wrote a test that launches 8 threads that each construct 1M objects and destroy them afterwards.

You can run it as follows with this PR

$ python3.13 -m pytest  tests/test_thread.py

The file contains just a single test:

import gc

def test01_object_creation():
    from test_thread_ext import Counter

    def f():
        n = 1000000
        r = [None]*n
        for i in range(n):
            r[i] = Counter()
        del r

    parallelize(f, n_thread=8)  # test helper that runs f() concurrently on 8 threads
    gc.collect()

I would expect performance to stay mostly constant until the n_threads parameter exceeds the number of physical cores of the machine. Instead, performance seems to be roughly linear in n_threads, which means that there is a bottleneck somewhere.

Interrupting the interpreter via Ctrl-C indicates that this bottleneck seems to be related to the construction of the Counter type in the example. However, the bottleneck occurs inside Python, and before it is forwarded to nanobind. Here is a backtrace:

 * frame #0: 0x0000000198dfff88 libsystem_kernel.dylib` swtch_pri + 8
   frame #1: 0x0000000198e3ce34 libsystem_pthread.dylib` cthread_yield + 32
   frame #2: 0x000000010026aaa0 python3.13` _PyMutex_LockTimed [inlined] _Py_yield + 208 at lock.c:46
   frame #3: 0x000000010026aa9c python3.13` _PyMutex_LockTimed(m=0x00000001005f8a88, timeout=-1, flags=_PY_LOCK_DETACH) + 204 at lock.c:88
   frame #4: 0x000000010026b4c0 python3.13` PyMutex_Lock(m=<unavailable>) + 8 at lock.c:582
   frame #5: 0x000000010022e980 python3.13` _PyCriticalSection_BeginSlow [inlined] _PyMutex_Lock(m=0x00000001005f8a88) + 80 at lock.h:49
   frame #6: 0x000000010022e96c python3.13` _PyCriticalSection_BeginSlow(c=0x0000000173e2aac8, m=0x00000001005f8a88) + 60 at critical_section.c:20
   frame #7: 0x000000010013fc5c python3.13` _PyType_LookupRef [inlined] _PyCriticalSection_BeginMutex(c=0x0000000173e2aac8, m=<unavailable>) + 524 at pycore_critical_section.h:119
   frame #8: 0x000000010013fc2c python3.13` _PyType_LookupRef(type=0x0000020000d70010, name=0x00000001005b96f8) + 476 at typeobject.c:5256
   frame #9: 0x0000000100152370 python3.13` lookup_maybe_method(self='0x2001f611b40', attr=<unavailable>, unbound=0x0000000173e2ab7c) + 32 at typeobject.c:2521
   frame #10: 0x000000010014d850 python3.13` slot_tp_init [inlined] lookup_method(self='0x2001f611b40', attr=<unavailable>, unbound=0x0000000173e2ab7c) + 28 at typeobject.c:2543
   frame #11: 0x000000010014d834 python3.13` slot_tp_init(self='0x2001f611b40', args=(), kwds='0x00000000') + 56 at typeobject.c:9773
   frame #12: 0x00000001001433b0 python3.13` type_call(self='0x20000d70010', args=(), kwds='0x00000000') + 464 at typeobject.c:1990
   frame #13: 0x0000000100096bcc python3.13` _PyObject_MakeTpCall(tstate=0x0000000101019210, callable='0x20000d70010', args='0x1009fc270', nargs=4305437280, keywords='0x00000000') + 356 at call.c:242
   frame #14: 0x0000000100096838 python3.13` _PyObject_VectorcallTstate(tstate=<unavailable>, callable=<unavailable>, args=<unavailable>, nargsf=<unavailable>, kwnames=<unavailable>) + 228 at pycore_call.h:166
   frame #15: 0x0000000100097704 python3.13` PyObject_Vectorcall(callable=<unavailable>, args=<unavailable>, nargsf=<unavailable>, kwnames=<unavailable>) + 48 at call.c:327
   frame #16: 0x00000001001f9b90 python3.13` _PyEval_EvalFrameDefault(tstate=<unavailable>, frame=<unavailable>, throwflag=0) + 13112 at generated_cases.c.h:813
   frame #17: 0x000000010009b1d4 python3.13` _PyObject_VectorcallTstate(tstate=0x0000000101019210, callable='0x20000756790', args='0x173e2aed8', nargsf=1, kwnames='0x00000000') + 116 at pycore_call.h:168

Looking at the CPython code, it seems to me that there is a caching scheme in place that should make it possible to avoid the critical section. However, for some reason, this isn't happening.

Do you have any ideas what could be wrong?

Thank you!

@colesbury

colesbury commented Aug 23, 2024

I think there is a combination of two performance issues:

First, instances of nanobind.nb_method (i.e., Counter.__init__) don't use deferred reference counting (nor are they immortal). So reference counting on those methods from multiple threads is going to be a bottleneck and inhibit scaling. Unfortunately, we don't have a public API currently to mark objects as using deferred reference counting. I think we can probably figure out a workaround, but it'd be good to get a public API (likely with the PyUnstable_ prefix) in 3.14.

The second problem is in the implementation of the MRO type cache in CPython. Basically, the fast-path lookup fails so we fall back to the locking slow path, but we're also not enabling future lookups in the fast-path to succeed. This case only happens with entries that don't use deferred reference counting (like nanobind.nb_method), which is why we missed it, but it makes that case worse: instead of having a contended atomic operation, we have a contended lock, which is even slower.

@wjakob
Owner Author

wjakob commented Aug 23, 2024

Thank you for looking into this @colesbury! For context: nanobind uses a custom function object to implement an efficient function dispatch mechanism (i.e., faster and more flexible than the builtin C function wrappers). It is central to the design of the library, so it would be bad news if this design choice prevented multi-threaded speedups in free-threaded builds.

Could you let me know if you see any way to force nb_method / nb_func to adopt deferred reference counting, even if hacky and only usable in a particular Python version? Rather than waiting for a nice API in v3.14, I think it would be good to introduce an unstable hack for v3.13. The latter is what I expect people to play around with when attempting to adapt their extension projects for free-threading.

@wjakob
Owner Author

wjakob commented Aug 26, 2024

I tried doing the hack myself. Based on the CPython internals, I added the function

void enable_deferred(PyObject *op) {
    assert(PyType_IS_GC(Py_TYPE(op)) && _Py_IsOwnedByCurrentThread(op) &&
           op->ob_ref_shared == 0);
    op->ob_gc_bits |= 64;               // set the deferred-refcounting GC bit (3.13 internals)
    op->ob_ref_local += 1;              // bias the thread-local refcount
    op->ob_ref_shared = _Py_REF_QUEUED; // mark the shared refcount as queued
}

(This function is called on a newly created nb_func object that is local to the current thread)

Calling this on the function objects seems to fix the issue with critical sections. However, I'm still not seeing multithreaded speedups. Examining the callstack shows that most of the time is spent dealing with shared refcount atomics.

Does this mean that nanobind needs to bite the bullet and make immortal types and function objects? Is it correct to assume that this will break nanobind's leak tracking? For context: one feature of the library is that it complains loudly if any of its objects (classes, functions, instances) are still alive after Python has fully shut down and invoked the Py_AtExit() callbacks.

@wjakob
Owner Author

wjakob commented Aug 26, 2024

I tried the main branch (v3.14.0) as well, immortalizing the functions, and still observed significant performance regressions.

Then I thought I'd try this in plain Python, and it turns out that the behavior there is similar. So perhaps it's better to understand the situation there.

Here is a reproducer:

import threading

def parallelize(func, n_threads):
    barrier = threading.Barrier(n_threads)

    def wrapper():
        barrier.wait()
        return func()

    workers = []
    for _ in range(n_threads):
        t = threading.Thread(target=wrapper)
        t.start()
        workers.append(t)

    for worker in workers:
        worker.join()

def f():
    def q():
        return 1
    for i in range(1000000):
        q()

parallelize(f, n_threads=8)

Performance on an M1 Max laptop (macOS) on AC power

n_threads   Runtime
1           49 ms
2           202 ms
4           610 ms
8           2310 ms

When I move the q function to the top level, the performance is worse.

n_threads   Runtime
1           63 ms
2           246 ms
4           1402 ms
8           5601 ms

Especially in the former case with a callable declared per thread, I would have expected the runtime cost to be relatively flat until 8 threads (matching the number of performance cores in this machine).

Backtrace

This is where time seems to be spent (for the former case with a local function):

* thread #2
    frame #0: 0x00000001000c4e78 python3.14t` _Py_DecRefShared [inlined] _Py_atomic_compare_exchange_ssize(obj=0x000003c380398970, expected=<unavailable>, desired=4611686018427387940)  + 8 at pyatomic_gcc.h:130
   127
   128  static inline int
   129  _Py_atomic_compare_exchange_ssize(Py_ssize_t *obj, Py_ssize_t *expected, Py_ssize_t desired)
-> 130  { return __atomic_compare_exchange_n(obj, expected, desired, 0,
   131                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); }
   132
   133  static inline int
warning: python3.14t was compiled with optimization - stepping may behave oddly; variables may not be available.
(lldb) bt
* thread #2
  * frame #0: 0x00000001000c4e78 python3.14t` _Py_DecRefShared [inlined] _Py_atomic_compare_exchange_ssize(obj=0x000003c380398970, expected=<unavailable>, desired=4611686018427387940)  + 8 at pyatomic_gcc.h:130
    frame #1: 0x00000001000c4e70 python3.14t` _Py_DecRefShared [inlined] _Py_DecRefSharedDebug(o=0x000003c380398960, filename=0x0000000000000000, lineno=0)  + 28 at object.c:353
    frame #2: 0x00000001000c4e54 python3.14t` _Py_DecRefShared(o=0x000003c380398960)  + 20 at object.c:371
    frame #3: 0x00000001001afed0 python3.14t` _PyEval_FrameClearAndPop  + 4 at ceval.c:1684
    frame #4: 0x00000001001afecc python3.14t` _PyEval_FrameClearAndPop(tstate=0x000000010081f200, frame=0x00000001006e0290)+ 232 at ceval.c:1710
    frame #5: 0x00000001001ab9d4 python3.14t` _PyEval_EvalFrameDefault(tstate=<unavailable>, frame=0x00000001006e0218, throwflag=<unavailable>)  + 58020 at generated_cases.c.h:6499
    frame #6: 0x000000010005e71c python3.14t` method_vectorcall [inlined] _PyObject_VectorcallTstate(tstate=0x000000010081f200, callable=0x000003c3805363c0, args=0x0000000171e12ee8, nargsf=1, kwnames=0x0000000000000000)  + 44 at pycore_call.h:167
    frame #7: 0x000000010005e6f0 python3.14t` method_vectorcall(method=<unavailable>, args=0x0000000100492b90, nargsf=<unavailable>, kwnames=0x0000000000000000)  + 284 at classobject.c:70
    frame #8: 0x000000010029a284 python3.14t` thread_run(boot_raw=0x0000600000c00e40)  + 128
    frame #9: 0x000000010022f2bc python3.14t` pythread_wrapper(arg=<unavailable>)  + 28
    frame #10: 0x0000000198e3ffa8 libsystem_pthread.dylib` _pthread_start  + 148

@colesbury

colesbury commented Aug 26, 2024

Does this mean that nanobind needs to bite the bullet and make immortal types and function objects?

Yes, deferred reference counting isn't fully implemented in 3.13 and is more of a placeholder, so functions and types need to be immortalized.

I tried the main branch (v3.14.0) as well immortalizing the functions...

I'll take a look at this today or tomorrow.

@colesbury

Hi @wjakob -

Here's a patch that demonstrates how to make functions and type objects immortal:

https://gist.github.com/colesbury/99e15d904d84d239231f465a5aca6c36

If you are doing multi-threaded benchmarking, you won't want to use suppress_immortalization. The tradeoff in 3.13 is that you either:

  1. immortalize objects (with the associated leaks on exit), and can scale well to multiple threads/cores
  2. suppress immortalization (avoid leaks), but can't scale to multiple threads/cores

The other change was to move the from test_thread_ext import Counter outside of the function body of test01_object_creation. Closed over local variables are implemented using cell objects. Reading from a cell object currently takes a lock on the cell object instance. We should probably fix this for 3.14.

@wjakob
Owner Author

wjakob commented Aug 27, 2024

Thank you very much @colesbury, it fixes the performance regression. I am now seeing the following performance on the v3.14 (main) branch:

n_threads   Runtime
1           172 ms
2           223 ms
4           335 ms
8           503 ms

The cell performance caveat you mention sounds tricky for users to figure out; that would indeed be good to fix.

@wjakob
Owner Author

wjakob commented Aug 27, 2024

Thank you for all your help so far @colesbury. I was wondering about one more thing:

In your pybind11 implementation, you sharded the (very frequently accessed) instance map based on the top 44 bits of the pointer address. The rationale was that this region is related to the thread ID, and that way we can limit lock contention if every thread mostly works with its own objects. This sounded good to me, and I copied your approach in nanobind.

One potential issue is that nanobind extensions talk to each other through shared data structures, which means that such a sharding scheme effectively becomes part of the "ABI" (i.e., difficult to modify after the fact without breaking the ABI contract). The same is true in pybind11. Hence I want to be sure that this is a good long-term solution.

Therefore, I was wondering where this scheme came from, and what can be said about it. A few questions:

  1. Discarding the bottom 20 bits splits memory into 1 MB-sized chunks, which isn't very much. So it seems it can't directly map to the thread ID.

  2. There are actually two allocators: the system malloc() and mimalloc. Do both of them allocate spatially segregated heaps for each thread? (In nanobind, pointers used for sharding can either be from a memory region allocated via PyObject_Malloc, or they can be on the C++ heap created via operator new/malloc().)

  3. Do you know if this behavior is consistent across platforms?

Thank you!

@colesbury
Copy link

I'm not sure that using the top 44 bits does much to help with lock contention compared to using all the bits. We are still applying a hash function (fmix64) and indexing a much smaller table. So I'd expect each thread to more-or-less pseudo-randomly choose a shard in the table with the contention depending on the number of shards and the number of threads concurrently trying to access it.

To the extent it helps, it's because a given thread is likely to re-use the same shard for lookups (fewer cache misses). The improvement here is small compared to reducing lock contention by increasing the number of shards.
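
For reference, a rough sketch of the shard selection being discussed (the shard count and names are assumptions; fmix64 is the MurmurHash3 finalizer mentioned above):

#include <cstdint>
#include <cstddef>

struct shard { /* mutex + hash table for one slice of inst_c2p */ };

static constexpr size_t NUM_SHARDS = 16;   // assumed power of two
static shard shards[NUM_SHARDS];

static inline uint64_t fmix64(uint64_t k) {
    k ^= k >> 33; k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33; k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

static inline shard &shard_for(const void *ptr) {
    uint64_t key = (uint64_t) (uintptr_t) ptr >> 20; // discard the bottom 20 bits (~1 MB granularity)
    return shards[fmix64(key) & (NUM_SHARDS - 1)];
}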

There's a good deal of flexibility in terms of how many low bits to discard -- I don't think it's particularly sensitive to the choice or to the allocator. The important things are just that:

  • You use enough bits that different threads are likely to have different keys (i.e., it'd be bad if everyone used shard 0!)
  • But not too many bits -- otherwise consecutive lookups from the same thread are likely to use random shards.

The first condition is more important. You definitely don't want to have multiple threads forced to use the same shard. But we have a good deal of slack here. Glibc's arenas are (mostly) per-thread and by default 64 MiB. Mimalloc's segments are per-thread and 32 MiB.

If you are concerned about the choice of shift, you can put it in the internals struct (and use that value instead of hardcoding the shift). I think that would give more flexibility about changing it without affecting the ABI. In other words, the creator of the nb_internals struct determines the shift amount. I don't think it's particularly important though.

I think more likely paths for improvement are:

  1. Increasing the number of shards, but that's harder to justify without more real world data.
  2. Storing the C++ struct -> PyObject* mapping on the C++ struct itself when possible.

@wjakob
Owner Author

wjakob commented Aug 30, 2024

Hi @colesbury,

thank you for the detailed explanation and for clarifying my misunderstanding. The PR is now almost done in the sense that all the internal data structures are protected for free-threading.

I wanted to ask about your thoughts regarding one more thing:

Binding tools like nanobind and pybind11 are often used to bind existing code. However, that code might often not be thread-safe and therefore unsafe to use in free-threaded Python. It's worse than pure Python code because of the possibility of memory-unsafe operations that crash the process.

To ameliorate such issues, I would like to add a way for nanobind to lock Python objects involved in a function call (without having to modify the function itself).

Example: with this feature, the user could change a binding of a function with 3 named arguments

m.def("func", &func, nb::arg("argument_1"), nb::arg("argument_2"), nb::arg("argument_3"));

into one guaranteeing that the PyMutex of the first and third argument is held while the function runs.

m.def("func", &func, nb::arg("argument_1").lock(), nb::arg("argument_2"), nb::arg("argument_3").lock());

In general, this sounds like something that Py_BEGIN_CRITICAL_SECTION() and Py_BEGIN_CRITICAL_SECTION2() are made for. However, the interface I am describing would in principle allow taking out more than 2 locks.

At first, I thought this is simple: I'll just sort all of the objects by pointer address and then run PyMutex_Lock() on the ob_mutex of their PyObject *. But then I realized that ob_mutex is likely not part of the public API, and that it will be locked away if there is a free-threaded limited API at some point in the future.

Do you think what I am trying to do is possible in some forward-compatible way?

Thanks!

@colesbury

Hi @wjakob - I don't think there's a way to do what you are proposing. You should avoid using PyMutex_Lock to lock PyObject.ob_mutex because that can lead to deadlocks. The object might already be locked by a critical section when your functions are called. Any given lock should only be locked with the critical section APIs or with PyMutex_Lock, but not a mix.

If possible, I'd recommend limiting the implementation to locking one or two arguments so that you can use the critical section APIs.

We do something similar in CPython with Argument Clinic: https://github.com/python/cpython/blob/782217f28f0d67916fc3ff82b03b88573686c0e7/Objects/setobject.c#L1998. In our cases:

  • The vast majority of uses lock a single object, typically the "self" object. There are only a few cases where it makes sense to lock two objects at once.
  • It's not applicable everywhere: sometimes you need to lock a field on some argument or do some setup before locking.
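
Concretely, a sketch of what a binding layer could emit for the one- or two-argument case (do_wrapped_call is a placeholder, not this PR's implementation; the critical-section macros are the public 3.13 API):

#include <Python.h>

PyObject *do_wrapped_call(PyObject *arg1, PyObject *arg3);  // placeholder for the bound C++ call

PyObject *call_with_locked_args(PyObject *arg1, PyObject *arg3) {
    PyObject *result;
    Py_BEGIN_CRITICAL_SECTION2(arg1, arg3);  // locks both objects; ordering is handled internally
    result = do_wrapped_call(arg1, arg3);
    Py_END_CRITICAL_SECTION2();
    return result;
}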

Adapting C++ code to handle parallelism due to free-threaded Python can be
tricky, especially when the original code is given as-is. This commit adds a
tentative API to retrofit locking onto existing code by locking the
arguments of function calls.
This commit documents free-threading in general and in the context of
nanobind extensions.
Several parallel tests to check that locking works as expected
@wjakob
Owner Author

wjakob commented Sep 8, 2024

I've continued working on the free-threading PR and am now happy with it. Besides argument-level locking, the PR now also provides expanded documentation in docs/free_threaded.rst explaining how this all works, and what to watch out for.

I will leave it up for a few days in case someone else wants to take a look (@oremanj, @colesbury, @vfdev-5?)

@colesbury, I followed the approach you suggested, which was to only allow <= 2 locks per call and leverage the Python critical section API.
I have one request for you -- not critical, but perhaps something to look into for v3.14? The immortalization snippet you shared works well, but it does seem like an implementation detail leaking to a place where it doesn't belong. Probably CPython should add a PyObject_Immortalize() function or similar (there is _Py_SetImmortal, but that's only accessible within the interpreter).

@wjakob
Owner Author

wjakob commented Sep 8, 2024

(Sorry, the above sounded unnecessarily stressful :-) — if anyone wants to review the change, of course you can have more than a few days. It is a quite big change).

@vfdev-5

vfdev-5 commented Sep 9, 2024

@wjakob thanks for pinging for a review of this PR!
I wonder about the following use case: an issue was recently found in pybind11 (pybind/pybind11#5346); would nanobind suffer from a similar problem? Thanks!

@wjakob
Owner Author

wjakob commented Sep 9, 2024

The issue you referenced won't occur with nanobind, which uses a more fine-grained locking policy (as opposed to a global mutex for internals). The list of registered exception handlers in particular is assumed to be static/read-only following plugin initialization and not locked at all.

@wjakob wjakob changed the title WIP: free-threaded python Support for free-threaded python Sep 9, 2024
@wjakob wjakob changed the title Support for free-threaded python Support for free-threaded Python Sep 9, 2024
Comment on lines +24 to +34
To opt into free-threaded Python, pass the ``FREE_THREADED`` parameter to the
:cmake:command:`nanobind_add_module()` CMake target command. For other build
systems, refer to their respective documentation pages.

.. code-block:: cmake

   nanobind_add_module(
     my_ext         # Target name
     FREE_THREADED  # Opt into free-threading
     my_ext.h       # Source code files below
     my_ext.cpp)

@vfdev-5 vfdev-5 Sep 9, 2024


I'm sorry, I'm not quite sure I understand the opt-in mechanism. Can we opt into free-threading mode using CMake only? For other build systems, do we need to explicitly define the NB_FREE_THREADED flag?

Owner Author

CMake is the official build system of this project. You need to opt in from there because this also changes how nanobind's add-on library is compiled. There are also separate Bazel and Meson build systems maintained by @nicholasjng and @WillAyd. The authors of these will need to update their code with a minor adaptation, but that is not part of this PR.

When free-threading is available and requested, the build system will automatically set NB_FREE_THREADED when compiling the C++ extension code. The user doesn't need to use this define, but it could be useful in some cases, as in the sketch below.
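
For example, extension code could use the define to only pay for extra synchronization in free-threaded builds (an illustrative sketch, not code from this PR):

#include <mutex>

class Accumulator {
public:
    void increment() {
#if defined(NB_FREE_THREADED)
        std::lock_guard<std::mutex> guard(mutex_);  // needed only without the GIL
#endif
        ++value_;
    }
private:
#if defined(NB_FREE_THREADED)
    std::mutex mutex_;
#endif
    long value_ = 0;
};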

Contributor

fyi, I'm tracking this in nicholasjng/nanobind-bazel#37 for Bazel. Contributions welcome!

Contributor

Meson >= 1.5.0 natively supports building free-threaded extensions from a free-threaded interpreter. In theory, the only thing we will have to do with that configuration is add the NB_FREE_THREADED define when that type of interpreter is detected, although if you wanted to experiment early you can try updating your wrap configuration to point to head here and manually define the macro:

meson setup builddir -Dcpp_args="$CXXFLAGS -DNB_FREE_THREADED"

or

pip install <your_extension> -Csetup-args='-Dcpp_args="$CXXFLAGS -DNB_FREE_THREADED"'

if using the pip front-end

@colesbury

I read through the PR and it looks good to me, although I don't have a lot of familiarity with nanobind.

@wjakob
Owner Author

wjakob commented Sep 10, 2024

Thank you @colesbury. Would you be open to adding an official API call for the immortalization step? I'm concerned that having this fragment (https://github.com/wjakob/nanobind/pull/695/files#diff-6de720f0a86306e2563f4f4606136d14fb439bad73da2fb7641029d0e37e6d95R1181) be a part of nanobind is not a good long-term solution.

@colesbury

@wjakob - yes, I would like it as part of the C API. We will need to propose it to the C API Working Group: https://github.com/capi-workgroup/decisions/issues

I think we will want at least two functions:

  • One to mark an object as immortal
  • One to mark an object as using deferred reference counting, as that will be the more appropriate behavior in 3.14+

@oremanj (Contributor) left a comment

This is great!

@@ -184,6 +188,50 @@ char *strdup_check(const char *s) {
return result;
}

#if defined(NB_FREE_THREADED)
// Locked function wrapper for free-threaded Python
struct locked_func {
Contributor

I'm not sure how often locking is likely to be used in practice, but it seems like we don't strictly need to pay the cost of the additional indirection that's currently needed here because of type-erasure of the locked function. The alternative would look something like:

  • add types arg_locked and arg_v_locked so that the information on whether an argument is [potentially] locked can be carried in the type rather than just the value; arg::lock() returns an arg_locked, etc
  • make func_create in nb_func.h notice when it receives some arg_locked, arg_v_locked, or self_locked arguments, and conditionally instantiate a ft_object2_guard in its impl lambda (the most natural choice would be to use std::optional, but a custom RAII guard would also work if you don't want to take that header dependency)

That would also let us statically assert that at most 2 arguments can be locked, rather than relying on a dynamic check; and I think it might allow removing cast_flags::locked as well, which would make it feasible to allow using the simple function dispatcher even with locked args (if the user passes only empty nb::arg() params with no changes to none/convert/name/default). Up to you whether you think it's worth the trouble.

Owner Author

make func_create in nb_func.h notice when it receives some arg_locked, arg_v_locked, or self_locked arguments, and conditionally instantiate a ft_object2_guard in its impl lambda (the most natural choice would be to use std::optional, but a custom RAII guard would also work if you don't want to take that header dependency)

This sounds really intriguing. What role does std::optional play? How would you conditionally insert the ft_object2_guard with templating, and how would it know which arguments to fetch? I had not considered this could be done in the compile-time part and don't see a reasonably simple solution.

Contributor

I’m on vacation until Sunday but will try to prototype this when I’m back!

Owner Author

That would be amazing, thank you!

Owner Author

One thing I wanted to add: I fear the meta-templating needed to pull this off might push things above a complexity threshold that I would be willing to accept. Right now, the solution has a small runtime cost, but users only need to pay that cost when using the locking feature. It's not perfect but an OK solution. Basically: don't spend a lot of time on it if you feel like the solution will turn into a template monstrosity.

Contributor

Take a look at #720 which implements this. I think it came out pretty reasonable as far as complexity goes.

type_data *nb_type_c2p(nb_internals *internals_,
const std::type_info *type) {
#if defined(NB_FREE_THREADED)
thread_local nb_type_map_fast type_c2p_fast;
Contributor

Other places use static NB_THREAD_LOCAL for thread-locals; should this one normalize that way?

To check my understanding - this means each nanobind module and each thread will have its own copy of type_c2p_fast, right? Want to make sure there's no risk of inadvertent sharing between different nanobind modules (with potentially different ABI versions) that are running in the same thread. I think it's fine because this definition at local scope wouldn't have linkage, but sometimes dynamic linking semantics surprise me.

@wjakob (Owner Author) Sep 14, 2024

Indeed, the function-local declaration anonymizes this symbol. In free-threaded mode, this variable plays the role of a local TLS cache to accelerate lookups.

Each module will have its own cache, potentially with separate entries depending on what types it looks up while dispatching function calls.

There is no inherent reason that this cache couldn't also be shared between all of the ABI-compatible extensions (still per-thread). Indeed, this is the role of the internals data structure that enables such sharing while being cautious of ABI incompatibilities. In principle, it should be possible to place the TLS variable there. However, non-static thread_local members are not allowed in C++ class declarations, which is why I chose this alternative. If you have other ideas, I'm open to them.

Python exposes its own TLS wrappers that seem convenient and would enable putting TLS data into internals (see the Py_tss_t *nb_static_property_disabled member of nb_internals for an example). However, I did not see an easy way to guarantee that the data structure destructor is called when the thread shuts down, which thread_local will do at least.
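
A simplified sketch of the lookup pattern being discussed (the names and container types are placeholders for nanobind's internal ones):

#include <typeindex>
#include <unordered_map>
#include <mutex>

struct type_data;  // per-type record managed by the binding library

static std::unordered_map<std::type_index, type_data *> type_c2p_slow;  // shared, guarded by a lock
static std::mutex slow_map_lock;

type_data *nb_type_c2p_sketch(const std::type_info *type) {
    // one cache per thread (and per extension, since the local symbol has no linkage)
    thread_local std::unordered_map<std::type_index, type_data *> type_c2p_fast;

    auto it = type_c2p_fast.find(*type);
    if (it != type_c2p_fast.end())
        return it->second;                              // fast path: no locking at all

    std::lock_guard<std::mutex> guard(slow_map_lock);   // slow path: consult the shared map
    auto it2 = type_c2p_slow.find(*type);
    type_data *result = (it2 != type_c2p_slow.end()) ? it2->second : nullptr;
    if (result)
        type_c2p_fast.emplace(*type, result);           // populate the thread-local cache
    return result;
}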

Contributor

I agree that the C++ thread-local seems to be the best choice here.


#if defined(NB_FREE_THREADED)
// In free-threaded mode, stale type information remains in the
// 'type_c2p_fast' TLS. It is freed when the thread terminates.
Contributor

Isn't this potentially a memory safety hazard? If one drops all references to a nanobind type (such as by del'ing it from its module and not having any active instances or subclasses of it), it shouldn't be possible for later attempts to convert its C++ type to Python to rely on a now-dangling pointer. I think this could be fixed by moving the thread-local map into a separate function that both nb_type_c2p and nb_type_unregister can call.

Owner Author

The problem is that one thread can unregister a type, but many other threads could have that type in their caches. So you would need to delete it there as well, which requires synchronization. However, the point of this construction is to avoid synchronization.

The comment here is missing an important detail: in free-threaded mode, we immortalize type objects in type_new() and let them leak. This means that this function will only ever be called to delete the type_data* of C++ types that are subclassed in Python, and these don't have a unique entry in type_c2p_fast.

Owner Author

A related question would be what to do if we don't immortalize type objects at some point in the future (which is unfortunate because of the leaks). Is there a way of having fast unlocked lookups while catching the case you mention?

The only thing I can think of is that there is some kind of global version counter incremented in nb_type_unregister, which is also cached in the TLS. If the version is different, the cache must be cleared. This would work most of the time but is still prone to a race when one thread is unregistering a type while the other one is currently performing a lookup in the cache.

Contributor

I agree this is safe now because we immortalize. It would be nice to mention that in the comment here.

As far as the potentially non-immortal future goes: one useful trick here is to increment the counter once before you remove the type from the c2p_slow map, and once again afterward. If lookup finds the counter at an odd value, it must retry. I think this is safe as long as you use an atomic value for the counter, guard unregistration with the internals mutex so multiple threads can't unregister simultaneously, and use acquire ordering for the first increment and release for the second.
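
A sketch of that versioned scheme (hypothetical names; the real lookup would consult the thread-local cache and the shared map where the placeholder comment sits):

#include <atomic>
#include <cstdint>
#include <typeinfo>

struct type_data;

static std::atomic<uint64_t> c2p_version{0};

void unregister_type_sketch(const std::type_info *type) {
    // caller is assumed to hold the internals mutex, so writers are serialized
    c2p_version.fetch_add(1, std::memory_order_acquire);  // now odd: lookups must retry
    // ... remove 'type' from the shared c2p_slow map ...
    c2p_version.fetch_add(1, std::memory_order_release);  // even again: state is consistent
}

type_data *lookup_type_sketch(const std::type_info *type) {
    for (;;) {
        uint64_t v1 = c2p_version.load(std::memory_order_acquire);
        if (v1 & 1)
            continue;                     // an unregistration is in progress; retry
        type_data *result = /* consult the thread-local cache / shared map */ nullptr;
        uint64_t v2 = c2p_version.load(std::memory_order_acquire);
        if (v1 == v2)
            return result;                // no unregistration raced with this lookup
    }
}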

@wjakob
Owner Author

wjakob commented Sep 16, 2024

I realized one more thing: the nb::dict iterator is tricky because it relies on the thread-unsafe PyDict_Next API. I at first tried to put a Python critical section into the iterator object, but that is likely not a good idea because the iterator objects are often copied. (In particular, my impression from the Python critical section is that repeated lock acquisition of the same object triggers some kind of costly fallback path.)

So for now the guidance is for the user to add a nb::ft_object_guard when iterating over dictionaries (see the added documentation). I am open to other solutions if you have suggestions.
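
In other words, the suggested usage looks roughly like this (the exact nb::ft_object_guard constructor signature is an assumption based on this discussion):

#include <nanobind/nanobind.h>
namespace nb = nanobind;

void iterate_safely(nb::dict d) {
    nb::ft_object_guard guard(d);  // hold the dict's per-object lock while iterating
    for (auto [key, value] : d) {
        // the PyDict_Next-based iteration is now protected against concurrent mutation
    }
}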

@wjakob
Owner Author

wjakob commented Sep 16, 2024

Thank you very much for the thorough feedback @oremanj.

{
    lock_internals guard(internals_);
    nb_ptr_map::iterator it;
    bool success;
    std::tie(it, success) = internals_->funcs.try_emplace(func, nullptr);
Contributor

Since you no longer need the additional scope for the lock guard's lifetime, I think you could switch this back to the auto [it, success] = ... at top-level that it was before the FT changes.

@oremanj
Contributor

oremanj commented Sep 16, 2024

So for now the guidance is for the user to add a nb::ft_object_guard when iterating over dictionaries (see the added documentation). I am open to other solutions if you have suggestions.

The new C++20 input_iterator concept does not require input iterators to be copyable, so a potential solution here might be to give the dict iterator itself a ft_object_guard member, thus making it move-only. (This could be done using conditional compilation only for FT builds in order to not present a backcompat hazard.) I think this would work fine with most user code, which is just iterating over a dict; but it's certainly possible to construct cases that would no longer compile under this transformation, since the non-rangeified standard library algorithms are still defined in terms of "LegacyInputIterator" which does require copyability. I think it's worth making it anyway, because it's so easy otherwise to forget to lock the dict while iterating, and it's much easier to fix compilation errors than it is to fix thread-safety bugs.
