free-threading: Documentation

This commit documents free-threading in general and in the context of nanobind extensions.
wjakob · Sep 8, 2024 · 47d9b76 · 47d9b76
1 parent 2b55a2b
commit 47d9b76
Show file tree

Hide file tree

Showing 3 changed files with 301 additions and 2 deletions.
diff --git a/docs/changelog.rst b/docs/changelog.rst
@@ -18,8 +18,14 @@ below inherit that of the preceding release.
 Version 2.2.0 (TBA)
 -------------------
 
-- The NVIDIA CUDA compiler (``nvcc``) is now explicitly supported and included
-  in nanobind's CI test suite.
+- nanobind can now target `free-threaded Python
+  <https://py-free-threading.github.io>`__, which replaces the `Global
+  Interpreter Lock (GIL)
+  <https://en.wikipedia.org/wiki/Global_interpreter_lock>`__ with a
+  fine-grained locking scheme (see `PEP 703
+  <https://peps.python.org/pep-0703/>`__) to better leverage multi-core
+  parallelism. A `separate documation page <free-threading>`__ explains this in
+  detail.
 
 - nanobind has always used `PEP 590 vector calls
   <https://www.python.org/dev/peps/pep-0590>`__ to efficiently dispatch calls
@@ -41,6 +47,9 @@ Version 2.2.0 (TBA)
   with :cpp:class:`nb::is_arithmetic() <is_flag>` creates enumerations deriving
   from :py:class:`enum.IntFlag`.
 
+- The NVIDIA CUDA compiler (``nvcc``) is now explicitly supported and included
+  in nanobind's CI test suite.
+
 * Added support for return value policy customization to the type casters of
   ``Eigen::Ref<...>`` and ``Eigen::Map<...>`` (commit `67316e
   <https://github.com/wjakob/nanobind/commit/67316eb88955a15e8e89a57ce9a53d8d66263287>`__).

diff --git a/docs/free_threaded.rst b/docs/free_threaded.rst
@@ -0,0 +1,289 @@
+.. _free-threaded:
+
+.. cpp:namespace:: nanobind
+
+Free-threaded Python
+====================
+
+**Free-threading** is an experimental new Python feature that replaces the
+`Global Interpreter Lock (GIL)
+<https://en.wikipedia.org/wiki/Global_interpreter_lock>`__ with a fine-grained
+locking scheme to better leverage multi-core parallelism. The resulting
+benefits do not come for free: extensions must explicitly opt-in and generally
+require careful modifications to ensure correctness.
+
+Nanobind can target free-threaded Python since version 2.2.0. This page
+explains how to do so and discusses a few caveats. Besides this page, make sure
+to review `py-free-threading.github.io <https://py-free-threading.github.io>`__
+for a more comprehensive discussion of free-threaded Python. `PEP 703
+<https://peps.python.org/pep-0703/>`__ explains the nitty gritty details.
+
+Opting in
+---------
+
+To opt into free-threaded Python, pass the ``FREE_THREADED`` parameter to the
+:cmake:command:`nanobind_add_module()` CMake target command. For other build
+systems, refer to their respective documentation pages.
+
+.. code-block:: cmake
+
+      nanobind_add_module(
+        my_ext                   # Target name
+        FREE_THREADED            # Opt into free-threading
+        my_ext.h                 # Source code files below
+        my_ext.cpp)
+
+nanobind ignores the ``FREE_THREADED`` parameter when the registered Python
+version does not support free-threading.
+
+.. note::
+
+   **Stable ABI**: Note that there currently is no stable ABI for free-threaded
+   Python, hence the ``STABLE_ABI`` parameter will be ignored in free-threaded
+   extensions builds. It is valid to combine the ``STABLE_ABI`` and
+   ``FREE_THREADED`` arguments: the build system will choose between the two
+   depending on the detected Python version.
+
+   Warning: Loading a non-free threaded extension into a free-threaded Python
+   build disables free-threading globally.
+
+If free-threading was requested and is available, the build system will set the
+``NB_FREE_THREADED`` preprocessor flag. This can be helpful to specialize
+binding code with ``#ifdef`` blocks, e.g.:
+
+.. code-block:: cpp
+
+   #if !defined(NB_FREE_THREADED)
+   ... // simple GIL-protected code
+   #else
+   ... // more complex thread-aware code
+   #endif
+
+Caveats
+-------
+
+Free-threading can violate implicit assumptions made by extension developers
+when previously serial operations suddenly run concurrently, producing
+undefined behavior (race conditions, crashes, etc.).
+
+Let's consider a concrete example: the binding code below defines a ``Counter``
+class with an increment operation.
+
+.. code-block:: cpp
+
+   struct Counter {
+       int value = 0;
+       void inc() { value++; }
+   };
+
+   nb::class_<Counter>(m, "Counter")
+       .def("inc", &Counter::inc)
+       .def_ro("value", &Counter::value);
+
+
+If multiple threads call the ``inc()`` method of a single ``Counter``, the
+final count will generally be incorrect, as the increment operation ``value++``
+does not execute atomically.
+
+To fix this, we could modify the C++ type so that it protects its ``value``
+member from concurrent modification, for example using an atomic number type
+(e.g., ``std::atomic<int>``) or a critical section (e.g., based on
+``std::mutex``).
+
+The race condition in the above example is relatively benign. However,
+in more complex projects, combinations of concurrency and unsafe memory
+accesses could introduce non-deterministic data corruption and crashes.
+
+Another common source of problems are *global variables* undergoing concurrent
+modification when no longer protected by the GIL. They will likewise require
+supplemental locking. The :ref:`next section <free-threaded-locks>` explains a
+Python-specific locking primitive that can be used in binding code besides
+the solutions mentioned above.
+
+.. _free-threaded-locks:
+
+Python locks
+------------
+
+Nanobind provides convenience functionality encapsulating the mutex
+implementation that is part of Python ("``PyMutex``"). It is slightly more
+efficient than OS/language-provided synchronization primitives and generally
+preferable within Python extensions.
+
+The class :cpp:class:`ft_mutex` is analogous to ``std::mutex``, and
+:cpp:class:`ft_lock_guard` is analogous to ``std::lock_guard``. Note that they
+only exist to add *supplemental* critical sections needed in free-threaded
+Python, while becoming inactive (no-ops) when targeting regular GIL-protected
+Python.
+
+With these abstractions, the previous ``Counter`` implementation could be
+rewritten as:
+
+.. code-block:: cpp
+   :emphasize-lines: 3,6
+
+   struct Counter {
+       int value = 0;
+       nb::ft_mutex mutex;
+
+       void inc() {
+           nb::ft_lock_guard guard(mutex);
+           value++;
+       }
+   };
+
+These locks are very compact (``sizeof(nb::ft_mutex) == 1``), though this is a
+Python implementation detail that could change in the future.
+
+.. _argument-locks:
+
+Argument locking
+----------------
+
+Modifying class and function definitions as shown above may not always be
+possible. As an alternative, nanobind also provides a way to *retrofit*
+supplemental locking onto existing code. The idea is to lock individual
+arguments of a function *before* being allowed to invoke it. A built-in mutex
+present in every Python object enables this.
+
+To do so, call the :cpp:func:`.lock() <arg::lock>` member of
+:cpp:class:`nb::arg() <arg>` annotations to indicate that an
+argument must be locked, e.g.:
+
+- ``nb::arg("my_parameter").lock()``
+- ``"my_parameter"_a.lock()`` (short-hand form)
+
+In methods bindings, pass :cpp:struct:`nb::lock_self() <lock_self>` to lock
+the implicit ``self`` argument. Note that at most 2 arguments can be
+locked per function, which is a limitation of the `Python locking API
+<https://docs.python.org/3.13/c-api/init.html#c.Py_BEGIN_CRITICAL_SECTION2>`__.
+
+The example below shows how this functionality can be used to protect ``inc()``
+and a new ``merge()`` function that acquires two simultaneous locks.
+
+.. code-block:: cpp
+
+   struct Counter {
+       int value = 0;
+
+       void inc() { value++; }
+       void merge(Counter &other) {
+           value += other.value;
+           other.value = 0;
+       }
+   };
+
+   nb::class_<Counter>(m, "Counter")
+       .def("inc", &Counter::inc, nb::lock_self())
+       .def("merge", &Counter::merge, nb::lock_self(), "other"_a.lock())
+       .def_ro("value", &Counter::value);
+
+The above solution has an obvious drawback: it only protects *bindings* (i.e.,
+transitions from Python to C++). For example, if some other part of a C++
+codebase calls ``merge()`` directly, the binding layer won't be involved, and
+no locking takes place. If such behavior can introduce race conditions, a
+larger-scale redesign of your project may be in order.
+
+.. note::
+
+   Adding locking annotations indiscriminately is inadvisable because they add
+   a runtime cost to function call dispatcher.
+   The :cpp:func:`.lock() <arg::lock>` annotation is ignored in GIL-protected
+   builds. Note that listing arguments in function bindings generally comes at
+   a small cost in terms of :ref:`binding overheads <binding-overheads>`.
+
+.. note::
+
+   **Python API and locking**: When the lock-protected function performs Python
+   API calls (e.g., using :ref:`wrappers <wrappers>` like :cpp:class:`nb::dict
+   <dict>`), Python may temporarily release locks to avoid deadlocks. Here,
+   even basic reference counting such as a :cpp:class:`nb::object
+   <object>` variable expiring at the end of a scope counts as an API call.
+
+   These locks will be reacquired following the Python API call. This behavior
+   resembles ordinary (GIL-protected) Python code, where operations like
+   `Py_DECREF()
+   <https://docs.python.org/3/c-api/refcounting.html#c.Py_DECREF>`__ can cause
+   cause arbitrary Python code to execute. The semantics of this kind of
+   relaxed critical section are described in the `Python documentation
+   <https://docs.python.org/3.13/c-api/init.html#python-critical-section-api>`__.
+
+Miscellaneous notes
+-------------------
+
+API
+---
+
+The following API specific to free-threading has been added:
+
+- :cpp:class:`nb::ft_mutex <ft_mutex>`
+- :cpp:class:`nb::ft_lock_guard <ft_lock_guard>`
+- :cpp:class:`nb::ft_object_guard <ft_object_guard>`
+- :cpp:class:`nb::ft_object2_guard <ft_object2_guard>`
+- :cpp:func:`nb::arg::lock() <arg::lock>`
+
+API stability
+_____________
+
+The interface explained in this is excluded from the project's semantic
+versioning policy. Free-threading is still experimental, and API breaks may be
+necessary based on future experience and changes in Python itself.
+
+Wrappers
+________
+
+:ref:`Wrapper types <wrappers>` like :cpp:class:`nb::list <list>` may used in
+multi-threaded code. Operations like :cpp:func:`nb::list::append()
+<list::append>` internally acquire locks and behave just like their ordinary
+Python counterparts. This means that race conditions can still occur without
+larger-scale synchronization, but such races won't jeopardize the memory safety
+of the program.
+
+GIL scope guards
+________________
+
+Prior to free-threaded Python, the nanobind scope guards
+:cpp:struct:`gil_scoped_acquire` and :cpp:struct:`gil_scoped_release` would
+normally be used to acquire/release the GIL and enable parallel regions.
+
+These remain useful and should not be removed from existing code: while no
+longer blocking operations, they set and unset the current Python thread
+context and inform the garbage collector.
+
+The :cpp:struct:`gil_scoped_release` RAII scope guard class plays a special
+role in free-threaded builds, since it releases all :ref:`argument locks
+<argument-locks>` held by the current thread.
+
+Immortalization
+_______________
+
+Python relies on a technique called *reference counting* to determine when an
+object is no longer needed. This approach can become a bottleneck in
+multi-threaded programs, since increasing and decreasing reference counts
+requires coordination among multiple processor cores. Python type and function
+objects are especially sensitive, since their reference counts change at a very
+high rate.
+
+Similar to free-threaded Python itself,  nanobind avoids this bottleneck by
+*immortalizing* functions (``nanobind.nb_func``, ``nanobind.nb_method``) and
+type bindings. Immortal objects don't require reference counting. In turn, the
+downside is that they leak when the interpreter shuts down. Free-threaded
+nanobind extensions disable the internal :ref:`leak checker <leak-checker>`,
+since it would produce many warning messages caused by immortal objects.
+
+Internal data structures
+________________________
+
+Nanobind maintains various internal data structures that store information
+about instances and function/type bindings. These data structures also play an
+important role to exchange type/instance data in larger projects that are split
+across several independent extension modules.
+
+The layout of these data structures differs between ordinary and free-threaded
+extensions, therefore nanobind isolates them from each other by assigning a
+different ABI version tag. This means that multi-module projects will need
+to consistently compile either free-threaded or non-free-threaded modules.
+
+Free-threaded nanobind uses thread-local and sharded data structures to avoid
+lock and atomic contention on the internal data structures, which would
+otherwise become a bottleneck in multi-threaded Python programs.
diff --git a/docs/index.rst b/docs/index.rst
@@ -132,6 +132,7 @@ The nanobind logo was designed by `AndoTwin Studio
    :caption: Advanced
    :maxdepth: 1
 
+   free_threaded
    ownership_adv
    lowlevel
    typeslots