free-threading: Documentation
This commit documents free-threading in general and in the context of
nanobind extensions.
wjakob committed Sep 8, 2024
1 parent 2b55a2b commit 47d9b76
Showing 3 changed files with 301 additions and 2 deletions.
13 changes: 11 additions & 2 deletions docs/changelog.rst
@@ -18,8 +18,14 @@ below inherit that of the preceding release.
Version 2.2.0 (TBA)
-------------------

- nanobind can now target `free-threaded Python
<https://py-free-threading.github.io>`__, which replaces the `Global
Interpreter Lock (GIL)
<https://en.wikipedia.org/wiki/Global_interpreter_lock>`__ with a
fine-grained locking scheme (see `PEP 703
<https://peps.python.org/pep-0703/>`__) to better leverage multi-core
parallelism. A :ref:`separate documentation page <free-threaded>` explains
this in detail.

- nanobind has always used `PEP 590 vector calls
<https://www.python.org/dev/peps/pep-0590>`__ to efficiently dispatch calls
@@ -41,6 +47,9 @@ Version 2.2.0 (TBA)
with :cpp:class:`nb::is_arithmetic() <is_flag>` creates enumerations deriving
from :py:class:`enum.IntFlag`.

- The NVIDIA CUDA compiler (``nvcc``) is now explicitly supported and included
in nanobind's CI test suite.

* Added support for return value policy customization to the type casters of
``Eigen::Ref<...>`` and ``Eigen::Map<...>`` (commit `67316e
<https://github.com/wjakob/nanobind/commit/67316eb88955a15e8e89a57ce9a53d8d66263287>`__).
289 changes: 289 additions & 0 deletions docs/free_threaded.rst
@@ -0,0 +1,289 @@
.. _free-threaded:

.. cpp:namespace:: nanobind

Free-threaded Python
====================

**Free-threading** is an experimental new Python feature that replaces the
`Global Interpreter Lock (GIL)
<https://en.wikipedia.org/wiki/Global_interpreter_lock>`__ with a fine-grained
locking scheme to better leverage multi-core parallelism. The resulting
benefits do not come for free: extensions must explicitly opt in and generally
require careful modifications to ensure correctness.

Since version 2.2.0, nanobind can target free-threaded Python. This page
explains how to do so and discusses a few caveats. Besides this page, make sure
to review `py-free-threading.github.io <https://py-free-threading.github.io>`__
for a more comprehensive discussion of free-threaded Python. `PEP 703
<https://peps.python.org/pep-0703/>`__ explains the nitty-gritty details.

Opting in
---------

To opt into free-threaded Python, pass the ``FREE_THREADED`` parameter to the
:cmake:command:`nanobind_add_module()` CMake target command. For other build
systems, refer to their respective documentation pages.

.. code-block:: cmake

   nanobind_add_module(
     my_ext                   # Target name
     FREE_THREADED            # Opt into free-threading
     my_ext.h                 # Source code files below
     my_ext.cpp)

nanobind ignores the ``FREE_THREADED`` parameter when the detected Python
version does not support free-threading.

.. note::

   **Stable ABI**: There is currently no stable ABI for free-threaded Python,
   hence the ``STABLE_ABI`` parameter will be ignored in free-threaded
   extension builds. It is valid to combine the ``STABLE_ABI`` and
   ``FREE_THREADED`` arguments: the build system will choose between the two
   depending on the detected Python version.

.. warning::

   Loading an extension that does not support free-threading into a
   free-threaded Python build disables free-threading globally (the
   interpreter re-enables the GIL).

If free-threading was requested and is available, the build system will set the
``NB_FREE_THREADED`` preprocessor flag. This can be helpful to specialize
binding code with ``#ifdef`` blocks, e.g.:

.. code-block:: cpp

   #if !defined(NB_FREE_THREADED)
   // ... simple GIL-protected code
   #else
   // ... more complex thread-aware code
   #endif

Caveats
-------

Free-threading can violate implicit assumptions made by extension developers
when previously serial operations suddenly run concurrently, producing
undefined behavior (race conditions, crashes, etc.).

Let's consider a concrete example: the binding code below defines a ``Counter``
class with an increment operation.

.. code-block:: cpp

   struct Counter {
       int value = 0;
       void inc() { value++; }
   };

   nb::class_<Counter>(m, "Counter")
       .def("inc", &Counter::inc)
       .def_ro("value", &Counter::value);

If multiple threads call the ``inc()`` method of a single ``Counter``, the
final count will generally be incorrect, as the increment operation ``value++``
does not execute atomically.

To fix this, we could modify the C++ type so that it protects its ``value``
member from concurrent modification, for example using an atomic number type
(e.g., ``std::atomic<int>``) or a critical section (e.g., based on
``std::mutex``).

The race condition in the above example is relatively benign. However,
in more complex projects, combinations of concurrency and unsafe memory
accesses could introduce non-deterministic data corruption and crashes.
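
The two fixes mentioned above can be sketched in plain C++ as follows. This is
a minimal, nanobind-independent illustration: ``AtomicCounter``,
``MutexCounter``, and the ``hammer()`` helper are hypothetical names chosen for
this example, and the binding code is omitted.

.. code-block:: cpp

   #include <atomic>
   #include <cassert>
   #include <mutex>
   #include <thread>
   #include <vector>

   // Fix 1: an atomic integer. fetch_add() performs the read-modify-write
   // as a single indivisible operation, so no increment is lost.
   struct AtomicCounter {
       std::atomic<int> value{0};
       void inc() { value.fetch_add(1); }
       int get() const { return value.load(); }
   };

   // Fix 2: a plain integer guarded by a std::mutex critical section.
   struct MutexCounter {
       int value = 0;
       mutable std::mutex mutex;
       void inc() {
           std::lock_guard<std::mutex> guard(mutex);
           value++;
       }
       int get() const {
           std::lock_guard<std::mutex> guard(mutex);
           return value;
       }
   };

   // Hypothetical helper: call inc() from several threads at once and
   // return the final count.
   template <typename Counter>
   int hammer(Counter &counter, int n_threads, int n_incs) {
       std::vector<std::thread> threads;
       for (int i = 0; i < n_threads; ++i)
           threads.emplace_back([&counter, n_incs] {
               for (int j = 0; j < n_incs; ++j)
                   counter.inc();
           });
       for (std::thread &t : threads)
           t.join();
       return counter.get();
   }

With either variant, concurrent calls no longer lose updates, so
``hammer(counter, 8, 10000)`` should return ``80000``. The atomic version
avoids locking but only suits simple operations; the mutex version generalizes
to critical sections that touch several members.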

Another common source of problems is *global variables* that undergo
concurrent modification once they are no longer protected by the GIL. They
likewise require supplemental locking. The :ref:`next section
<free-threaded-locks>` explains a Python-specific locking primitive that can be
used in binding code in addition to the solutions mentioned above.
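
As an illustration, the sketch below guards a hypothetical global registry with
a ``std::mutex``. The names ``g_registry``, ``register_name()``,
``lookup_name()``, and ``fill_concurrently()`` are made up for this example.
Binding code that held the GIL used to access such a global one thread at a
time; in a free-threaded build, every access needs its own lock.

.. code-block:: cpp

   #include <cassert>
   #include <cstddef>
   #include <mutex>
   #include <string>
   #include <thread>
   #include <unordered_map>
   #include <vector>

   // Hypothetical global state shared by all bound functions.
   static std::unordered_map<std::string, int> g_registry;
   static std::mutex g_registry_mutex; // guards every access to g_registry

   void register_name(const std::string &name, int id) {
       std::lock_guard<std::mutex> guard(g_registry_mutex);
       g_registry[name] = id;
   }

   // Returns -1 if the name is unknown.
   int lookup_name(const std::string &name) {
       std::lock_guard<std::mutex> guard(g_registry_mutex);
       auto it = g_registry.find(name);
       return it == g_registry.end() ? -1 : it->second;
   }

   std::size_t registry_size() {
       std::lock_guard<std::mutex> guard(g_registry_mutex);
       return g_registry.size();
   }

   // Usage sketch: several threads register disjoint names concurrently.
   void fill_concurrently(int n_threads, int per_thread) {
       std::vector<std::thread> threads;
       for (int t = 0; t < n_threads; ++t)
           threads.emplace_back([t, per_thread] {
               for (int i = 0; i < per_thread; ++i)
                   register_name("t" + std::to_string(t) + "_" + std::to_string(i),
                                 t * per_thread + i);
           });
       for (std::thread &th : threads)
           th.join();
   }

Without the mutex, concurrent insertions could corrupt the hash table's
internal state rather than merely produce a wrong count.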

.. _free-threaded-locks:

Python locks
------------

Nanobind provides convenience functionality encapsulating the mutex
implementation that is part of Python ("``PyMutex``"). It is slightly more
efficient than OS/language-provided synchronization primitives and generally
preferable within Python extensions.

The class :cpp:class:`ft_mutex` is analogous to ``std::mutex``, and
:cpp:class:`ft_lock_guard` is analogous to ``std::lock_guard``. Note that they
only exist to add *supplemental* critical sections needed in free-threaded
Python, while becoming inactive (no-ops) when targeting regular GIL-protected
Python.

With these abstractions, the previous ``Counter`` implementation could be
rewritten as:

.. code-block:: cpp
   :emphasize-lines: 3,6

   struct Counter {
       int value = 0;
       nb::ft_mutex mutex;

       void inc() {
           nb::ft_lock_guard guard(mutex);
           value++;
       }
   };

These locks are very compact (``sizeof(nb::ft_mutex) == 1``), though this is a
Python implementation detail that could change in the future.
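
To illustrate the idea of a lock that only exists in free-threaded builds, here
is a minimal standalone sketch. It is *not* nanobind's implementation:
``MY_FREE_THREADED`` is a hypothetical stand-in for a build flag such as
``NB_FREE_THREADED``, and ``std::mutex`` stands in for ``PyMutex``.

.. code-block:: cpp

   #include <cassert>
   #include <mutex>

   #if defined(MY_FREE_THREADED)
   // Free-threaded build: a real mutex provides the supplemental critical section.
   using maybe_mutex = std::mutex;
   #else
   // GIL build: the GIL already serializes callers, so locking is a no-op.
   struct maybe_mutex {
       void lock() {}
       void unlock() {}
   };
   #endif

   // std::lock_guard accepts either variant, since both provide lock()/unlock().
   using maybe_lock_guard = std::lock_guard<maybe_mutex>;

   struct Counter {
       int value = 0;
       maybe_mutex mutex;

       void inc() {
           maybe_lock_guard guard(mutex);
           value++;
       }
   };

In the GIL build, the guard costs essentially nothing, mirroring the inactive
(no-op) behavior described above for :cpp:class:`ft_mutex` and
:cpp:class:`ft_lock_guard`.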

.. _argument-locks:

Argument locking
----------------

Modifying class and function definitions as shown above may not always be
possible. As an alternative, nanobind also provides a way to *retrofit*
supplemental locking onto existing code. The idea is to lock individual
arguments of a function *before* being allowed to invoke it. A built-in mutex
present in every Python object enables this.

To do so, call the :cpp:func:`.lock() <arg::lock>` member of
:cpp:class:`nb::arg() <arg>` annotations to indicate that an
argument must be locked, e.g.:

- ``nb::arg("my_parameter").lock()``
- ``"my_parameter"_a.lock()`` (short-hand form)

In method bindings, pass :cpp:struct:`nb::lock_self() <lock_self>` to lock
the implicit ``self`` argument. Note that at most two arguments can be
locked per function, which is a limitation of the `Python locking API
<https://docs.python.org/3.13/c-api/init.html#c.Py_BEGIN_CRITICAL_SECTION2>`__.

The example below shows how this functionality can be used to protect ``inc()``
and a new ``merge()`` function that acquires two simultaneous locks.

.. code-block:: cpp

   struct Counter {
       int value = 0;
       void inc() { value++; }
       void merge(Counter &other) {
           value += other.value;
           other.value = 0;
       }
   };

   nb::class_<Counter>(m, "Counter")
       .def("inc", &Counter::inc, nb::lock_self())
       .def("merge", &Counter::merge, nb::lock_self(), "other"_a.lock())
       .def_ro("value", &Counter::value);

The above solution has an obvious drawback: it only protects *bindings* (i.e.,
transitions from Python to C++). For example, if some other part of a C++
codebase calls ``merge()`` directly, the binding layer won't be involved, and
no locking takes place. If such behavior can introduce race conditions, a
larger-scale redesign of your project may be in order.
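
One possible redesign moves the locking into the C++ type itself, so that
direct C++ callers are protected as well. The sketch below is plain C++17
without nanobind. It uses ``std::scoped_lock``, which acquires several mutexes
with a deadlock-avoidance algorithm; this is a standard-library analog of the
two-object critical section used for argument locking, not nanobind's
mechanism. Treating self-merge as a no-op and the ``merge_stress()`` helper are
choices made for this sketch.

.. code-block:: cpp

   #include <cassert>
   #include <mutex>
   #include <thread>

   // Counter now protects itself, so C++ callers that bypass the bindings
   // are safe as well.
   struct Counter {
       int value = 0;
       mutable std::mutex mutex;

       void inc() {
           std::lock_guard<std::mutex> guard(mutex);
           value++;
       }

       void merge(Counter &other) {
           if (&other == this)
               return; // this sketch treats self-merge as a no-op
           // Locks both mutexes with a deadlock-avoidance algorithm, so that
           // a.merge(b) and b.merge(a) may run concurrently.
           std::scoped_lock guard(mutex, other.mutex);
           value += other.value;
           other.value = 0;
       }

       int get() const {
           std::lock_guard<std::mutex> guard(mutex);
           return value;
       }
   };

   // Usage sketch: two threads merge in opposite directions; the total is conserved.
   int merge_stress(int rounds) {
       Counter a, b;
       for (int i = 0; i < 1000; ++i) {
           a.inc();
           b.inc();
       }
       std::thread t1([&] { for (int i = 0; i < rounds; ++i) a.merge(b); });
       std::thread t2([&] { for (int i = 0; i < rounds; ++i) b.merge(a); });
       t1.join();
       t2.join();
       return a.get() + b.get();
   }

``merge_stress(10000)`` should return ``2000`` regardless of how the threads
interleave. With the locks inside the type, the argument-locking annotations
would no longer be needed for correctness of these two methods.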

.. note::

   Adding locking annotations indiscriminately is inadvisable because they
   add a runtime cost to the function call dispatcher. The
   :cpp:func:`.lock() <arg::lock>` annotation is ignored in GIL-protected
   builds. Note that listing arguments in function bindings generally comes
   at a small cost in terms of :ref:`binding overheads <binding-overheads>`.

.. note::

   **Python API and locking**: When the lock-protected function performs
   Python API calls (e.g., using :ref:`wrappers <wrappers>` like
   :cpp:class:`nb::dict <dict>`), Python may temporarily release locks to
   avoid deadlocks. Here, even basic reference counting, such as a
   :cpp:class:`nb::object <object>` variable expiring at the end of a scope,
   counts as an API call.

   These locks will be reacquired following the Python API call. This
   behavior resembles ordinary (GIL-protected) Python code, where operations
   like `Py_DECREF()
   <https://docs.python.org/3/c-api/refcounting.html#c.Py_DECREF>`__ can
   cause arbitrary Python code to execute. The semantics of this kind of
   relaxed critical section are described in the `Python documentation
   <https://docs.python.org/3.13/c-api/init.html#python-critical-section-api>`__.

Miscellaneous notes
-------------------

API
___

The following API specific to free-threading has been added:

- :cpp:class:`nb::ft_mutex <ft_mutex>`
- :cpp:class:`nb::ft_lock_guard <ft_lock_guard>`
- :cpp:class:`nb::ft_object_guard <ft_object_guard>`
- :cpp:class:`nb::ft_object2_guard <ft_object2_guard>`
- :cpp:func:`nb::arg::lock() <arg::lock>`

API stability
_____________

The interface explained in this section is excluded from the project's semantic
versioning policy. Free-threading is still experimental, and API breaks may be
necessary based on future experience and changes in Python itself.

Wrappers
________

:ref:`Wrapper types <wrappers>` like :cpp:class:`nb::list <list>` may be used
in multi-threaded code. Operations like :cpp:func:`nb::list::append()
<list::append>` internally acquire locks and behave just like their ordinary
Python counterparts. This means that race conditions can still occur without
larger-scale synchronization, but such races won't jeopardize the memory safety
of the program.

GIL scope guards
________________

Prior to free-threaded Python, the nanobind scope guards
:cpp:struct:`gil_scoped_acquire` and :cpp:struct:`gil_scoped_release` would
normally be used to acquire/release the GIL and enable parallel regions.

These remain useful and should not be removed from existing code: while they no
longer serialize threads, they still set and unset the current Python thread
context and inform the garbage collector.

The :cpp:struct:`gil_scoped_release` RAII scope guard class plays a special
role in free-threaded builds, since it releases all :ref:`argument locks
<argument-locks>` held by the current thread.

Immortalization
_______________

Python relies on a technique called *reference counting* to determine when an
object is no longer needed. This approach can become a bottleneck in
multi-threaded programs, since increasing and decreasing reference counts
requires coordination among multiple processor cores. Python type and function
objects are especially sensitive, since their reference counts change at a very
high rate.

Similar to free-threaded Python itself, nanobind avoids this bottleneck by
*immortalizing* functions (``nanobind.nb_func``, ``nanobind.nb_method``) and
type bindings. Immortal objects don't require reference counting. The downside
is that they leak when the interpreter shuts down. For this reason,
free-threaded nanobind extensions disable the internal :ref:`leak checker
<leak-checker>`, which would otherwise produce many warning messages caused by
immortal objects.
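
The toy model below illustrates why immortal objects sidestep this bottleneck.
It is a simplified sketch with hypothetical names (``ToyObject``,
``kImmortal``); CPython's actual scheme, including the biased reference
counting used in free-threaded builds, differs in detail.

.. code-block:: cpp

   #include <atomic>
   #include <cassert>
   #include <cstdint>
   #include <limits>

   struct ToyObject {
       // Sentinel reference count that marks an object as immortal.
       static constexpr std::uint32_t kImmortal =
           std::numeric_limits<std::uint32_t>::max();

       std::atomic<std::uint32_t> refcnt{1};

       // Should be called before the object is shared with other threads.
       void make_immortal() { refcnt.store(kImmortal); }

       bool is_immortal() const {
           return refcnt.load(std::memory_order_relaxed) == kImmortal;
       }

       void incref() {
           if (is_immortal())
               return; // no write, so the counter is never contended
           refcnt.fetch_add(1, std::memory_order_relaxed);
       }

       // Returns true if the caller released the last reference.
       bool decref() {
           if (is_immortal())
               return false; // never freed: the object "leaks" at shutdown
           return refcnt.fetch_sub(1, std::memory_order_acq_rel) == 1;
       }
   };

An immortal object's counter is only ever read, so each core can keep its
cached copy, whereas every ``fetch_add()``/``fetch_sub()`` on a shared counter
forces the cache line to move between cores.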

Internal data structures
________________________

Nanobind maintains various internal data structures that store information
about instances and function/type bindings. These data structures also play an
important role in exchanging type/instance data in larger projects that are
split across several independent extension modules.

The layout of these data structures differs between ordinary and free-threaded
extensions, therefore nanobind isolates them from each other by assigning a
different ABI version tag. This means that multi-module projects will need to
consistently compile all of their modules as either free-threaded or
non-free-threaded extensions.

Free-threaded nanobind uses thread-local and sharded data structures to avoid
lock and atomic contention on the internal data structures, which would
otherwise become a bottleneck in multi-threaded Python programs.
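
The sharding part of this approach can be sketched as follows. This
illustrates the general technique only and is not nanobind's internal code:
``ShardedMap``, its default shard count, and ``fill_from_threads()`` are
hypothetical. Each key is hashed to one of ``N`` independently locked shards,
so threads that touch different shards do not contend for the same mutex.

.. code-block:: cpp

   #include <array>
   #include <cassert>
   #include <cstddef>
   #include <functional>
   #include <mutex>
   #include <thread>
   #include <unordered_map>
   #include <vector>

   template <typename K, typename V, std::size_t N = 16>
   class ShardedMap {
       struct Shard {
           std::mutex mutex;
           std::unordered_map<K, V> map;
       };
       std::array<Shard, N> shards;

       // Pick the shard responsible for a key.
       Shard &shard_for(const K &key) { return shards[std::hash<K>{}(key) % N]; }

   public:
       void insert(const K &key, const V &value) {
           Shard &s = shard_for(key);
           std::lock_guard<std::mutex> guard(s.mutex); // locks one shard only
           s.map[key] = value;
       }

       bool find(const K &key, V &out) {
           Shard &s = shard_for(key);
           std::lock_guard<std::mutex> guard(s.mutex);
           auto it = s.map.find(key);
           if (it == s.map.end())
               return false;
           out = it->second;
           return true;
       }

       // Visits every shard in turn; only a snapshot if inserts are ongoing.
       std::size_t size() {
           std::size_t total = 0;
           for (Shard &s : shards) {
               std::lock_guard<std::mutex> guard(s.mutex);
               total += s.map.size();
           }
           return total;
       }
   };

   // Usage sketch: four threads insert disjoint key ranges concurrently.
   std::size_t fill_from_threads() {
       ShardedMap<int, int> map;
       std::vector<std::thread> threads;
       for (int t = 0; t < 4; ++t)
           threads.emplace_back([&map, t] {
               for (int i = 0; i < 1000; ++i)
                   map.insert(t * 1000 + i, i);
           });
       for (std::thread &th : threads)
           th.join();
       return map.size();
   }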
1 change: 1 addition & 0 deletions docs/index.rst
@@ -132,6 +132,7 @@ The nanobind logo was designed by `AndoTwin Studio
:caption: Advanced
:maxdepth: 1

free_threaded
ownership_adv
lowlevel
typeslots
