nanobind — Seamless operability between C++17 and Python

nanobind is a small binding library that exposes C++ types in Python and vice versa. It is reminiscent of Boost.Python and pybind11 and uses near-identical syntax. In contrast to these existing tools, nanobind is more efficient: bindings compile in a shorter amount of time, producing smaller binaries with better runtime performance.

Why yet another binding library?

I started the pybind11 project back in 2015 to generate better and more efficient C++/Python bindings. Thanks to many amazing contributions by others, pybind11 has become a core dependency of software across the world including flagship projects like PyTorch and TensorFlow. Every day, the repository is cloned more than 100,000 times. Hundreds of contributed extensions and generalizations address use cases of this diverse audience. However, all of this success also came with costs: the complexity of the library grew tremendously, which had a negative impact on efficiency.

Ironically, the situation today feels like 2015 all over again: binding generation with existing tools (Boost.Python, pybind11) is slow and produces enormous binaries with overheads on runtime performance. At the same time, key improvements in C++17 and Python 3.8 provide opportunities for drastic simplifications. Therefore, I am starting another binding project. This time, the scope is intentionally limited so that this doesn't turn into an endless cycle.

Performance

TLDR: nanobind bindings compile ~2-3× faster, producing ~3× smaller binaries, with up to ~8× lower overheads on runtime performance (when comparing to pybind11 with -Os size optimizations).

The following experiments analyze the performance of a very large function-heavy (func) and class-heavy (class) binding microbenchmark compiled using Boost.Python, pybind11, and nanobind in both debug and size-optimized (opt) modes. A comparison with cppyy (which uses dynamic compilation) is also shown later. Details on the experimental setup can be found here.

The first plot contrasts the compilation time, where "number ×" annotations denote the amount of time spent relative to nanobind. As shown below, nanobind achieves a consistent ~2-3× improvement compared to pybind11.

nanobind also greatly reduces the binary size of the compiled bindings. There is a roughly 3× improvement compared to pybind11 and an 8-9× improvement compared to Boost.Python (both with size optimizations).

The last experiment compares the runtime performance overheads by calling one of the bound functions many times in a loop. Here, it is also interesting to compare against cppyy (gray bar) and a pure Python implementation that runs bytecode without binding overheads (hatched red bar).

This data shows that the overhead of calling a nanobind function is lower than that of an equivalent function call done within CPython. The functions benchmarked here don't perform CPU-intensive work, so this mainly measures the overheads of performing a function call, boxing/unboxing arguments and return values, etc.

The difference to pybind11 is significant: a ~2× improvement for simple functions, and an ~8× improvement when classes are being passed around. Complexities in pybind11 related to overload resolution, multiple inheritance, and holders are the main reasons for this difference. Those features were either simplified or completely removed in nanobind.

Finally, there is a ~1.4× improvement in both experiments compared to cppyy (please ignore the two [debug] columns—I did not feel comfortable adjusting the JIT compilation flags; all cppyy bindings are therefore optimized.)

What are technical differences between nanobind and cppyy?

cppyy is based on dynamic parsing of C++ code and just-in-time (JIT) compilation of bindings via the LLVM compiler infrastructure. The authors of cppyy report that their tool produces bindings with much lower overheads compared to pybind11, and the above plots show that this is indeed true. However, nanobind retakes the performance lead in these experiments.

With speed gone as the main differentiating factor, other qualitative differences make these two tools appropriate to different audiences: cppyy has its origin in CERN's ROOT mega-project and must be highly dynamic to work with that codebase: it can parse header files to generate bindings as needed. cppyy works particularly well together with PyPy and can avoid boxing/unboxing overheads with this combination. The main downside of cppyy is that it depends on big and complex machinery (Cling/Clang/LLVM) that must be deployed on the user's side and then run there. There isn't a way of pre-generating bindings and then shipping just the output of this process.

nanobind is relatively static in comparison: you must tell it which functions to expose via binding declarations. These declarations offer a high degree of flexibility that users will typically use to create bindings that feel pythonic. At compile-time, those declarations turn into a sequence of CPython API calls, which produces self-contained bindings that are easy to redistribute via PyPI or elsewhere. Tools like cibuildwheel and scikit-build can fully automate the process of generating Python wheels for each target platform. A minimal example project shows how to do this automatically via GitHub Actions.

What are technical differences between nanobind and pybind11?

nanobind and pybind11 are the most similar of all of the binding tools compared above.

The main difference between them is a change in philosophy: pybind11 must deal with all of C++ to bind complex legacy codebases, while nanobind targets a smaller C++ subset. The codebase has to adapt to the binding tool and not the other way around, which allows nanobind to be simpler and faster. Pull requests with extensions and generalizations were welcomed in pybind11, but they will likely be rejected in this project.

An overview of removed features is provided in a separate document. Besides feature removal, the rewrite was also an opportunity to address long-standing performance issues in pybind11:

C++ objects are now co-located with the Python object whenever possible (less pointer chasing compared to pybind11). The per-instance overhead for wrapping a C++ type into a Python object shrinks by 2.3× (pybind11: 56 bytes, nanobind: 24 bytes).
C++ function binding information is now co-located with the Python function object (less pointer chasing).
C++ type binding information is now co-located with the Python type object (less pointer chasing, fewer hashtable lookups).
nanobind internally replaces std::unordered_map with a more efficient hash table (tsl::robin_map, which is included as a git submodule).
Function calls from/to Python are realized using PEP 590 vector calls, which gives a nice speed boost. The main function dispatch loop no longer allocates heap memory.
pybind11 was designed as a header-only library, which is generally a good thing because it simplifies the compilation workflow. However, one major downside of this is that a large amount of redundant code has to be compiled in each binding file (e.g., the function dispatch loop and all of the related internal data structures). nanobind compiles a separate shared or static support library (libnanobind) and links it against the binding code to avoid redundant compilation. When using the CMake nanobind_add_module() function, this all happens transparently.
#include <pybind11/pybind11.h> pulls in a large portion of the STL (about 2.1 MiB of headers with Clang and libc++). nanobind minimizes STL usage to avoid this problem. Type casters even for basic types like std::string require an explicit opt-in by including an extra header file (e.g. #include <nanobind/stl/string.h>).
pybind11 is dependent on link time optimization (LTO) to produce reasonably-sized bindings, which makes linking a build time bottleneck. With nanobind's split into a precompiled core library and minimal metatemplating, LTO is no longer important.
nanobind maintains efficient internal data structures for lifetime management (needed for nb::keep_alive, nb::rv_policy::reference_internal, the std::shared_ptr interface, etc.). With these changes, it is no longer necessary that bound types are weak-referenceable, which saves a pointer per instance.
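The co-location idea in the list above can be illustrated in plain C++. The following is only a sketch under stated assumptions: `FakeHeader`, `IndirectInstance`, `InlineInstance`, and `value_is_inline` are invented names, and the layouts do not reproduce the actual data structures (or byte counts) of pybind11 or nanobind. It merely contrasts a wrapper that points to a separately allocated value with one that embeds the value's storage.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Stand-in for a Python object header (hypothetical; not the CPython layout).
struct FakeHeader {
    std::size_t refcount = 1;
    const void *type = nullptr;
};

struct Point { double x, y; };

// Layout 1: the wrapper stores a pointer to a separately allocated value.
// Reaching the value needs a second allocation and an extra dependent load.
struct IndirectInstance {
    FakeHeader header;
    std::unique_ptr<Point> value;
};

// Layout 2: storage for the value lives inside the wrapper itself.
struct InlineInstance {
    FakeHeader header;
    alignas(Point) unsigned char storage[sizeof(Point)];

    Point *get() { return std::launder(reinterpret_cast<Point *>(storage)); }
};

// Returns true when the wrapped value lies within the wrapper's own bytes.
template <typename Instance>
bool value_is_inline(const Instance &inst, const Point *value) {
    auto begin = reinterpret_cast<std::uintptr_t>(&inst);
    auto p = reinterpret_cast<std::uintptr_t>(value);
    return p >= begin && p < begin + sizeof(Instance);
}
```

With the inline layout, a single allocation holds both the header and the value, which is the general effect the "less pointer chasing" remarks refer to.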

Other improvements

Besides performance improvements, nanobind includes several quality-of-life improvements for developers:

nanobind has greatly improved support for exchanging CPU/GPU/TPU/.. tensor data structures with modern array programming frameworks.
nanobind can target Python's stable ABI interface starting with Python 3.12. This means that extension modules will eventually be compatible with any future version of Python without having to compile separate binaries per version. That vision is still far out, however: it will require Python 3.12+ to be widely deployed.
When the Python interpreter shuts down, nanobind reports instance, type, and function leaks related to bindings, which is useful for tracking down reference counting issues.
nanobind deletes its internal data structures when the Python interpreter terminates, which avoids memory leak reports in tools like valgrind.
In pybind11, function docstrings are pre-rendered while the binding code runs. In other words, every call to .def(...) to bind a function immediately creates the underlying docstring. When a function takes a C++ type as parameter that is not yet registered in pybind11, the docstring will include the C++ type name (e.g. std::vector<int, std::allocator<int>>), which can look rather awkward. Avoiding this issue in pybind11 requires careful arrangement of binding declarations. nanobind avoids this issue by not pre-rendering function docstrings: they are created on the fly when queried.
nanobind docstrings have improved out-of-the-box compatibility with tools like Sphinx.
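The docstring ordering issue described above can be modeled with a toy example in plain C++. This is a sketch only: `Registry`, `render_signature`, and `BoundFunction` are hypothetical names, and the code does not reflect how either library is implemented. It shows why a signature rendered at binding time can contain a raw C++ type name, while one rendered at query time picks up types registered later.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Toy registry mapping a C++ type name to its Python-facing name.
// All names here are hypothetical; this is not nanobind's implementation.
using Registry = std::map<std::string, std::string>;

// Render "func(arg: <type>)", falling back to the raw C++ name when the
// argument type has not been registered yet.
std::string render_signature(const Registry &registry,
                             const std::string &func,
                             const std::string &cpp_type) {
    auto it = registry.find(cpp_type);
    const std::string &shown = (it != registry.end()) ? it->second : cpp_type;
    return func + "(arg: " + shown + ")";
}

struct BoundFunction {
    std::string name;
    std::string cpp_arg_type;
    std::string eager_doc;  // rendered once, at binding time

    BoundFunction(const Registry &registry, std::string n, std::string t)
        : name(std::move(n)), cpp_arg_type(std::move(t)),
          eager_doc(render_signature(registry, name, cpp_arg_type)) {}

    // Rendered on demand, so later registrations are picked up.
    std::string lazy_doc(const Registry &registry) const {
        return render_signature(registry, name, cpp_arg_type);
    }
};
```

If a function is bound before its argument type is registered, `eager_doc` keeps the C++ spelling, whereas `lazy_doc()` reports the registered name once it exists.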

Dependencies

nanobind depends on recent versions of everything:

C++17: The if constexpr feature was crucial to simplify the internal meta-templating of this library.
Python 3.8+ or PyPy 7.3.10+ (either the 3.8 or 3.9 flavors): nanobind heavily relies on PEP 590 vector calls that were introduced in CPython version 3.8. nanobind also works with recent versions of PyPy subject to certain limitations.
CMake 3.15+: Recent CMake versions include important improvements to FindPython that this project depends on.
Supported compilers: Clang 7, GCC 8, MSVC2019 (or newer) are officially supported.

Other compilers like MinGW, Intel (icpc, dpc++), NVIDIA (PGI, nvcc) may or may not work but aren't officially supported. Pull requests to work around bugs in these compilers will not be accepted, as similar changes introduced significant complexity in pybind11. Instead, please file bugs with the vendors so that they will fix their compilers.
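As a brief illustration of why `if constexpr` simplifies template code, the sketch below selects a conversion branch at compile time inside a single function template. `describe` is a hypothetical function written for this example and is not part of nanobind; before C++17, the same dispatch would typically need several overloads or SFINAE helpers.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Convert a value to a string, picking the branch at compile time.
// Only the branch matching T is instantiated, so no overload set or
// SFINAE helper is needed. `describe` is a hypothetical example function.
template <typename T>
std::string describe(const T &value) {
    if constexpr (std::is_same_v<T, bool>) {
        return value ? "True" : "False";
    } else if constexpr (std::is_arithmetic_v<T>) {
        return std::to_string(value);
    } else {
        // Falls back to anything convertible to std::string.
        return std::string(value);
    }
}
```

Because discarded branches are never instantiated, `std::to_string` is not required to accept a `std::string` argument, and the string branch is not required to accept an `int`.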

CMake interface

nanobind integrates with CMake to simplify binding compilation. Please see the separate writeup for details.

The easiest way to get started is by cloning nanobind_example, which is a minimal project with nanobind-based bindings compiled via CMake and scikit-build. It also shows how to use GitHub Actions to deploy binary wheels for a variety of platforms.

API differences

nanobind mostly follows the pybind11 API, hence the pybind11 documentation is the main source of documentation for this project. A number of simplifications and noteworthy changes are detailed below.

Namespace. nanobind types and functions are located in the nanobind namespace. The namespace nb = nanobind; shorthand alias is recommended.
Macros. The PYBIND11_* macros (e.g., PYBIND11_OVERRIDE(..)) were renamed to NB_* (e.g., NB_OVERRIDE(..)).
Shared pointers and holders. nanobind removes the concept of a holder type, which caused inefficiencies and introduced complexity in pybind11. This has implications on object ownership, shared ownership, and interactions with C++ shared/unique pointers.

Please see the following separate page for the nitty-gritty details on shared and unique pointers. Classes with intrusive reference counting also continue to be supported; please see the linked page for details.

It is no longer necessary to specify holder types in the type declaration:

pybind11:
```
py::class_<MyType, std::shared_ptr<MyType>>(m, "MyType")
  ...
```
nanobind:
```
nb::class_<MyType>(m, "MyType")
  ...
```
Instead, use of shared/unique pointers requires including one or both of the following optional header files:
- nanobind/stl/unique_ptr.h
- nanobind/stl/shared_ptr.h
Binding functions that take std::unique_ptr<T> arguments involves some limitations that can be avoided by changing their signatures to use std::unique_ptr<T, nb::deleter<T>> instead. Usage of std::enable_shared_from_this<T> is prohibited and will raise a compile-time assertion. This is consistent with the philosophy of this library: the codebase has to adapt to the binding tool and not the other way around.
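To show in general terms why a custom deleter type can lift restrictions on `std::unique_ptr`, the sketch below uses a stateful deleter that remembers whether the pointee is managed elsewhere. This is an assumption-laden illustration in plain C++: `OwnershipAwareDeleter`, `Widget`, and `consume` are invented names, and the code is not `nb::deleter<T>` nor a description of its internals.

```cpp
#include <cassert>
#include <memory>

struct Widget { int id; };

// Hypothetical deleter that records who owns the object. It is NOT
// nb::deleter<T>; it only illustrates why a stateful deleter is useful.
struct OwnershipAwareDeleter {
    bool foreign_owned = false;    // true: someone else manages the memory
    int *release_count = nullptr;  // counts "hand back" notifications

    void operator()(Widget *w) const {
        if (foreign_owned) {
            // Memory belongs to another owner: notify instead of deleting.
            if (release_count) ++*release_count;
        } else {
            delete w;
        }
    }
};

using WidgetPtr = std::unique_ptr<Widget, OwnershipAwareDeleter>;

// Takes ownership and returns the id; the deleter runs when `w` is destroyed.
int consume(WidgetPtr w) { return w->id; }
```

A plain `std::unique_ptr<T>` always calls `delete`, so it cannot represent an object whose memory is managed by another party; a deleter with state can.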
Null pointers. In contrast to pybind11, nanobind by default does not permit None-valued arguments during overload resolution. They need to be enabled explicitly using the .none() member of an argument annotation.
```
    .def("func", &func, "arg"_a.none());
```
It is also possible to set a None default value as follows (the .none() annotation can be omitted in this special case):
```
    .def("func", &func, "arg"_a = nb::none());
```

Implicit type conversions. In pybind11, implicit conversions were specified using a follow-up function call. In nanobind, they are specified within the constructor declarations:

pybind11:
```
py::class_<MyType>(m, "MyType")
    .def(py::init<MyOtherType>());

py::implicitly_convertible<MyOtherType, MyType>();
```
nanobind:
```
nb::class_<MyType>(m, "MyType")
    .def(nb::init_implicit<MyOtherType>());
```

Custom constructors: In pybind11, custom constructors (i.e. ones that do not already exist in the C++ class) could be specified as a lambda function returning an instance of the desired type.
```
py::class_<MyType>(m, "MyType")
    .def(py::init([](int) { return MyType(...); }));
```
Unfortunately, the implementation of this feature was quite complex and often required further internal calls to the move or copy constructor. nanobind instead reverts to how pybind11 originally implemented this feature using in-place construction ("placement new"):
```
nb::class_<MyType>(m, "MyType")
    .def("__init__", [](MyType *t) { new (t) MyType(...); });
```
The provided lambda function will be called with a pointer to uninitialized memory that has already been allocated (this memory region is co-located with the Python object for reasons of efficiency). The lambda function can then either run an in-place constructor and return normally (in which case the instance is assumed to be correctly constructed) or fail by raising an exception.
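The in-place construction pattern described above can be tried out in isolation with standard C++. The following sketch assumes a stand-in `MyType` with an `int` constructor and a hypothetical helper `init_in_place` that plays the role of the `__init__` lambda; the raw storage here is a local buffer rather than memory attached to a Python object.

```cpp
#include <cassert>
#include <new>
#include <stdexcept>
#include <string>

struct MyType {
    std::string label;
    explicit MyType(int value) : label("value=" + std::to_string(value)) {
        if (value < 0) throw std::invalid_argument("negative value");
    }
};

// Mimics the shape of an "__init__" callback: receives uninitialized,
// suitably aligned memory and constructs the object in place.
// `init_in_place` is a hypothetical helper name.
void init_in_place(MyType *uninitialized, int value) {
    new (uninitialized) MyType(value);
}

// Allocate raw storage, construct in place, read the result, clean up.
std::string construct_and_read(int value) {
    alignas(MyType) unsigned char storage[sizeof(MyType)];
    MyType *raw = reinterpret_cast<MyType *>(storage);
    init_in_place(raw, value);          // may throw; then no object exists
    MyType *obj = std::launder(raw);    // pointer to the newly created object
    std::string result = obj->label;
    obj->~MyType();                     // in-place objects need a manual destructor call
    return result;
}
```

If the constructor throws, the exception propagates out of `init_in_place` and the storage is left without a live object, which mirrors the "fail by raising an exception" path mentioned above.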
Trampoline classes. Trampolines, i.e., polymorphic class implementations that forward virtual function calls to Python, now require an extra NB_TRAMPOLINE(parent, size) declaration, where parent refers to the parent class and size is at least as big as the number of NB_OVERRIDE_*() calls. nanobind caches information to enable efficient function dispatch, for which it must know the number of trampoline "slots". Example:
```
struct PyAnimal : Animal {
    NB_TRAMPOLINE(Animal, 1);

    std::string name() const override {
        NB_OVERRIDE(std::string, Animal, name);
    }
};
```
Trampoline declarations with an insufficient size may eventually trigger a Python RuntimeError exception with a descriptive label, e.g. nanobind::detail::get_trampoline('PyAnimal::what()'): the trampoline ran out of slots (you will need to increase the value provided to the NB_TRAMPOLINE() macro)!.
Type casters. The API of custom type casters has changed significantly. In a nutshell, the following changes are needed:
- load() was renamed to from_python(). The function now takes an extra uint8_t flags (instead of bool convert, which is now represented by the flag nanobind::detail::cast_flags::convert). A cleanup_list * pointer keeps track of Python temporaries that are created by the conversion, and which need to be deallocated after a function call has taken place. flags and cleanup should be passed to any recursive usage of type_caster::from_python(). If casting fails due to a Python exception, the function should clear it (PyErr_Clear()) and return false. If a severe error condition arises that should be reported, use Python warning API calls for this, e.g. PyErr_WarnFormat().
- cast() was renamed to from_cpp(). The function takes a return value policy (as before) and a cleanup_list * pointer. If casting fails due to a Python exception, the function should leave the error set (note the asymmetry compared to from_python()) and return nullptr.
Both functions must be marked as noexcept.

Note that the cleanup list is only available when from_python() or from_cpp() are called as part of function dispatch, while usage by nanobind::cast() sets cleanup to nullptr. This case should be handled gracefully by refusing the conversion if the cleanup list is absolutely required.

The std::pair<..> type caster may be useful as a reference for these changes.
Use of nb::make_iterator(), nb::make_key_iterator(), and nb::make_value_iterator() requires including the additional header file nanobind/make_iterator.h. The interface of these functions has also slightly changed: all take a Python scope and a name as first and second arguments, which are used to permanently "install" the iterator type (which is created on demand). See the test suite for a worked-out example.
The following types and functions were renamed:

| pybind11 | nanobind |
| --- | --- |
| `error_already_set` | `python_error` |
| `type::of<T>` | `type<T>` |
| `type` | `type_object` |
| `reinterpret_borrow` | `borrow` |
| `reinterpret_steal` | `steal` |

New features.
- Unified DLPack/Buffer protocol integration: nanobind can retrieve and return tensors using two standard protocols: DLPack and the buffer protocol. This enables zero-copy data exchange of CPU and GPU tensors with array programming frameworks including NumPy, PyTorch, TensorFlow, JAX, etc.
  
  Details on using this feature can be found here.
- Supplemental type data: nanobind can store supplemental data along with registered types. An example use of this fairly advanced feature is a library that registers large numbers of different types (e.g. flavors of tensors). A single generically implemented function can then query this supplemental information to handle each type slightly differently.
```
struct Supplement {
    ... // should be a POD (plain old data) type
};

// Register a new type Test, and reserve space for sizeof(Supplement)
nb::class_<Test> cls(m, "Test", nb::supplement<Supplement>(), nb::is_final());

/// Mutable reference to 'Supplement' portion in Python type object
Supplement &supplement = nb::type_supplement<Supplement>(cls);
```
  The supplement is not propagated to subclasses created within Python. Such types should therefore be created with nb::is_final().
- Low-level interface: nanobind exposes a low-level interface to provide fine-grained control over the sequence of steps that instantiates a Python object wrapping a C++ instance. Like the above point, this is useful when writing generic binding code that manipulates nanobind-based objects of various types.
  
  Details on using this feature can be found here.
- Python type wrappers: The nb::handle_t<T> type behaves just like the nb::handle class and wraps a PyObject * pointer. However, when binding a function that takes such an argument, nanobind will only call the associated function overload when the underlying Python object wraps a C++ instance of type T.
  
  Similarly, the nb::type_object_t<T> type behaves just like the nb::type_object class and wraps a PyTypeObject * pointer. However, when binding a function that takes such an argument, nanobind will only call the associated function overload when the underlying Python type object is a subtype of the C++ type T.
- Finding Python objects associated with a C++ instance: In addition to all of the return value policies supported by pybind11, nanobind provides one additional policy named nb::rv_policy::none that only succeeds when the return value is already a known/registered Python object. In other words, this policy will never attempt to move, copy, or reference a C++ instance by constructing a new Python object.
  
  The new nb::find() function encapsulates this behavior. It resembles nb::cast() in the sense that it returns the Python object associated with a C++ instance. But while nb::cast() will create that Python object if it doesn't yet exist, nb::find() will return a nullptr object.
- Customizing types: The pybind11 custom_type_setup annotation that enabled ad-hoc write access to a constructed Python type object was replaced with the limited API-compatible nb::type_slots interface. For an example of using this feature to fully integrate nanobind with Python's cyclic garbage collector, see the separate page on this topic.
- Raw docstrings: In cases where absolute control over docstrings is required (for example, so that complex cases can be parsed by a tool like Sphinx), the nb::raw_doc attribute can be specified to functions. In this case, nanobind will skip generation of a combined docstring that enumerates overloads along with type information.
  
  Example:
```
m.def("identity", [](float arg) { return arg; });
m.def("identity", [](int arg) { return arg; },
      nb::raw_doc(
          "identity(arg)\n"
          "An identity function for integers and floats\n"
          "\n"
          "Args:\n"
          "    arg (float | int): Input value\n"
          "\n"
          "Returns:\n"
          "    float | int: Result of the identity operation"));
```
  Writing detailed docstrings in this way is rather tedious. In practice, they would usually be extracted from C++ headers using a tool like pybind11_mkdoc.

How to cite this project?

Please use the following BibTeX template to cite nanobind in scientific discourse:

```
@misc{nanobind,
   author = {Wenzel Jakob},
   year = {2022},
   note = {https://github.com/wjakob/nanobind},
   title = {nanobind -- Seamless operability between C++17 and Python}
}
```
