Deploy Sphinx documentation
actions-user committed Feb 9, 2024
0 parents commit 9c5c9a3
Showing 69 changed files with 29,361 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: fd099f8b4928d9493ca64793d2d5e7fd
tags: 645f666f9bcd5a90fca523b33c5a78b7
Empty file added .nojekyll
Binary file added _images/compiler_pipeline.png
Binary file added _images/header.png
Binary file added _images/joint_transforms.png
214 changes: 214 additions & 0 deletions _sources/basics.rst.txt
@@ -0,0 +1,214 @@
Basics
======

Initialization
--------------

Before use, Warp should be explicitly initialized with the ``wp.init()`` method as follows::

    import warp as wp

    wp.init()

Warp will print some startup information about the compute devices available, driver versions, and the location
for any generated kernel code, e.g.:

.. code:: bat

    Warp 1.0.0 initialized:
       CUDA Toolkit: 11.8, Driver: 12.1
       Devices:
         "cpu"    | AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD
         "cuda:0" | NVIDIA GeForce RTX 4080 (sm_89)
       Kernel cache: C:\Users\mmacklin\AppData\Local\NVIDIA\warp\Cache\1.0.0

Kernels
-------

In Warp, compute kernels are defined as Python functions and annotated with the ``@wp.kernel`` decorator, as follows::

    @wp.kernel
    def simple_kernel(a: wp.array(dtype=wp.vec3),
                      b: wp.array(dtype=wp.vec3),
                      c: wp.array(dtype=float)):

        # get thread index
        tid = wp.tid()

        # load two vec3s
        x = a[tid]
        y = b[tid]

        # compute the dot product between vectors
        r = wp.dot(x, y)

        # write result back to memory
        c[tid] = r

Because Warp kernels are compiled to native C++/CUDA code, all the function input arguments must be statically typed. This allows
Warp to generate fast code that executes at essentially native speeds. Because kernels may be run on either the CPU
or GPU, they cannot access arbitrary global state from the Python environment. Instead they must read and write data
through their input parameters such as arrays.

Warp kernel functions have a 1:1 correspondence with CUDA kernels. To launch a kernel with 1024 threads, we use
:func:`wp.launch() <warp.launch>` as follows::

    wp.launch(kernel=simple_kernel, # kernel to launch
              dim=1024,             # number of threads
              inputs=[a, b, c],     # parameters
              device="cuda")        # execution device

Inside the kernel, we can retrieve the *thread index* of each thread using the ``wp.tid()`` built-in function::

    # get thread index
    i = wp.tid()

Kernels can be launched with 1D, 2D, 3D, or 4D grids of threads. For example, to launch a 2D grid of threads to process a 1024x1024 image, we could write::

    wp.launch(kernel=compute_image,
              dim=(1024, 1024),
              inputs=[img],
              device="cuda")

Then, inside the kernel we can retrieve a 2D thread index as follows::

    # get thread index
    i, j = wp.tid()

    # write out a color value for each pixel
    color[i, j] = wp.vec3(r, g, b)
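
Putting the two snippets together, a minimal sketch of a complete ``compute_image`` kernel might look like this (the
parameter name ``color``, the ``wp.array2d`` annotation, and the particular color values are illustrative assumptions)::

    @wp.kernel
    def compute_image(color: wp.array2d(dtype=wp.vec3)):
        # get the 2D thread index
        i, j = wp.tid()

        # illustrative color values derived from the pixel coordinates
        r = float(i) / 1024.0
        g = float(j) / 1024.0
        b = 0.0

        # write out a color value for each pixel
        color[i, j] = wp.vec3(r, g, b)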

.. _example-cache-management:

Example: Changing the kernel cache directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following example illustrates how the location for generated and compiled
kernel code can be changed before and after calling ``wp.init()``.

.. code:: python

    import os

    import warp as wp

    example_dir = os.path.dirname(os.path.realpath(__file__))

    # set default cache directory before wp.init()
    wp.config.kernel_cache_dir = os.path.join(example_dir, "tmp", "warpcache1")

    wp.init()

    print("+++ Current cache directory: ", wp.config.kernel_cache_dir)

    # change cache directory after wp.init()
    wp.build.init_kernel_cache(os.path.join(example_dir, "tmp", "warpcache2"))

    print("+++ Current cache directory: ", wp.config.kernel_cache_dir)

    # clear kernel cache (forces fresh kernel builds every time)
    wp.build.clear_kernel_cache()

    @wp.kernel
    def basic(x: wp.array(dtype=float)):
        tid = wp.tid()
        x[tid] = float(tid)

    device = "cpu"
    n = 10
    x = wp.zeros(n, dtype=float, device=device)

    wp.launch(kernel=basic, dim=n, inputs=[x], device=device)
    print(x.numpy())

Arrays
------

Memory allocations are exposed via the ``wp.array`` type. Arrays wrap an underlying memory allocation that may live in
either host (CPU) or device (GPU) memory. Arrays are strongly typed and store a linear sequence of built-in values
(``float``, ``int``, ``vec3``, ``matrix33``, etc.).

Arrays can be allocated in a similar way to PyTorch::

    # allocate an uninitialized array of vec3s
    v = wp.empty(shape=n, dtype=wp.vec3, device="cuda")

    # allocate a zero-initialized array of quaternions
    q = wp.zeros(shape=n, dtype=wp.quat, device="cuda")

    # allocate and initialize an array from a NumPy array
    # will be automatically transferred to the specified device
    a = np.ones((10, 3), dtype=np.float32)
    v = wp.from_numpy(a, dtype=wp.vec3, device="cuda")

By default, Warp arrays that are initialized from external data (e.g., NumPy arrays, lists, tuples) will create a copy of the data in new
memory on the specified device. However, it is possible for arrays to alias external memory using the ``copy=False`` parameter to the
array constructor, provided the input is contiguous and on the same device. See the :doc:`/modules/interoperability`
section for more details on sharing memory with external frameworks.
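
As a minimal sketch of aliasing (assuming a contiguous NumPy array living in host memory)::

    import numpy as np
    import warp as wp

    wp.init()

    # a contiguous NumPy buffer in host memory
    data = np.ones((10, 3), dtype=np.float32)

    # alias the NumPy memory instead of copying it; the input must be contiguous
    # and on the same device as the Warp array (here the CPU), and writes through
    # the Warp array are then visible in the original NumPy buffer
    v = wp.array(data=data, dtype=wp.vec3, copy=False, device="cpu")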

To read GPU array data back to CPU memory we can use the ``array.numpy()`` method::

    # bring data from device back to host
    view = device_array.numpy()

This will automatically synchronize with the GPU to ensure that any outstanding work has finished, and will
copy the array back to CPU memory where it is passed to NumPy. Calling ``array.numpy()`` on a CPU array will return
a zero-copy NumPy view onto the Warp data.
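
For example, a short sketch of the two behaviours (the array names are illustrative)::

    # GPU array: .numpy() synchronizes with the device and copies the data to the host
    device_array = wp.zeros(8, dtype=float, device="cuda")
    host_copy = device_array.numpy()   # independent copy in CPU memory

    # CPU array: .numpy() returns a zero-copy view onto the Warp data
    cpu_array = wp.zeros(8, dtype=float, device="cpu")
    host_view = cpu_array.numpy()
    host_view[0] = 42.0                # also modifies cpu_array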

User Functions
--------------

Users can write their own functions using the ``@wp.func`` decorator, for example::

    @wp.func
    def square(x: float):
        return x*x

User functions can be called freely from within kernels inside the same module and accept arrays as inputs.
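
For example, a minimal sketch of calling ``square()`` from a kernel in the same module (the kernel and array names are
illustrative)::

    @wp.kernel
    def square_all(values: wp.array(dtype=float)):
        tid = wp.tid()

        # call the user function defined above
        values[tid] = square(values[tid])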

Compilation Model
-----------------

Warp uses a Python->C++/CUDA compilation model that generates kernel code from Python function definitions. All kernels belonging to a Python module are runtime compiled into dynamic libraries and PTX; the result is then cached between application restarts for fast startup times.

Note that compilation is triggered on the first kernel launch for that module. Any kernels registered in the module with ``@wp.kernel`` will be included in the shared library.
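
As a minimal illustration (the kernel and array names are assumptions), the first launch below triggers code generation
and compilation for the module, while subsequent launches, and later application runs via the kernel cache, reuse the
compiled binaries::

    @wp.kernel
    def add_one(x: wp.array(dtype=float)):
        tid = wp.tid()
        x[tid] = x[tid] + 1.0

    x = wp.zeros(16, dtype=float, device="cuda")

    # first launch for this module: compiles the kernels (or loads them from the cache)
    wp.launch(kernel=add_one, dim=16, inputs=[x], device="cuda")

    # subsequent launches reuse the already-loaded module
    wp.launch(kernel=add_one, dim=16, inputs=[x], device="cuda")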

.. image:: ./img/compiler_pipeline.png


Language Details
----------------

To support GPU computation and differentiability, there are some differences from the CPython runtime.

Built-in Types
^^^^^^^^^^^^^^

Warp supports a number of built-in math types similar to high-level shading languages, for example ``vec2, vec3, vec4, mat22, mat33, mat44, quat, array``. All built-in types have value semantics, so that expressions such as ``a = b`` generate a copy of the variable ``b`` rather than a reference.
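
A minimal sketch of what this means inside a kernel (the kernel name and the component assignment are illustrative)::

    @wp.kernel
    def value_semantics(out: wp.array(dtype=wp.vec3)):
        tid = wp.tid()

        b = wp.vec3(1.0, 2.0, 3.0)
        a = b          # copies the value of b, not a reference to it

        b[0] = 10.0    # modifying b afterwards does not affect a

        out[tid] = a   # still (1.0, 2.0, 3.0)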

Strong Typing
^^^^^^^^^^^^^

Unlike Python, in Warp all variables must be typed. Types are inferred from source expressions and function signatures using the
Python typing extensions. All kernel parameters must be annotated with the appropriate type, for example::

    @wp.kernel
    def simple_kernel(a: wp.array(dtype=wp.vec3),
                      b: wp.array(dtype=wp.vec3),
                      c: float):
        ...

Tuple initialization is not supported; instead, variables should be explicitly typed::

    # invalid
    a = (1.0, 2.0, 3.0)

    # valid
    a = wp.vec3(1.0, 2.0, 3.0)


Limitations and Unsupported Features
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

See :doc:`limitations` for a list of Warp limitations and unsupported features.
138 changes: 138 additions & 0 deletions _sources/configuration.rst.txt
@@ -0,0 +1,138 @@
Runtime Settings
================

Warp has settings at the global, module, and kernel level that can be used to fine-tune the compilation and verbosity
of Warp programs. In cases in which a setting can be changed at multiple levels (e.g., ``enable_backward``),
the setting at the more-specific scope takes precedence.

Global Settings
---------------

To change a setting, prepend ``wp.config.`` to the name of the variable and assign a value to it.
Some settings may be changed on the fly, while others need to be set prior to calling ``wp.init()`` to take effect.

For example, the location of the user kernel cache can be changed with:

.. code-block:: python

    import os

    import warp as wp

    example_dir = os.path.dirname(os.path.realpath(__file__))

    # set default cache directory before wp.init()
    wp.config.kernel_cache_dir = os.path.join(example_dir, "tmp", "warpcache1")

    wp.init()

Basic Global Settings
^^^^^^^^^^^^^^^^^^^^^

+--------------------+---------+-------------+--------------------------------------------------------------------------+
| Field | Type |Default Value| Description |
+====================+=========+=============+==========================================================================+
|``verify_fp`` | Boolean | ``False`` | If ``True``, Warp will check that inputs and outputs are finite before |
| | | | and/or after various operations. **Has performance implications.** |
+--------------------+---------+-------------+--------------------------------------------------------------------------+
|``verify_cuda`` | Boolean | ``False`` | If ``True``, Warp will check for CUDA errors after every launch and |
| | | | memory operation. CUDA error verification cannot be used during graph |
| | | | capture. **Has performance implications.** |
+--------------------+---------+-------------+--------------------------------------------------------------------------+
|``print_launches`` | Boolean | ``False`` | If ``True``, Warp will print details of every kernel launch to standard |
| | | | out (e.g. launch dimensions, inputs, outputs, device, etc.). |
| | | | **Has performance implications.** |
+--------------------+---------+-------------+--------------------------------------------------------------------------+
|``mode`` | String |``"release"``| Controls whether to compile Warp kernels in debug or release mode. |
| | | | Valid choices are ``"release"`` or ``"debug"``. |
| | | | **Has performance implications.** |
+--------------------+---------+-------------+--------------------------------------------------------------------------+
|``verbose`` | Boolean | ``False`` | If ``True``, additional information will be printed to standard out |
| | | | during code generation, compilation, etc. |
+--------------------+---------+-------------+--------------------------------------------------------------------------+
|``quiet`` | Boolean | ``False`` | If ``True``, Warp module initialization messages will be disabled. |
| | | | This setting does not affect error messages and warnings. |
+--------------------+---------+-------------+--------------------------------------------------------------------------+
|``kernel_cache_dir``| String | ``None`` | The path to the directory used for the user kernel cache. Subdirectories |
| | | | named ``gen`` and ``bin`` will be created in this directory. If ``None``,|
| | | | a directory will be automatically determined using |
| | | | `appdirs.user_cache_directory <https://github.com/ActiveState/appdirs>`_ |
| | | | |
+--------------------+---------+-------------+--------------------------------------------------------------------------+
|``enable_backward`` | Boolean | ``True`` | If ``True``, backward passes of kernels will be compiled by default. |
| | | | Disabling this setting can reduce kernel compilation times. |
+--------------------+---------+-------------+--------------------------------------------------------------------------+
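
For example, a short sketch of adjusting a few of these settings before initialization (the particular values are
illustrative):

.. code-block:: python

    import warp as wp

    wp.config.quiet = True              # suppress module initialization messages
    wp.config.mode = "debug"            # compile kernels in debug mode
    wp.config.enable_backward = False   # skip backward-pass compilation by default

    wp.init()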

Advanced Global Settings
^^^^^^^^^^^^^^^^^^^^^^^^

+--------------------+---------+-------------+--------------------------------------------------------------------------+
| Field | Type |Default Value| Description |
+====================+=========+=============+==========================================================================+
|``cache_kernels`` | Boolean | ``True`` | If ``True``, kernels that have already been compiled from previous |
| | | | application launches will not be recompiled. |
+--------------------+---------+-------------+--------------------------------------------------------------------------+
|``cuda_output`` | String | ``None`` | The preferred CUDA output format for kernels. Valid choices are ``None``,|
| | | | ``"ptx"``, and ``"cubin"``. If ``None``, a format will be determined |
| | | | automatically. |
+--------------------+---------+-------------+--------------------------------------------------------------------------+
|``ptx_target_arch`` | Integer | 70 | The target architecture for PTX generation. |
+--------------------+---------+-------------+--------------------------------------------------------------------------+
|``llvm_cuda`` | Boolean | ``False`` | If ``True``, Clang/LLVM will be used to compile CUDA code instead of |
| | | | NVRTC. |
+--------------------+---------+-------------+--------------------------------------------------------------------------+

Module Settings
---------------

Module-level settings to control runtime compilation and code generation may be changed by passing a dictionary of
option pairs to ``wp.set_module_options()``.

For example, compilation of backward passes for the kernels in an entire module can be disabled with:

.. code:: python

    wp.set_module_options({"enable_backward": False})

The options for a module can also be queried using ``wp.get_module_options()``.

+--------------------+---------+-------------+--------------------------------------------------------------------------+
| Field | Type |Default Value| Description |
+====================+=========+=============+==========================================================================+
|``mode`` | String | Global | Controls whether to compile the module's kernels in debug or release |
| | | setting | mode by default. Valid choices are ``"release"`` or ``"debug"``. |
+--------------------+---------+-------------+--------------------------------------------------------------------------+
|``max_unroll`` | Integer | 16 | The maximum fixed-size loop to unroll. Note that ``max_unroll`` does not |
| | | | consider the total number of iterations in nested loops. This can result |
| | | | in a large amount of automatically generated code if each nested loop is |
| | | | below the ``max_unroll`` threshold. |
+--------------------+---------+-------------+--------------------------------------------------------------------------+
|``enable_backward`` | Boolean | Global | If ``True``, backward passes of kernels will be compiled by default. |
| | | setting | The module-level setting takes precedence over the global setting. |
+--------------------+---------+-------------+--------------------------------------------------------------------------+
|``fast_math`` | Boolean | ``False`` | If ``True``, CUDA kernels will be compiled with the ``--use_fast_math`` |
| | | | compiler option, which enables some fast math operations that are faster |
| | | | but less accurate. |
+--------------------+---------+-------------+--------------------------------------------------------------------------+
|``cuda_output`` | String | ``None`` | The preferred CUDA output format for kernels. Valid choices are ``None``,|
| | | | ``"ptx"``, and ``"cubin"``. If ``None``, a format will be determined |
| | | | automatically. The module-level setting takes precedence over the global |
| | | | setting. |
+--------------------+---------+-------------+--------------------------------------------------------------------------+
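
For example, a minimal sketch combining a few of these options (the values are illustrative, and it is assumed that
``wp.get_module_options()`` returns the options as a dictionary):

.. code-block:: python

    import warp as wp

    # configure code generation for all kernels in the current module
    wp.set_module_options({"mode": "debug", "max_unroll": 4, "fast_math": False})

    # query the options back
    options = wp.get_module_options()
    print(options["mode"])  # "debug"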

Kernel Settings
---------------

``enable_backward`` is currently the only setting that can also be configured on a per-kernel level.
Backward-pass compilation can be disabled by passing an argument into the ``@wp.kernel`` decorator
as in the following example:

.. code-block:: python

    @wp.kernel(enable_backward=False)
    def scale_2(
        x: wp.array(dtype=float),
        y: wp.array(dtype=float),
    ):
        y[0] = x[0] ** 2.0
