Commit a6ec69e

manual edit to fix rendering errors and prune outdated files

1 parent 051cb1a commit a6ec69e

File tree

16 files changed: +66, -52 lines

Two binary files removed (-49.4 KB and -47.7 KB; contents not shown).

cuda_bindings/docs/source/environment_variables.rst

Lines changed: 6 additions & 4 deletions
@@ -4,6 +4,12 @@
 Environment Variables
 =====================
 
+Runtime Environment Variables
+-----------------------------
+
+- ``CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM`` : When set to 1, the default stream is the per-thread default stream. When set to 0, the default stream is the legacy default stream. This defaults to 0, for the legacy default stream. See `Stream Synchronization Behavior <https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html>`_ for an explanation of the legacy and per-thread default streams.
+
+
 Build-Time Environment Variables
 --------------------------------
 
@@ -13,7 +19,3 @@ Build-Time Environment Variables
 
 - ``CUDA_PYTHON_PARALLEL_LEVEL`` (previously ``PARALLEL_LEVEL``) : int, sets the number of threads used in the compilation of extension modules. Not setting it or setting it to 0 would disable parallel builds.
 
-Runtime Environment Variables
------------------------------
-
-- ``CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM`` : When set to 1, the default stream is the per-thread default stream. When set to 0, the default stream is the legacy default stream. This defaults to 0, for the legacy default stream. See `Stream Synchronization Behavior <https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html>`_ for an explanation of the legacy and per-thread default streams.
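Aside (not part of the diff): the runtime variable moved above takes effect when ``cuda.bindings`` is imported, so a minimal sketch of using it is simply setting it in the environment first. Only the variable name and its 0/1 semantics come from the document; the surrounding scaffolding is illustrative.

```python
# Sketch: opt in to the per-thread default stream before importing
# cuda.bindings. Setting the variable after import would have no effect.
import os

# "1" selects the per-thread default stream; "0" (the default) keeps
# the legacy default stream.
os.environ["CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM"] = "1"

# import cuda.bindings  # must happen after the variable is set
```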

cuda_bindings/docs/source/install.rst

Lines changed: 5 additions & 5 deletions
@@ -27,11 +27,11 @@ Installing from PyPI
 
    $ pip install -U cuda-python
 
-Install all optional dependencies with::
+Install all optional dependencies with:
 
 .. code-block:: console
 
-   pip install -U cuda-python[all]
+   $ pip install -U cuda-python[all]
 
 Where the optional dependencies include:
 
@@ -53,7 +53,7 @@ Installing from Conda
 
 When using conda, the ``cuda-version`` metapackage can be used to control the versions of CUDA Toolkit components that are installed to the conda environment.
 
-For example::
+For example:
 
 .. code-block:: console
 
@@ -72,7 +72,7 @@ Requirements
 
 [^2]: The CUDA Runtime static library (``libcudart_static.a`` on Linux, ``cudart_static.lib`` on Windows) is part of the CUDA Toolkit. If using conda packages, it is contained in the ``cuda-cudart-static`` package.
 
-Source builds require that the provided CUDA headers are of the same major.minor version as the ``cuda.bindings`` you're trying to build. Despite this requirement, note that the minor version compatibility is still maintained. Use the ``CUDA_HOME`` (or ``CUDA_PATH``) environment variable to specify the location of your headers. For example, if your headers are located in ``/usr/local/cuda/include``, then you should set ``CUDA_HOME`` with::
+Source builds require that the provided CUDA headers are of the same major.minor version as the ``cuda.bindings`` you're trying to build. Despite this requirement, note that the minor version compatibility is still maintained. Use the ``CUDA_HOME`` (or ``CUDA_PATH``) environment variable to specify the location of your headers. For example, if your headers are located in ``/usr/local/cuda/include``, then you should set ``CUDA_HOME`` with:
 
 .. code-block:: console
 
@@ -87,7 +87,7 @@ See `Environment Variables <environment_variables.rst>`_ for a description of ot
 Editable Install
 ^^^^^^^^^^^^^^^^
 
-You can use::
+You can use:
 
 .. code-block:: console
 
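Aside (not part of the diff): the ``CUDA_HOME``/``CUDA_PATH`` lookup described in the install notes can be sketched as below. The helper name ``find_cuda_include`` is hypothetical, not part of ``cuda.bindings``; it only illustrates the documented precedence of the two variables.

```python
# Sketch: resolve the CUDA include directory from CUDA_HOME, falling
# back to CUDA_PATH, as the install notes describe for source builds.
import os


def find_cuda_include(environ=None):
    """Return <root>/include for the first of CUDA_HOME/CUDA_PATH that exists, else None."""
    environ = os.environ if environ is None else environ
    for var in ("CUDA_HOME", "CUDA_PATH"):
        root = environ.get(var)
        if root:
            include = os.path.join(root, "include")
            if os.path.isdir(include):
                return include
    return None
```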

cuda_bindings/docs/source/overview.rst

Lines changed: 21 additions & 21 deletions
@@ -48,7 +48,7 @@ import this dependency as well.
    import numpy as np
 
 Error checking is a fundamental best practice when working with low-level interfaces.
-The following code snippet lets us validate each API call and raise exceptions in case of error.::
+The following code snippet lets us validate each API call and raise exceptions in case of error:
 
 .. code-block:: python
 
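Aside (not part of the diff): the code-block this hunk introduces (the error-checking helper) lies outside the diff context. A stand-alone sketch of the pattern follows; the real helper in overview.rst decodes ``CUresult``/``nvrtcResult`` codes via the CUDA bindings, while this version only assumes the bindings' convention that every call returns ``(status, value1, value2, ...)``, so it runs without a GPU.

```python
# Sketch of the validate-every-API-call pattern referenced above.

def check_cuda_errors(result):
    """Unwrap (status, *values); raise if the status code is nonzero."""
    status, *values = result
    if status != 0:
        raise RuntimeError(f"CUDA API call failed with error code {status}")
    if not values:
        return None
    return values[0] if len(values) == 1 else values

# Stubbed successful call returning one value (e.g. a device ordinal):
device_ordinal = check_cuda_errors((0, 0))
```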
@@ -99,7 +99,7 @@ Go ahead and compile the kernel into PTX. Remember that this is executed at runt
 
 In the following code example, the Driver API is initialized so that the NVIDIA driver
 and GPU are accessible. Next, the GPU is queried for their compute capability. Finally,
-the program is compiled to target our local compute capability architecture with FMAD disabled.::
+the program is compiled to target our local compute capability architecture with FMAD disabled:
 
 .. code-block:: python
 
@@ -129,7 +129,7 @@ the program is compiled to target our local compute capability architecture with
 Before you can use the PTX or do any work on the GPU, you must create a CUDA
 context. CUDA contexts are analogous to host processes for the device. In the
 following code example, a handle for compute device 0 is passed to
-``cuCtxCreate`` to designate that GPU for context creation.::
+``cuCtxCreate`` to designate that GPU for context creation:
 
 .. code-block:: python
 
@@ -139,7 +139,7 @@ following code example, a handle for compute device 0 is passed to
 With a CUDA context created on device 0, load the PTX generated earlier into a
 module. A module is analogous to dynamically loaded libraries for the device.
 After loading into the module, extract a specific kernel with
-``cuModuleGetFunction``. It is not uncommon for multiple kernels to reside in PTX.::
+``cuModuleGetFunction``. It is not uncommon for multiple kernels to reside in PTX:
 
 .. code-block:: python
 
@@ -152,7 +152,7 @@ After loading into the module, extract a specific kernel with
 Next, get all your data prepared and transferred to the GPU. For increased
 application performance, you can input data on the device to eliminate data
 transfers. For completeness, this example shows how you would transfer data to
-and from the device.::
+and from the device:
 
 .. code-block:: python
 
@@ -175,7 +175,7 @@ execution.
 
 Python doesn't have a natural concept of pointers, yet ``cuMemcpyHtoDAsync`` expects
 ``void*``. This is where we leverage NumPy's data types to retrieve each host data pointer
-by calling ``XX.ctypes.data`` for the associated XX.::
+by calling ``XX.ctypes.data`` for the associated XX:
 
 .. code-block:: python
 
@@ -196,7 +196,7 @@ With data prep and resources allocation finished, the kernel is ready to be
 launched. To pass the location of the data on the device to the kernel execution
 configuration, you must retrieve the device pointer. In the following code
 example, we call ``int(XXclass)`` to retrieve the device pointer value for the
-associated XXclass as a Python ``int`` and wrap it in a ``np.array`` type.::
+associated XXclass as a Python ``int`` and wrap it in a ``np.array`` type:
 
 .. code-block:: python
 
@@ -209,14 +209,14 @@ but this time it's of type ``void**``. What this means is that our argument list
 be a contiguous array of ``void*`` elements, where each element is the pointer to a kernel
 argument on either host or device. Since we already prepared each of our arguments into a ``np.array`` type, the
 construction of our final contiguous array is done by retrieving the ``XX.ctypes.data``
-of each kernel argument.::
+of each kernel argument:
 
 .. code-block:: python
 
    args = [a, dX, dY, dOut, n]
    args = np.array([arg.ctypes.data for arg in args], dtype=np.uint64)
 
-Now the kernel can be launched::
+Now the kernel can be launched:
 
 .. code-block:: python
 
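Aside (not part of the diff): the argument-packing step in this hunk can be exercised without a GPU by packing host addresses into a contiguous ``uint64`` array and reading one back through ctypes to confirm each element addresses its argument. The concrete values below are illustrative stand-ins, not taken from the example.

```python
# Pack kernel-argument addresses into one contiguous array of void*-sized
# elements, then dereference the first to show it addresses the host value.
import ctypes

import numpy as np

a = np.array(2.0, dtype=np.float64)   # e.g. the scalar multiplier
n = np.array(8, dtype=np.uint32)
dX = np.array(0, dtype=np.uint64)     # stand-ins for device pointers
dY = np.array(0, dtype=np.uint64)
dOut = np.array(0, dtype=np.uint64)

args = [a, dX, dY, dOut, n]
arg_ptrs = np.array([arg.ctypes.data for arg in args], dtype=np.uint64)

# arg_ptrs[0] holds the address of ``a``; reading through it recovers 2.0.
a_back = ctypes.cast(int(arg_ptrs[0]), ctypes.POINTER(ctypes.c_double)).contents.value
```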
@@ -245,7 +245,7 @@ data transfers. That ensures that the kernel's compute is performed only after
 the data has finished transfer, as all API calls and kernel launches within a
 stream are serialized. After the call to transfer data back to the host is
 executed, ``cuStreamSynchronize`` is used to halt CPU execution until all operations
-in the designated stream are finished.::
+in the designated stream are finished:
 
 .. code-block:: python
 
@@ -255,7 +255,7 @@ in the designated stream are finished.::
        raise ValueError("Error outside tolerance for host-device vectors")
 
 Perform verification of the data to ensure correctness and finish the code with
-memory clean up.::
+memory clean up:
 
 .. code-block:: python
 
@@ -277,7 +277,7 @@ kernel performance and `CUDA
 Events <https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/>`_
 was used for application performance.
 
-The following command was used to profile the applications::
+The following command was used to profile the applications:
 
 .. code-block:: shell
 
@@ -323,7 +323,7 @@ Using NumPy
 
 NumPy `Array objects <https://numpy.org/doc/stable/reference/arrays.html>`_ can be used to fulfill each of these conditions directly.
 
-Let's use the following kernel definition as an example::
+Let's use the following kernel definition as an example:
 
 .. code-block:: python
 
@@ -404,7 +404,7 @@ This example uses the following types:
 
 Note how all three pointers are ``np.intp`` since the pointer values are always a representation of an address space.
 
-Putting it all together::
+Putting it all together:
 
 .. code-block:: python
 
@@ -429,13 +429,13 @@ Putting it all together::
 The final step is to construct a ``kernelParams`` argument that fulfills all of the launch API conditions. This is made easy because each array object comes
 with a `ctypes <https://numpy.org/doc/stable/reference/generated/numpy.ndarray.ctypes.html#numpy.ndarray.ctypes>`_ data attribute that returns the underlying ``void*`` pointer value.
 
-By having the final array object contain all pointers, we fulfill the contiguous array requirement::
+By having the final array object contain all pointers, we fulfill the contiguous array requirement:
 
 .. code-block:: python
 
    kernelParams = np.array([arg.ctypes.data for arg in kernelValues], dtype=np.intp)
 
-The launch API supports `Buffer Protocol <https://docs.python.org/3/c-api/buffer.html>`_ objects, therefore we can pass the array object directly.::
+The launch API supports `Buffer Protocol <https://docs.python.org/3/c-api/buffer.html>`_ objects, therefore we can pass the array object directly:
 
 .. code-block:: python
 
@@ -463,7 +463,7 @@ The ctypes approach treats the ``kernelParams`` argument as a pair of two tuples
 The ctypes `fundamental data types <https://docs.python.org/3/library/ctypes.html#fundamental-data-types>`_ documentation describes the compatibility between different Python types and C types.
 Furthermore, `custom data types <https://docs.python.org/3/library/ctypes.html#calling-functions-with-your-own-custom-data-types>`_ can be used to support kernels with custom types.
 
-For this example the result becomes::
+For this example the result becomes:
 
 .. code-block:: python
 
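Aside (not part of the diff): the two-tuple ``kernelParams`` form this hunk touches can be sketched without a GPU. The values and type pairing below are illustrative, not taken from the example's kernel; only the (values tuple, ctypes-types tuple) shape comes from the document.

```python
# Sketch: the ctypes form of kernelParams is (tuple of values, tuple of
# ctypes types); the launch API converts each value with its paired type.
import ctypes

scale = 2.0
d_x = 0x7F0000001000      # hypothetical device pointer values
d_out = 0x7F0000002000
n = 1024

kernelValues = (scale, d_x, d_out, n)
kernelTypes = (ctypes.c_double, ctypes.c_void_p, ctypes.c_void_p, ctypes.c_size_t)
kernelParams = (kernelValues, kernelTypes)

# What the launch API does conceptually with the pair:
converted = [ctype(value) for value, ctype in zip(*kernelParams)]
```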
@@ -502,7 +502,7 @@ Values that are set to ``None`` have a special meaning:
 
 In all three cases, the API call will fetch the underlying pointer value and construct a contiguous array with other kernel parameters.
 
-With the setup complete, the kernel can be launched::
+With the setup complete, the kernel can be launched:
 
 .. code-block:: python
 
@@ -520,7 +520,7 @@ CUDA objects
 
 Certain CUDA kernels use native CUDA types as their parameters such as ``cudaTextureObject_t``. These types require special handling since they're neither a primitive ctype nor a custom user type. Since ``cuda.bindings`` exposes each of them as Python classes, they each implement ``getPtr()`` and ``__int__()``. These two callables used to support the NumPy and ctypes approach. The difference between each call is further described under `Tips and Tricks <https://nvidia.github.io/cuda-python/cuda-bindings/latest/tips_and_tricks.html#>`_.
 
-For this example, lets use the ``transformKernel`` from `examples/0_Introduction/simpleCubemapTexture_test.py <https://github.com/NVIDIA/cuda-python/blob/main/cuda_bindings/examples/0_Introduction/simpleCubemapTexture_test.py>`_::
+For this example, lets use the ``transformKernel`` from `examples/0_Introduction/simpleCubemapTexture_test.py <https://github.com/NVIDIA/cuda-python/blob/main/cuda_bindings/examples/0_Introduction/simpleCubemapTexture_test.py>`_:
 
 .. code-block:: python
 
@@ -539,7 +539,7 @@ For this example, lets use the ``transformKernel`` from `examples/0_Introduction
    tex = checkCudaErrors(cudart.cudaCreateTextureObject(texRes, texDescr, None))
    ...
 
-For NumPy, we can convert these CUDA types by leveraging the ``__int__()`` call to fetch the address of the underlying ``cudaTextureObject_t`` C object and wrapping it in a NumPy object array of type ``np.intp``::
+For NumPy, we can convert these CUDA types by leveraging the ``__int__()`` call to fetch the address of the underlying ``cudaTextureObject_t`` C object and wrapping it in a NumPy object array of type ``np.intp``:
 
 .. code-block:: python
 
@@ -550,7 +550,7 @@ For NumPy, we can convert these CUDA types by leveraging the ``__int__()`` call
    )
    kernelArgs = np.array([arg.ctypes.data for arg in kernelValues], dtype=np.intp)
 
-For ctypes, we leverage the special handling of ``None`` type since each Python class already implements ``getPtr()``::
+For ctypes, we leverage the special handling of ``None`` type since each Python class already implements ``getPtr()``:
 
 .. code-block:: python
 

cuda_bindings/docs/source/release/11.8.6-notes.rst

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ Optional dependencies are added for packages:
 - nvidia-cuda-nvrtc-cu12
 
 Installing these dependencies with ``cuda-python`` can be done using:
+
 .. code-block:: shell
 
    pip install cuda-python[all]

cuda_bindings/docs/source/release/12.8.0-notes.rst

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ Optional dependencies are added for packages:
 - nvidia-nvjitlink-cu12
 
 Installing these dependencies with ``cuda-python`` can be done using:
+
 .. code-block:: shell
 
    pip install cuda-python[all]

cuda_bindings/docs/source/tips_and_tricks.rst

Lines changed: 4 additions & 4 deletions
@@ -7,16 +7,16 @@ Tips and Tricks
 Getting the address of underlying C objects from the low-level bindings
 =======================================================================
 
-All CUDA C types are exposed to Python as Python classes. For example, the :class:`~cuda.bindings.driver.CUstream` type is exposed as a class with methods :meth:`~cuda.bindings.driver.CUstream.getPtr()` and :meth:`~cuda.bindings.driver.CUstream.__int__()` implemented.
-
-There is an important distinction between the ``getPtr()`` method and the behaviour of ``__int__()``. Since a ``CUstream`` is itself just a pointer, calling ``instance_of_CUstream.getPtr()`` returns the pointer *to* the pointer, instead of the value of the ``CUstream`` C object that is the pointer to the underlying stream handle. ``int(instance_of_CUstream)`` returns the value of the ``CUstream`` converted to a Python int and is the actual address of the underlying handle.
-
 .. warning::
 
    Using ``int(cuda_obj)`` to retrieve the underlying address of a CUDA object is deprecated and
   subject to future removal. Please switch to use :func:`~cuda.bindings.utils.get_cuda_native_handle`
   instead.
 
+All CUDA C types are exposed to Python as Python classes. For example, the :class:`~cuda.bindings.driver.CUstream` type is exposed as a class with methods :meth:`~cuda.bindings.driver.CUstream.getPtr()` and :meth:`~cuda.bindings.driver.CUstream.__int__()` implemented.
+
+There is an important distinction between the ``getPtr()`` method and the behaviour of ``__int__()``. Since a ``CUstream`` is itself just a pointer, calling ``instance_of_CUstream.getPtr()`` returns the pointer *to* the pointer, instead of the value of the ``CUstream`` C object that is the pointer to the underlying stream handle. ``int(instance_of_CUstream)`` returns the value of the ``CUstream`` converted to a Python int and is the actual address of the underlying handle.
+
 
 Lifetime management of the CUDA objects
 =======================================
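Aside (not part of the diff): the ``getPtr()``/``__int__()`` distinction restated in this hunk can be modeled with plain ctypes, no GPU required. A ``CUstream``-like C object is itself just a pointer, so ``getPtr()`` corresponds to the address *of* that pointer variable, while ``int()`` yields the pointer's value. The handle value below is made up.

```python
# Model the CUstream getPtr()/__int__() distinction with a bare pointer.
import ctypes

handle = 0xDEADBEEF                      # pretend underlying stream handle
custream = ctypes.c_void_p(handle)       # the C object is itself a pointer

as_int = custream.value                  # ~ int(instance_of_CUstream)
ptr_to_ptr = ctypes.addressof(custream)  # ~ instance_of_CUstream.getPtr()

# Reading through ptr_to_ptr (a pointer *to* the pointer) recovers as_int.
recovered = ctypes.cast(ptr_to_ptr, ctypes.POINTER(ctypes.c_void_p)).contents.value
```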
