
Releases: eyalroz/cuda-api-wrappers

Version 0.8.0 Release Candidate 1: nvFatbin, CUDA 12.x features, sync-async op unification, static targets, etc.

18 Nov 15:39

Changes since v0.7.1:

Support for the nvFatbin library (#681)

  • The API wrappers now support NVIDIA's "fat binary" file format creation/marshalling library, nvFatbin. It is supported via a cuda::fatbin_builder_t class: one creates a builder, adds various fragments of fatbin-contained content (cubin, PTX, LTO IR etc.), and finally uses the build() or build_at() method to obtain the completed, final fatbin file data in a region of memory (see the sketch after this list).
  • The project's CMake now exports a new target, cuda-api-wrappers::fatbin, on which one should depend when actually using the builder.
  • NVIDIA has not fully documented this library, so some functionality is not fully articulated, and some is only partially supported (specifically, passing extra options when adding LTO IR or PTX).
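
For a feel of the intended workflow, here is a minimal sketch. Only build() and build_at() are named above; the header, the builder-creation call and the fragment-adding method are illustrative assumptions, not confirmed API:

```cpp
#include <cuda/api.hpp>   // assumed umbrella header; a dedicated fatbin header may also be needed

// Builds a fatbin into caller-provided memory, from a single PTX fragment.
cuda::memory::region_t make_fatbin(
    cuda::span<const char> ptx_fragment,    // PTX text obtained elsewhere
    cuda::memory::region_t output_buffer)   // destination for the finished fatbin data
{
    auto builder = cuda::fatbin_builder::create();  // hypothetical factory for a fatbin_builder_t
    builder.add_ptx(ptx_fragment);                  // hypothetical fragment-adding method
    return builder.build_at(output_buffer);         // build_at() is named above: writes into the given region
}
```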

Support for more CUDA 12.x features

  • #669 : Can now obtain the kernels available in a given cuda::module_t, with the method unique_span<kernel_t> get_kernels() const (see the sketch below).
  • #670 : Can now obtain a kernel's name, and the module containing it, via the kernel's handle. Only the mangled kernel name is accessible, so the method is named accordingly: kernel_t::mangled_name() (regards #674)
  • #675 : Can now query CUDA's module loading mode (lazy or eager)

(Note these features are not accessible if you're using the wrappers with CUDA 11.x)
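
The first two items combine naturally. A small sketch, assuming `module` is a cuda::module_t obtained elsewhere:

```cpp
#include <cuda/api.hpp>
#include <iostream>

// Prints the (mangled) names of all kernels in a loaded module.
void list_module_kernels(const cuda::module_t& module)
{
    auto kernels = module.get_kernels();   // returns a unique_span<kernel_t>, per #669
    for (const auto& kernel : kernels) {
        std::cout << kernel.mangled_name() << '\n';  // only the mangled name is accessible (#670)
    }
}
```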

More unique_span class changes

Like a recently-cut gem which one slowly polishes until it gains its proper shine... we did some work on unique_span in version 0.7.1, and it continues in this version:

  • #678 : The deleter is now instance-specific, so it is possible to allocate in more than one way, depending even on the span size, while keeping the use of such unique_spans decoupled from the allocation decisions. Also, the deleter now takes a span, not just a pointer, so it can make decisions based on the allocation size (see the sketch after this list).
  • #665 :
    • Simplified the swap() implementation
    • Removed some redundant code
    • Shortened some code
    • Can now properly convert from a span of T to a span of const T.
    • Neither release() nor our move construction can be noexcept; removed that marking, which had been based only on optimism
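
To illustrate #678, here is the rough shape such a deleter can now take; how one attaches it when constructing a unique_span is not spelled out here, and the exact invocation signature is an assumption:

```cpp
#include <cuda/api.hpp>
#include <cstdio>
#include <cstdlib>

// A deleter with per-instance state, invoked with the whole span rather than
// just a pointer, so it can also act on the allocation's size.
struct reporting_deleter {
    const char* tag;   // instance-specific: two unique_spans can carry different tags

    void operator()(cuda::span<float> allocation) const {
        std::fprintf(stderr, "[%s] releasing %zu elements\n", tag, allocation.size());
        std::free(allocation.data());  // assuming this particular span wraps a plain host allocation
    }
};
```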

optional_ref & partial unification of async and non-async memory operations

  • #691 : Added an optional_ref class, for passing optional arguments which are references. See Foonathan's blog post about the problems of putting references into C++ optionals.
  • #689 : Memory-related operations which had cuda::memory::foo() and cuda::memory::async::foo() variants now have a single variant, cuda::memory::foo(), which takes an extra optional_ref<stream_t> parameter: when it's not set, the operation is synchronous; when it is set, the operation is asynchronous and scheduled on that stream (as sketched below). But note the "fine print" w.r.t. synchronous and asynchronous CUDA operations in the Runtime and Driver API reference documents.
  • #688 : Can now asynchronously copy 2D and 3D data using "copy parameters" structures
  • #687 : The synchronous and asynchronous variants of copy_single() had disagreed: one took a pointer, the other a reference. With their unification, they now agree (and take a pointer).
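
A sketch of the unified call form, assuming a copy() overload taking a destination span, a source span and the new optional stream reference (the actual overload set is wider):

```cpp
#include <cuda/api.hpp>

void copy_twice(
    cuda::span<float>       device_data,   // e.g. wrapping a device-side allocation
    cuda::span<const float> host_data,
    cuda::stream_t&         stream)
{
    // No stream argument: the operation is synchronous
    cuda::memory::copy(device_data, host_data);

    // Same call, with a stream: the copy is scheduled asynchronously on that stream
    cuda::memory::copy(device_data, host_data, stream);
    stream.synchronize();
}
```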

Bug fixes

Poor man's optional class

  • #664, #666 : Tweaked the class to avoid build failure in MSVC
  • #676 : value_or() now returns a value...
  • #682 : value_or() is now const

In example programs

  • #672 : The simpleCudaGraphs example program was crashing due to gratuitous setting of launch configuration parameters
  • #673 : Potential use-before-initialization in the simpleIPC example

Other changes

Build mechanism

  • #699 : Now exposing targets with a _static suffix, which in turn depend on the static versions of CUDA libraries, when those are available. For example, we now have both cuda-api-wrappers::rtc and cuda-api-wrappers::rtc_static
  • #694 : Now properly building fatbin files on systems with multiple GPUs of different compute capabilities

In the wrapper APIs themselves

  • #667 : Added missing noexcept designators to some dimension-class methods
  • #671 : A bit of macro renaming to avoid clashing with other libraries
  • #684 : Now taking more linear sizes as size_t's in launch_config_builder_t's methods, so as to prevent narrowing-cast warnings, with limits checked by the builder itself.
  • #686 : When loading a kernel from a library, one can now specify the context in which to obtain the kernel.
  • #698 : Add a shortcut function for getting the default device: cuda::device::default().

In example programs

  • #680 : Added an example program invoking cuBLAS for matrix multiplication (cuBLAS is not wrapped by this library)
  • #683 : The random number generation is now the same in all vectorAdd example programs

Version 0.7.1: Work on unique_span, minor bug fixes and tweaks

16 Jul 08:38

Changes since v0.7.0:

CUDA Graphs

  • #532 Now supporting empty nodes as graph::typed_node's.
  • #656 When capturing a graph on a CUDA stream, default to using the global capture mode
  • #657 Offer stand-alone functions for beginning and ending stream capture (as opposed to only stream_t class methods)

unique_span class changes

Remember: unique_span<T, Deleter> is like a unique_ptr<T, Deleter>, but with a specified size and with interoperability with std::span/cuda::span.
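
For illustration, a minimal usage sketch; the make_unique_span() factory name is an assumption modeled on the library's make_unique() functions, and the copy() call relies on the container-copy support mentioned in the 0.6.9 notes below:

```cpp
#include <cuda/api.hpp>
#include <vector>

void upload(const cuda::device_t& device, const std::vector<float>& host_data)
{
    // Owns host_data.size() floats on `device`, and knows that size
    auto device_data = cuda::memory::device::make_unique_span<float>(device, host_data.size());

    // Usable directly wherever a span or region is expected, without releasing ownership
    cuda::memory::copy(device_data, host_data);
}   // the device allocation is released here, as with a unique_ptr
```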

  • #662 unique_span is no longer constructible from unique_spans with other deleter classes via an implicit conversion into a span; it is also no longer constructible from untyped memory regions.
  • #660 unique_span assignment operator bug fixes.
  • #652 unique_span is now default-constructible.

Other changes

  • #651 Can now access the default stream of a non-primary context via stream_t
  • #653 Can now launch in an arbitrary CUDA context without referring to a specific stream (i.e. using the default stream for the context)
  • #654 The logic of context_t::create_module() is now the same as for context::module::create()
  • #659 The methods context_t::create_stream() and context_t::create_event() are now marked const

Version 0.7.0: Graph support

09 Jul 10:32

Changes since v0.6.9:

Graph support

The wrappers library now supports the creation and manipulation of CUDA graphs - meriting a version number bump.

One can now:

  • Construct graphs (= graph templates) directly
  • Capture graphs (= graph templates) on streams
  • Instantiate and launch graph templates

... all using a more convenient interface, similar to non-graph CUDA API calls. Two examples of this kind of code have been added, both adaptations of NVIDIA CUDA samples.

The main class templates are: template_t, node_t, typed_node_t, instance_t - all in namespace cuda::graph.

Most, but not all, graph capabilities are supported.
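
For flavor only, a rough capture-instantiate-launch sketch; the class names are the ones listed above, but every method name here (the capture calls, instantiate(), launch()) is an assumption about the wrapper API:

```cpp
#include <cuda/api.hpp>

void capture_and_run(cuda::stream_t& stream)
{
    // Capture work enqueued on the stream into a graph template (method names assumed)
    stream.begin_capture();
    // ... enqueue kernels, copies etc. on `stream` ...
    cuda::graph::template_t graph_template = stream.end_capture();

    // Instantiate the template, then launch the instance (method names assumed)
    cuda::graph::instance_t instance = graph_template.instantiate();
    instance.launch(stream);
    stream.synchronize();
}
```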

Other minor changes

  • #649: Respect deprecation of shared memory bank size setting as of CUDA 12.3

Build-related

  • Avoiding more MSVC compilation warnings

Version 0.6.9: Documentation update, unique_span's, bug fixes, many small improvements

10 Mar 23:10

(This is planned to be the last release before 0.7.0, which will add support for CUDA graphs.)

Changes since v0.6.8:

Memory allocation & copying-related changes

  • #606 Can now copy directly to and from containers with contiguous storage - without going through pointers or specifying the size

Owning typed and untyped memory: unique_span and unique_region

  • #291 Added a unique_span<T> template class, combining the functionality of cuda::unique_ptr and cuda::span (and being somewhat similar to std::dynarray which almost made it into C++14). Many CUDA programs want to represent both the ownership of allocated memory, and the range of that memory for actual use, in the same variable - without the on-the-fly reallocation behavior of std::vector. This is now possible. Also implemented an untyped version of this, named unique_region.
  • #617 Replaced memory::external::mapped_region_t with memory::unique_region
  • #601 Added an empty() method to cuda::span (to match that of std::span, as it is now sometimes used)
  • #603 Use unique_span instead of our cuda::dynarray (which had been an std::vector under the hood), in various places in the API, especially RTC
  • #610 Return unique_span's from the cuda::rtc::program_output class methods which allocated their own buffers: The methods for getting the compilation log, the cubin data, the PTX and the LTO IR.

More robust memory regions (memory::region_t)

  • #592 Changed the approach used in v0.6.8 to bring managed regions and general regions in line with each other; now, memory::managed::region_t inherits memory::region_t
  • #594 Now using memory::region_t for mapped memory rather than a different, mapped-memory specific region class
  • #602 Make memory::region_t more constexpr-friendly
  • #604 memory::region_t's are now CUDA-independent, i.e. do not utilize any CUDA-specific definitions
  • #605 Can now construct const_region_t's from rvalue references to regions
  • #640 The user no longer needs to know about range_attribute_t or advice_t - those are left to detail_ namespaces; also, fixed the implementation of attribute setting for non-device-specific attributes
  • #647 Mapped memory: Can now implicitly convert memory::mapped::span_pair_t<T> into a pair of region_t's

Documentation & comments

  • #595 Correct the documentation for supports_memory_pools

Launch configuration & launch config builder changes

  • #596 Corrected a check against the associated device in the kernel-setting method of the launch config builder
  • #619, #618 Fixed launch configuration comparisons, and now using defaulted comparison operators
  • #619 Fixed a bug in checking whether some CUDA-12-introduced launch config parameters are set

CUDA libraries and in-library, non-associated kernel support

  • #598 Corrected the API and implementation of get_attribute() and set_attribute() for library kernels

Internal refactoring

  • #607 Split off a detail/type_traits.hpp from types.hpp
  • #620 context::current::scoped_override_t now declared in current_context.hpp
  • #611 Reduced code repetition between context_t and primary_context_t.
  • #622 link::marshalled_options_t and link::option_t are now in the detail_ namespace - the user should typically never use them directly
  • #624 Now collecting the log-related link options into a sub-structure of link::options_t
  • #625 Dropped specify_default_load_caching_mode from link::options_t, in favor of using an stdx::optional
  • #626 Now using optional's instead of bespoke constructs in pci_location_t
  • #628 Corrected the signature of context::current::peer_to_peer functions
  • #630 Moved program_base_t into the detail_ namespace
  • #632 Moved rtc::marshalled_options_t and rtc::marshal() into the detail_ subnamespace - users should not need to use these themselves
  • #643 Moved memory::pool::ipc::ptr_handle_t out of ipc.hpp up into types.hpp (so that memory_pool.hpp doesn't depend on ipc.hpp)
  • #621 Renamed: link::fallback_strategy_t -> link::fallback_strategy_for_binary_code_t
  • #600 Now adhering to underscore suffix for proxy class field names

Other changes

  • #599 An invalid file, name_caching_program.hpp, had snuck into our code - removed it
  • #609 "Robustified" the buffers returned from cuda::rtc::program_output's various methods, so that they are all padded with an extra '\0' character past the end of the span's actual range. This is not necessarily and that data should hopefully not actually be reached, but - let's be
  • on the safe side.
  • #627 Dropped context-specification from host-memory allocator functions - it's not actually used
  • #629 Added a device ID field to the texture view object
  • #631 Dropped examples/rtc_common.hpp, which is no longer in use
  • #638 Dropped the native_word_t type
  • #639 Simplified the memory access permissions code somewhat + some renaming (access_permissions -> permissions)
  • #637 Devices and contexts no longer have flag-related members in their public interface; these are now just implementation details
  • #645 Bug fix: Now using the correct free() function in memory::managed::detail_::deleter

Version 0.6.8: CUDA 12.x features, launch config builder improvements, etc.

26 Feb 22:45

Changes since v0.6.7:

Build process & build configuration changes

  • #583 Exported targets now make sure apps link against some system libraries CUDA depends on (to circumvent CMake bug 25665).
  • #567 The runtime-and-driver target now has driver-and-runtime as an alias
  • #590 Avoid compiler warnings about narrowing conversions and shadowing (when those warning flags are turned on)

Launch configuration & launch config builder changes

  • #564 Can now launch kernels with the full range of CUDA 12.x launch attributes, including remote memory sync domain, programmatic launch dependence and programmatic completion events (see descriptions in the CUDA Driver API documentation).
  • #484 Support for setting block cluster dimensions (as part of the support of CUDA 12.x launch attributes).
  • #577, #582 More extensive validation of launch configurations when building them with the launch config builder gadget (see the sketch after this list).
  • #581 More robust comparison operators for dimension structures
  • #580 The launch config builder can now be told to "use the maximum number of active blocks per multiprocessor".
  • #579 The user can now set a target device in a launch config builder without setting a contextualized kernel
  • #578 Now using the launch config builder in more of the example programs
  • #569 Took care of an unused validation function which was triggering a warning with newer compilers
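
A sketch of typical builder use; overall_size() and the builder type appear elsewhere in these notes, while the factory function and the other setter names are best-effort assumptions:

```cpp
#include <cuda/api.hpp>

cuda::launch_configuration_t make_config(const cuda::kernel_t& kernel, size_t num_elements)
{
    return cuda::launch_config_builder()   // assumed factory returning a launch_config_builder_t
        .kernel(&kernel)                   // lets the builder validate against the kernel and its device
        .overall_size(num_elements)        // total number of threads over the whole grid
        .block_size(256)                   // threads per block (setter name assumed)
        .build();                          // where the more extensive validation (#577, #582) kicks in
}
```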

CUDA libraries and in-library, non-associated kernel support

  • #565 Now supporting "CUDA libraries" - files, or blocks of data in memory, containing compiled kernels which are not loaded immediately into modules within contexts, and whose kernels are device-and-context-independent. Both libraries and such kernels can now be represented and worked with.
  • #576 A module no longer holds the link options it was created with; those are not essential to its use, and are at times impossible to (re)create when obtaining a module from a library (which also doesn't hold its link options).

Refactoring

  • #586 The poor man's span class now has its own file (detail/span.hpp)
  • #588 Some under-the-hood refactoring of host memory allocation functions
  • #589 Factored the cuda::memory::region_t class into its own file

Other changes

  • #593 Some work on the cuda::memory::copy_parameters_t structure
  • #592 Dropped the memory::managed::region_t and const_region_t and now just using memory::region_t and const_region_t everywhere
  • #591 Memory copy functions for spans and other work on memory copy functions
  • #587 Added a missing variant of memory-zero'ing
  • #585 You can now write cuda::memory::make_unique() - and it's assumed you mean device memory (you have to specify a device or device context though)
  • #575 Moved apriori_compiled_kernel_t into the kernel namespace, yielding kernel::apriori_compiled_t
  • #570, #573 Removed some redundant inclusions and definitions
  • #568 Fixed some breakage of kernel_t::set_attribute()
  • #566 Can now properly get and set properties on kernels (raw functions and handles)

Compatibility

  • #572 Fixed broken CUDA 9.x compatibility

Version 0.6.7: Build & compatibility improvements

27 Dec 16:15

Changes since v0.6.6:

Build process & build configuration changes

  • #555 : No longer enabling any languages when configuring the project - it is not strictly necessary at that point.
  • #557 : Now setting EXPORT_ALL_SYMBOLS as a target property rather than globally (relevant for Windows builds mostly).
  • #558 : Now avoiding incompatibility with CMake 3.25-3.28, by avoiding adding a dependency to a CUDA:: target, and not trying to recreate it if it already exists.
  • #562 Slightly streamlined the package config file (and no longer preprocessing it).
  • #563 Now enabling the C++ language before declaring the dependence (with find_package()) on the Threads library.

Compatibility

  • #506 : Windows builds no longer fail due to missing cudaProfiler.h
  • #504 : Build on Windows should now also succeed when using cooperative groups.

Other changes

  • #556 Moved the CompileWithWarnings.cmake module into the examples/ subfolder, which is the only place it's used.
  • #561 Dropped an unused function and method in the launch config builder source code and class, respectively, and put some previously-orphaned validation logic to use when obtaining the grid and block dimensions, so as not to ignore the overall dimensions.

  • Want to help me with testing 0.7? Drop me a line... (it will have CUDA execution graph support)
  • Have you tried this version and are satisfied with the changes? Thumb-up just below this line to let others know as well.

Version 0.6.6: Minor bug fixes + compatibility improvements

04 Oct 20:27

Changes since v0.6.4:

Functionality improvements

  • #545 Now checking, and throwing, errors due to cudaGetLastError() after kernel launches (mostly grid errors)
  • #547 When compiling in debug mode, now performing more launch configuration validity checks before launching a kernel
  • #549 Avoiding some excessive device property querying.

Bug fixes

  • #539, #544 NVRTC compilation logs now returned without a trailing nul ('\0') character.
  • #542 More robust use of namespace in the library's macros, so they don't trigger compilation errors regardless of the namespace of the code you use them in
  • #543 Now retrieving correct error strings again for Runtime-API-only errors
  • #550 Fixed a wrong side of comparison in some block configuration logic of the launch config builder
  • #553 Replaced inappropriate use of cbegin() and cend() with begin() and end() in rtc::program::add_headers() code, which may take inputs lacking those two methods.

Compatibility

  • #546 Resolved a build failure on Windows involving rtc.hpp - fixed an overload resolution issue regarding compilation parameter marshalling.

Want to help me with testing 0.7? Drop me a line... (it will have CUDA execution graph support)

Version 0.6.4: Minor changes and bug fixes

26 Jul 21:21

Changes since v0.6.3:

Features

  • #528 Can now register memory areas with the CUDA driver as read-only (in addition to other parameters)
  • #519 Quick & somewhat-dirty support for the use of "external memory resources" (mostly Direct3D or NVSCI buffers and semaphores; though semaphores etc. are not included for now)

Bug fixes

  • #535 copy_parameters_t::set_endpoint_untyped() now properly calling an inner set_endpoint_untyped()
  • #527 copy_parameters_t::set_single_context no longer mistakenly taking an endpoint_t parameter
  • #528 When mapping a region pair, not insisting on the same address on both the host and the device (which had made it impossible for this to succeed with older GPUs).
  • #521 Avoid a compiler warning when overriding the context for the current scope
  • #520 Removed unnecessary uses of a context-wrapper scoped-context-override
  • #517 cuda::memory::typed_set() no longer mistakenly accepts values of size 8 (which have no special CUDA API call).
  • #516 Corrected types and casting in stream_t::enqueue_t::write_single_value()
  • #515 Resolved a case of missing includes when including only certain headers
  • #514 Now providing a definition of cuda::memory::managed::allocate(device, num_bytes) (which was declared but not defined)

Other changes

  • #522 Renamed synch to sync in multiple identifiers
  • #529 program_t::add_registered_globals() can now take any container of any string-like type.
  • #523 Now passing device_t's by const-ref in more cases, avoiding copying and enabling re-use of a reference to the primary context.
  • #521 Reduced boilerplate and avoided a warning when overriding the context for the current scope

Build issues

  • #504 Fixed build failure with cooperative_groups on GitHub Actions Windows runners and CUDA >= 11.7

Want to help me with testing 0.7? Drop me a line... (it will have CUDA execution graph support)

Version 0.6.3: Memory pool support, compatibility improvements, minor changes

29 Apr 17:33

Changes since v0.6.2:

Memory pool support (#249) and related changes

  • Added a cuda::memory::pool_t proxy class
  • Memory pools are created using cuda::memory::pool::create() or via device methods (see the sketch after this list)
  • IPC: Can import (and export) pools and their allocations to/from other processes
  • #485 Moved the physical_allocation namespace up from memory::virtual_ into memory, as it is used also for memory pools
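
A sketch of pool creation and use; pool::create() is mentioned above, while its parameters and the allocation call are assumptions about the wrapper API:

```cpp
#include <cuda/api.hpp>

void pool_example(const cuda::device_t& device, cuda::stream_t& stream)
{
    auto pool = cuda::memory::pool::create(device);             // creation function named above; signature assumed
    auto region = pool.allocate(stream, 1024 * sizeof(float));  // assumed stream-ordered allocation call
    // ... enqueue work on `stream` which uses `region` ...
    // (releasing the allocation back to the pool is omitted here)
    stream.synchronize();
}
```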

Bug fixes

  • #508 With CUDA >= 11.3, we no longer give up on creating a module from a compiled program just because no CUBIN is available - instead, we try to use the PTX, as with earlier CUDA versions.
  • #493 cuda::launch_config_builder_t::overall_size() now takes a cuda::grid::overall_dimension_t rather than a cuda::grid::dimension_t (not the same size).
  • #492 Avoiding inclusion of cooperative_groups.h from within the API headers

Other API changes

  • #511 Can now create CUDA runtime errors with a fully-user-overridden what() message (probably not very interesting outside the API's internals).
  • #510 When NVRTC/PTX compiler complains about an invalid option, we now include the options passed in the thrown exceptions
  • #499 No longer exposing deprecated surface-related API functions with CUDA 12 and later.
  • #498 The launch config builder class now supports num_blocks() as an alias for the grid_size() method.
  • #488 cuda::memory::host::allocate() now returns a cuda::memory::region_t, for better consistency.
  • #486 Some changes to cuda::kernel::wrap.
  • Renamed cuda::memory::attribute_value_type_t -> cuda::memory::attribute_value_t
  • #483 It's now easier to convert memory::region_t's into typed spans.
  • #482 Improvements to the built-in cuda::span (which is used when std::span is unavailable) - making it somewhat more compatible with std::span
  • Make more comparison operators constexpr and noexcept

Compatibility

  • Wrappers now build correctly (again) with --std=c++20.
  • #501 Added a new NVRTC error code introduced in CUDA 12.1
  • #500 When using CUDA 12, use the term "LTO IR" rather than "NVVM" as appropriate
  • #494 Work around an MSVC issue with variadic template-templates
  • #491 Avoiding some warnings issued by MSVC
  • #480 Added an example program built with each C++ standard after C++11 that the compiler supports

Build issues

  • Now requiring CMake version 3.25. You can download an up-to-date version from Kitware's website; it doesn't require any special installation.
  • #490 Switched from depending on CUDA::nvToolkitExt to depending on CUDA::nvtx, for CUDA versions 10.0 and above.

Version 0.6.2: Stream callback mechanism change

18 Mar 09:51

Changes since v0.6.1:

The most significant change in this version regards the way callbacks/host functions are supported. This change is motivated mostly as preparation for the upcoming introduction of CUDA graph support (not in this version), which will impose some stricter constraints on callbacks - precluding the hack we have been using so far.

So far, a callback was any object invokable with a cuda::stream_t parameter. From now on, we support two kinds of callback:

  • A plain function - not a closure, which may be invoked with a pointer to an arbitrary type: cuda::stream_t::enqueue_t::host_function_call(Argument * user_data)
  • An object invokable with no parameters - a closure, to which one cannot provide any additional information: cuda::stream_t::enqueue_t::host_invokable(Invokable& invokable)

This lets us avoid the combination of heap allocation at enqueue and deallocation at launch - which works well enough for now, but will not be possible when the same callback needs to be invoked multiple times. It also contradicted our presumption of not adding layers of abstraction over what CUDA itself provides.
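
To make the two kinds concrete, a sketch; the enqueue_t method names are the ones above, while the exact way the plain function and its argument are passed is an assumption:

```cpp
#include <cuda/api.hpp>
#include <cstdio>

void on_done(int* counter) { ++*counter; }   // a plain function, taking a user-data pointer

void enqueue_callbacks(cuda::stream_t& stream, int& counter)
{
    // Kind 1: plain function plus user data (argument-passing form assumed)
    stream.enqueue.host_function_call(on_done, &counter);

    // Kind 2: an invokable taking no parameters; nothing is heap-allocated on its
    // behalf, so it must stay alive until the stream actually invokes it
    auto announce = [] { std::puts("stream work finished"); };
    stream.enqueue.host_invokable(announce);

    stream.synchronize();  // ensure both callbacks have run before `announce` goes out of scope
}
```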

Of course, the release also has the "usual" long list of minor fixes.

Changes to existing API

  • #473 Redesign of host function / callback enqueue and launch mechanism, see above
  • #459 cuda::kernel::get() now takes a device, not a context - since it can't really do anything useful for non-primary contexts (the primary context being where apriori-compiled kernels are available)
  • #477 When creating a new program, we default to assuming it's CUDA C++ and do not require an explicit specification of that fact.

API additions

  • #468 Added a non-CUDA memory type enum value, and - can now check the memory type of any pointer without throwing an error.
  • #472 Can now pass cuda::memory::region_t's when enqueueing copy operations on streams (and thus also cuda::span<T>'s)
  • #466 Can now perform copies using cuda::memory::copy_parameters_t<N> (for N=2 or 3), a wrapper of the CUDA driver's richest parameters structure with multiple convenience functions, for maximum configurability of a copy operation. But - this structure is not currently "fool-proof", so use with care and initialize all relevant fields.
  • #463 Can now obtain a raw pointer's context and device without first wrapping it in a cuda::pointer_t
  • #452 Support enqueuing a memory barrier on a stream (one of the "batch stream memory operations")
  • Added a method of the launch configuration builder for indicating that no dynamic shared memory is used

Bug fixes

  • #475 device::get() no longer incorrectly marked as noexcept
  • #467 Array-to-raw-memory copy function now determines the context for the target area, and a new variant of the function takes the context as a parameter.
  • #455 Added missing definition of allocate_managed() in context.hpp
  • #453 Now actually setting the flags when enqueueing a flush_remote_writes() operation on a stream (this is one of the "batch stream memory operations")
  • #450 Fixed an allocation-without-release in cuda::memory::virtual_::set_access_mode
  • #449 apriori_compiled_kernel_t::get_attribute() was missing an inline decoration
  • #448 cuda::profiling::mark::range_start() and range_end() were calling create_attributions() the wrong way

Cleanup and warning avoidance

  • #443 Aligned member initialization order(s) in array_t with their declaration order.

Compatibility

  • #462 Can now obtain a pointer's device in CUDA 9.x (not just 10.0 and later)
  • #304 Some CUDA 9.x incompatibilities have been fixed

Other changes

  • #471 Made a few more comparison operators constexpr