Releases: eyalroz/cuda-api-wrappers
Version 0.8.0 Release Candidate 1: nvFatbin, CUDA 12.x features, sync-async op unification, static targets, etc.
Changes since v0.7.1:
Support for the nvFatbin library (#681)
- The API wrappers now support nvFatbin, NVIDIA's "fat binary" file format creation/marshalling library. It is supported via a `cuda::fatbin_builder_t` class: one creates a builder, adds various fragments of fatbin-contained content (cubin, PTX, LTO IR etc.), then finally uses the `build()` or `build_at()` method to obtain the completed, final fatbin file data, in a region of memory.
- The project's CMake now exports a new target, `cuda-api-wrappers::fatbin`, which one should depend on when actually using the builder.
- NVIDIA has not fully documented this library, so some functionality is not fully articulated, and some is only partially supported (specifically, passing extra options when adding LTO IR or PTX).
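For orientation, here is a rough sketch of the workflow; only `cuda::fatbin_builder_t` and its `build()`/`build_at()` methods are taken from the notes above, while the creation call, the `add_*` method names and the target-architecture arguments are assumptions for illustration:

```cpp
#include <cuda/api.hpp>

void make_fatbin(cuda::span<const char> ptx_source, cuda::span<const char> cubin_data)
{
    auto builder = cuda::fatbin_builder::create();   // assumed factory function
    builder.add_ptx(ptx_source, 75);                  // assumed method name and target-SM parameter form
    builder.add_cubin(cubin_data, 86);                // assumed method name and target-SM parameter form
    auto fatbin = builder.build();  // obtain the finished fatbin contents, in a region of memory
    // ... write `fatbin` out to a file, or load it as a module/library ...
}
```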
Support for more CUDA 12.x features
- #669: Can now obtain the kernels available in a given `cuda::module_t`, with the method `unique_span<kernel_t> get_kernels() const`.
- #670: Can now obtain a kernel's name, and the module containing it, via the kernel's handle. However, only the mangled kernel name is accessible, so the method is named accordingly: `kernel_t::mangled_name()` (regards #674).
- #675: Can now query CUDA's module loading mode (lazy or eager)
(Note these features are not accessible if you're using the wrappers with CUDA 11.x)
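A short sketch of using #669 and #670/#674 together; `get_kernels()` and `mangled_name()` are the method names given above, while the module-loading call is just an illustrative assumption:

```cpp
#include <cuda/api.hpp>
#include <iostream>

void list_module_kernels(const char* path_to_fatbin)
{
    auto module = cuda::module::load_from_file(path_to_fatbin); // assumption: loading a module from a file path
    for (const cuda::kernel_t& kernel : module.get_kernels()) { // #669: enumerate the module's kernels
        std::cout << kernel.mangled_name() << '\n';             // #670/#674: only the mangled name is accessible
    }
}
```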
More unique_span class changes
Like a recently-cut gem, which one slowly polishes until it gains its proper shine... we had some work on `unique_span` in version 0.7.1 as well, and it continues in this version:
- #678: The deleter is now instance-specific, so it is possible to allocate in more than one way (depending, say, even on the span size) - and also to decouple the use of such unique_spans from the allocation decisions. Also, the deleter now takes a span, not just a pointer, so it can make decisions based on the allocation size.
- #665:
  - Simplified the `swap()` implementation
  - Removed some redundant code
  - Shortened some code
  - Can now properly convert from a span of `T` to a span of `const T`
  - Neither `release()` nor our move construction can be `noexcept` - removed that marking, which was based only on optimism
optional_ref & partial unification of async and non-async memory operations
- #691: Added an `optional_ref` class, for passing optional arguments which are references. See this blog post by Foonathan about the problems of putting references in C++ optionals.
- #689: Memory-related operations which had both a `cuda::memory::foo()` and a `cuda::memory::async::foo()` variant now have a single variant, `cuda::memory::foo()`, which takes an extra `optional_ref<stream_t>` parameter: when it's not set, the operation is synchronous; when it is set, the operation is asynchronous and scheduled on the stream. (But note the "fine print" regarding synchronous and asynchronous CUDA operations in the Runtime and Driver API reference documents.)
- #688: Can now asynchronously copy 2D and 3D data using "copy parameters" structures
- #687: The synchronous and asynchronous variants of `copy_single()` had disagreed - one took a pointer, the other a reference. With their unification, they now agree (and take a pointer).
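As an illustration of the unified interface, here is a minimal sketch; the allocation and stream-creation calls follow the library's typical usage, but their exact signatures are not taken from these notes:

```cpp
#include <cuda/api.hpp>
#include <vector>

void copy_example()
{
    auto device = cuda::device::current::get();
    auto stream = device.create_stream(cuda::stream::async);

    std::vector<float> host_data(1024, 1.0f);
    auto device_buffer = cuda::memory::device::make_unique<float[]>(device, host_data.size());
    auto num_bytes = host_data.size() * sizeof(float);

    // No stream argument - the copy is synchronous
    cuda::memory::copy(device_buffer.get(), host_data.data(), num_bytes);

    // With a stream argument - the copy is asynchronous, enqueued on that stream
    cuda::memory::copy(device_buffer.get(), host_data.data(), num_bytes, stream);
    stream.synchronize();
}
```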
Bug fixes
Poor man's optional class
- #664, #666: Tweaked the class to avoid build failure in MSVC
- #676: `value_or()` now returns a value...
- #682: `value_or()` is now `const`
In example programs
- #672: The simpleCudaGraphs example program was crashing due to gratuitously setting certain launch configuration parameters
- #673: Fixed a potential use-before-initialization in the simpleIPC example
Other changes
Build mechanism
- #699: Now exposing targets with a `_static` suffix, which in turn depend on the static versions of the CUDA libraries, when those are available. For example, we now have both `cuda-api-wrappers::rtc` and `cuda-api-wrappers::rtc_static`.
- #694: Now properly building fatbin files on systems with multiple GPUs of different compute capabilities
In the wrapper APIs themselves
- #667: Added missing `noexcept` designators to some dimension-class methods
- #671 : A bit of macro renaming to avoid clashing with other libraries
- #684: Taking more linear sizes as `size_t`'s in `launch_config_builder_t`'s methods - so as to prevent narrowing-cast warnings, checking the limits ourselves instead.
- #686: When loading a kernel from a library, one can now specify the context in which to obtain the kernel.
- #698: Added a shortcut function for getting the default device: `cuda::device::default()`.
Version 0.7.1: Work on unique_span, minor bug fixes and tweaks
Changes since v0.7.0:
CUDA Graphs
- #532 Now supporting empty nodes as `graph::typed_node`'s.
- #656 When capturing a graph on a CUDA stream, default to using the global capture mode
- #657 Offer stand-alone functions for beginning and ending stream capture (as opposed to only `stream_t` class methods)
unique_span class changes
Remember: `unique_span<T, Deleter>` is like a `unique_ptr<T, Deleter>`, but with a specified size, and with interoperability with `std::span`/`cuda::span`.
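For orientation, a minimal usage sketch; the `make_unique_span()` allocation call is an assumption used for illustration - the point is the combination of unique ownership, a known size, and conversion to a span:

```cpp
#include <cuda/api.hpp>

void unique_span_example()
{
    auto device = cuda::device::current::get();
    auto data = cuda::memory::device::make_unique_span<float>(device, 1024); // assumed allocation call

    cuda::span<float> view = data;  // non-owning view: pass this to kernels, copy calls, etc.
    // ... use `view` wherever a span is expected ...
    // the device allocation is released when `data` goes out of scope
}
```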
- #662 `unique_span` will no longer be constructible from unique_spans with other deleter classes via an implicit conversion into a span; also, it will no longer be constructible from untyped memory regions.
- #660 `unique_span` assignment operator bug fixes.
- #652 `unique_span` is now default-constructible.
Other changes
- #651 Can now access the default stream of a non-primary context via `stream_t`
- #653 Can now launch in an arbitrary CUDA context without referring to a specific stream (i.e. using the default stream for the context)
- #654 The logic of `context_t::create_module()` is now the same as for `context::module::create()`
- #659 The methods `context_t::create_stream()` and `context_t::create_event()` are now marked `const`
Version 0.7.0: Graph support
Changes since v0.6.9:
Graph support
The wrappers library now supports the creation and manipulation of CUDA graphs - meriting a version number bump.
One can now:
- Construct graphs (= graph templates) directly
- Capture graphs (= graph templates) on streams
- Instantiate and launch graph templates
... all using a more convenient interface, similar to non-graph CUDA API calls. Two examples of this kind of code have been added, both adaptations of NVIDIA CUDA samples.
The main class templates are: `template_t`, `node_t`, `typed_node_t`, `instance_t` - all in namespace `cuda::graph`.
Most, but not all, graph capabilities are supported.
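For a rough sense of the stream-capture route, here is a sketch; only the class names come from the list above, while the capture, instantiation and launch method names used here are assumptions for illustration, not the wrappers' verified API:

```cpp
#include <cuda/api.hpp>

void enqueue_my_kernels_and_copies(cuda::stream_t& stream); // hypothetical helper defined elsewhere

void graph_capture_sketch(cuda::stream_t& stream)
{
    stream.begin_capture();                   // assumption: start capturing enqueued work into a graph template
    enqueue_my_kernels_and_copies(stream);    // the work to be captured
    cuda::graph::template_t graph = stream.end_capture(); // assumption: obtain the captured template

    auto instance = graph.instantiate();      // assumption: create an executable graph instance
    instance.launch(stream);                  // assumption: launch the instance on a stream
    stream.synchronize();
}
```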
Other minor changes
- #649: Respect deprecation of shared memory bank size setting as of CUDA 12.3
Build-related
- Avoiding more MSVC compilation warnings
Version 0.6.9: Documentation update, unique_span's, bug fixes, many small improvements
(This is planned to be the last release before 0.7.0, which will add support for CUDA graphs.)
Changes since v0.6.8:
Memory allocation & copying-related changes
- #606 Can now copy directly to and from containers with contiguous storage - without going through pointers or specifying the size
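A sketch of what this enables; the allocation call is illustrative, and the assumption here is that copies between a container and a raw device pointer are among the supported overloads:

```cpp
#include <cuda/api.hpp>
#include <vector>

void container_copy_example()
{
    auto device = cuda::device::current::get();
    std::vector<int> host_values(256, 42);
    auto device_buffer = cuda::memory::device::make_unique<int[]>(device, host_values.size());

    cuda::memory::copy(device_buffer.get(), host_values);  // host-to-device: size taken from the container
    cuda::memory::copy(host_values, device_buffer.get());  // device-to-host, back into the container
}
```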
Owning typed and untyped memory: unique_span and unique_region
- #291 Added a `unique_span<T>` template class, combining the functionality of `cuda::unique_ptr` and `cuda::span` (and being somewhat similar to `std::dynarray`, which almost made it into C++14). Many CUDA programs want to represent both the ownership of allocated memory, and the range of that memory for actual use, in the same variable - without the on-the-fly reallocation behavior of `std::vector`. This is now possible. Also implemented an untyped version of this, named `unique_region`.
- #617 Replaced `memory::external::mapped_region_t` with `memory::unique_region`
- #601 Added an `empty()` method to `cuda::span` (to match that of `std::span`), as it is now sometimes used
- #603 Use `unique_span` instead of our `cuda::dynarray` (which had been an `std::vector` under the hood) in various places in the API, especially RTC
- #610 Return `unique_span`'s from the `cuda::rtc::program_output` class methods which allocated their own buffers: the methods for getting the compilation log, the cubin data, the PTX and the LTO IR.
More robust memory regions (`memory::region_t`)
- #592 Changed the approach used in v0.6.8 to bring managed regions and general regions in line with each other; now, `memory::managed::region_t` inherits from `memory::region_t`
- #594 Now using `memory::region_t` for mapped memory, rather than a different, mapped-memory-specific region class
- #602 Make `memory::region_t` more constexpr-friendly
- #604 `memory::region_t`'s are now CUDA-independent, i.e. they do not utilize any CUDA-specific definitions
- #605 Can now construct `const_region_t`'s from rvalue references to regions
- #640 The user no longer needs to know about `range_attribute_t` or `advice_t` - those are left to `detail_` namespaces; also, fixed the implementation of attribute setting for non-device-specific attributes
- #647 Mapped memory: Can now implicitly convert `memory::mapped::span_pair_t<T>` into a pair of `region_t`'s
Documentation & comments
- #595 Corrected the documentation for `supports_memory_pools`
Launch configuration & launch config builder changes
- #596 Corrected a check against the associated device in the kernel-setting method of the launch config builder
- #619, #618 Fixed launch configuration comparisons, and now using a defaulted comparison
- #619 Fixed a bug in checking whether some CUDA-12-introduced launch config parameters are set
CUDA libraries and in-library, non-associated kernel support
- #598 Corrected the API and implementation of `get_attribute()` and `set_attribute()` for library kernels
Internal refactoring
- #607 Split off a `detail/type_traits.hpp` from `types.hpp`
- #620 `context::current::scoped_override_t` is now declared in `current_context.hpp`
- #611 Reduced code repetition between `context_t` and `primary_context_t`.
- #622 `link::marshalled_options_t` and `link::option_t` are now in the `detail_` namespace - the user should typically never need to use them directly
- #624 Now collecting the log-related link options into a sub-structure of `link::options_t`
- #625 Dropped `specify_default_load_caching_mode` from `link::options_t`, in favor of using an `stdx::optional`
- #626 Now using optionals instead of bespoke constructs in `pci_location_t`
- #628 Corrected the signature of the `context::current::peer_to_peer` functions
- #630 Moved `program_base_t` into the `detail_` namespace
- #632 Moved `rtc::marshalled_options_t` and `rtc::marshal()` into the `detail_` subnamespace - users should not need to use these themselves
- #643 Moved `memory::pool::ipc::ptr_handle_t` out of `ipc.hpp`, up into `types.hpp` (so that `memory_pool.hpp` doesn't depend on `ipc.hpp`)
- #621 Renamed: `link::fallback_strategy_t` -> `link::fallback_strategy_for_binary_code_t`
- #600 Now adhering to an underscore suffix for proxy class field names
Other changes
- #599 An invalid file, `name_caching_program.hpp`, had snuck into our code - removed it
- #609 "Robustified" the buffers returned from `cuda::rtc::program_output`'s various methods, so that they are all padded with an extra '\0' character past the end of the span's actual range. This padding should not actually need to be read - but let's be on the safe side.
- #627 Dropped context-specification from host-memory allocator functions - it's not actually used
- #629 Added a device ID field to the texture view object
- #631 Dropped `examples/rtc_common.hpp`, which is no longer needed
- #638 Dropped the `native_word_t` type
- #639 Simplified the memory access permissions code somewhat + some renaming (`access_permissions` -> `permissions`)
- #637 Devices and contexts no longer have `flag`-related members in their public interface; these are now just implementation details
- #645 Bug fix: Now using the correct `free()` function in `memory::managed::detail_::deleter`
Version 0.6.8: CUDA 12.x features, launch config builder improvements, etc.
Changes since v0.6.7:
Build process & build configuration changes
- #583 Exported targets now make sure apps link against some system libraries CUDA depends on (to circumvent CMake bug 25665).
- #567 The runtime-and-driver target now has `driver-and-runtime` as an alias
- #590 Avoid compiler warnings about narrowing conversions and shadowing (when those warning flags are turned on)
Launch configuration & launch config builder changes
- #564 Can now launch kernels with the full range of CUDA 12.x launch attributes, including remote memory sync domain, programmatic launch dependence and programmatic completion events (see descriptions in the CUDA Driver API documentation).
- #484 Support for setting block cluster dimensions (as part of the support of CUDA 12.x launch attributes).
- #577, #582 More extensive validation of launch configurations when building them with the launch config builder gadget.
- #581 More robust comparison operators for dimension structures
- #580 The launch config builder can now be told to "use the maximum number of active blocks per multiprocessor" (a short usage sketch of the builder follows this list)
- #579 In the launch config builder, the user can now set a target device without setting a contextualized kernel
- #578 Now using the launch config builder in more of the example programs
- #569 Took care of an unused validation function which was triggering a warning with newer compilers
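Here is the promised short sketch of the builder in use, in the style of the library's example programs; the builder and `cuda::launch()` calls follow the library's documented usage, though the exact method set may differ between versions:

```cpp
#include <cuda/api.hpp>
#include <cstddef>

__global__ void scale(float* data, std::size_t n, float factor); // assumed to be defined elsewhere

void launch_example(float* device_data, std::size_t n)
{
    auto launch_config = cuda::launch_config_builder()
        .overall_size(n)   // total number of threads needed; the grid size is derived
        .block_size(256)   // threads per block
        .build();

    cuda::launch(scale, launch_config, device_data, n, 2.0f);
}
```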
CUDA libraries and in-library, non-associated kernel support
- #565 Now supporting "CUDA libraries" - files, or blocks of data in memory, containing device-and-context-independent compiled kernels which are not immediately loaded into modules within contexts. Both such libraries and their kernels can now be represented and worked with.
- #576 A module no longer holds the link options it was created with; those are not essential to its use, and at times are impossible to (re)create when obtaining a module from a library (which also doesn't hold its link options).
Refactoring
- #586 The poor man's span class now has its own file (`detail/span.hpp`)
- #588 Some under-the-hood refactoring of host memory allocation functions
- #589 Factored the `cuda::memory::region_t` class out into its own file
Other changes
- #593 Some work on the `cuda::memory::copy_parameters_t` structure
- #592 Dropped `memory::managed::region_t` and its `const_region_t`, and now just using `memory::region_t` and `const_region_t` everywhere
- #591 Memory copy functions for spans, and other work on memory copy functions
- #587 Added a missing variant of memory-zero'ing
- #585 You can now write `cuda::memory::make_unique()` - and it's assumed you mean device memory (you do have to specify a device or device context, though); see the sketch after this list
- #575 Moved `apriori_compiled_kernel_t` into the `kernel` namespace, yielding `kernel::apriori_compiled_t`
- #570, #573 Removed some redundant inclusions and definitions
- #568 Fixed some breakage of `kernel_t::set_attribute()`
- #566 Can now properly get and set properties on kernels (raw functions and handles)
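A minimal sketch of the #585 shortcut, assuming the common "array of N elements" form of `make_unique`:

```cpp
#include <cuda/api.hpp>

void make_unique_example()
{
    auto device = cuda::device::current::get();
    // Device memory is assumed; the device (or a device context) must still be specified
    auto buffer = cuda::memory::make_unique<float[]>(device, 4096);
    // buffer.get() is the raw device pointer; the allocation is freed when buffer goes out of scope
}
```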
Compatibility
- #572 Fixed broken CUDA 9.x compatibility
Version 0.6.7: Build & compatibility improvements
Changes since v0.6.6:
Build process & build configuration changes
- #555 : No longer enabling any languages when configuring the project - it is not strictly necessary at that point.
- #557 : Now setting EXPORT_ALL_SYMBOLS as a target property rather than globally (relevant for Windows builds mostly).
- #558 : Now avoiding incompatibility with CMake 3.25-3.28, by avoiding adding a dependency to a CUDA:: target, and not trying to recreate it if it already exists.
- #562 Slightly streamlined the package config file (and no longer preprocessing it).
- #563 Now enabling the C++ language before declaring the dependence (with `find_package()`) on the Threads library.
Compatibility
- #506: Windows builds no longer fail due to a missing `cudaProfiler.h`
- #504: Builds on Windows should now also succeed when using cooperative groups.
Other changes
- #556 Moved the `CompileWithWarnings.cmake` module into the `examples/` subfolder, which is the only place it's used.
- #561 Dropped an unused function and method in the launch config builder source code and class, respectively, and put to use some previously-orphaned validation logic when obtaining the grid and block dimensions, so as not to ignore the overall dimensions.
- Want to help me with testing 0.7? Drop me a line... (it will have CUDA execution graph support)
- Have you tried this version and are satisfied with the changes? Thumb-up just below this line to let others know as well.
Version 0.6.6: Minor bug fixes + compatibility improvements
Changes since v0.6.4:
Functionality improvements
- #545 Now checking, and throwing, errors from `cudaGetLastError()` after kernel launches (mostly grid errors)
- #547 When compiling in debug mode, now performing more launch configuration validity checks before launching a kernel
- #549 Avoiding some excessive device property querying.
Bug fixes
- #539, #544 NVRTC compilation logs are now returned without a trailing nul (`'\0'`) character.
- #542 More robust use of namespaces in the library's macros, so they don't trigger compilation errors regardless of the namespace of the code you use them in
- #543 Now retrieving correct error strings again for Runtime-API-only errors
- #550 Fixed a wrong-sided comparison in some block configuration logic of the launch config builder
- #553 Replaced inappropriate use of `cbegin()` and `cend()` in favor of `begin()` and `end()` in `rtc::program::add_headers()` code, which may take inputs lacking those two methods.
Compatibility
- #546 Resolved a Windows build failure with `rtc.hpp` - fixed an overload resolution issue regarding compilation parameter marshalling.
Want to help me with testing 0.7? Drop me a line... (it will have CUDA execution graph support)
Version 0.6.4: Minor changes and bug fixes
Changes since v0.6.3:
Features
- #528 Can now register memory areas with the CUDA driver as read-only (in addition to other parameters)
- #519 Quick & somewhat-dirty support for the use of "external memory resources" (mostly Direct3D or NVSCI buffers; not including semaphores etc. for now)
Bug fixes
- #535 `copy_parameters_t::set_endpoint_untyped()` now properly calls an inner `set_endpoint_untyped()`
- #527 `copy_parameters_t::set_single_context` no longer mistakenly takes an `endpoint_t` parameter
- #528 When mapping a region pair, no longer insisting on the same address on both the host and the device (which had made it impossible for this to succeed with older GPUs).
- #521 Avoid a compiler warning when overriding the context for the current scope
- #520 Removed unnecessary uses of a context-wrapper scoped-context-override
- #517 `cuda::memory::typed_set()` no longer mistakenly accepts values of size 8 (which have no special CUDA API call).
- #516 Corrected types and casting in `stream_t::enqueue_t::write_single_value()`
- #515 Resolved a case of missing includes when including only certain headers
- #514 Now providing a definition of `cuda::memory::managed::allocate(device, num_bytes)` (which was declared but not defined)
Other changes
- #522 Renamed `synch` to `sync` in multiple identifiers
- #529 `program_t::add_registered_globals()` can now take any container of any string-like type.
- #523 Now passing `device_t`'s by const-ref in more cases, avoiding copying and enabling re-use of a reference to the primary context.
- #521 Reduced boilerplate + avoided a warning when overriding the context for the current scope
Build issues
- #504 Fixed build failure with cooperative_groups on GitHub Actions Windows runners and CUDA >= 11.7
Want to help me with testing 0.7? Drop me a line... (it will have CUDA execution graph support)
Version 0.6.3: Memory pool support, compatibility improvements, minor changes
Changes since v0.6.2:
Memory pool support (#249) and related changes
- Added a `cuda::memory::pool_t` proxy class
- Memory pools are created using `cuda::memory::pool::create()` or via device methods (see the sketch after this list)
- IPC: Can import (and export) pools and their allocations to/from other processes
- #485 Moved the `physical_allocation` namespace up from `memory::virtual_` into `memory`, as it is also used for memory pools
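The promised sketch of pool usage; only `cuda::memory::pool::create()` is taken from this list, while the allocation call and its stream-ordered form are assumptions for illustration:

```cpp
#include <cuda/api.hpp>

void pool_sketch()
{
    auto device = cuda::device::current::get();
    auto pool = cuda::memory::pool::create(device);   // create a memory pool on this device
    auto stream = device.create_stream(cuda::stream::async);

    auto allocation = pool.allocate(stream, 1024 * sizeof(float)); // assumption: stream-ordered allocation from the pool
    // ... enqueue work using `allocation` on `stream`, then free it (also stream-ordered) ...
}
```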
Bug fixes
- #508 With CUDA >= 11.3, we no longer give up on creating a module from a compiled program just because no CUBIN is available - instead, we try to use the PTX, as with earlier CUDA versions.
- #493 `cuda::launch_config_builder_t::overall_size()` now takes a `cuda::grid::overall_dimension_t` rather than a `cuda::grid::dimension_t` (these are not of the same size).
- #492 Avoiding inclusion of `cooperative_groups.h` from within the API headers
Other API changes
- #511 Can now create CUDA runtime errors with a fully-user-overridden `what()` message (probably not very interesting outside the API's internals).
- #510 When the NVRTC/PTX compiler complains about an invalid option, we now include the options passed in the thrown exception
- #499 No longer exposing deprecated surface-related API functions with CUDA 12 and later.
- #498 The launch config builder class now supports `num_blocks()` as an alias for the `grid_size()` method.
- #488 `cuda::memory::host::allocate()` now returns a `cuda::memory::region_t`, for better consistency.
- #486 Some changes to `cuda::kernel::wrap`.
- Renamed `cuda::memory::attribute_value_type_t` -> `cuda::memory::attribute_value_t`
- #483 It's now easier to convert `memory::region_t`'s into typed spans.
- #482 Improvements to the built-in `cuda::span` (which is used when `std::span` is unavailable), making it somewhat more compatible with `std::span`
- Made more comparison operators `constexpr` and `noexcept`
Compatibility
- The wrappers now build correctly (again) with `--std=c++20`.
- #501 Added a new NVRTC error code introduced in CUDA 12.1
- #500 When using CUDA 12, use the term "LTO IR" rather than "NVVM", as appropriate
- #494 Work around an MSVC issue with variadic template-templates
- #491 Avoiding some warnings issued by MSVC
- #480 Added an example program built with each C++ version after C++11 that the compiler supports
Build issues
- Now requiring CMake version 3.25. You can download an up-to-date version from Kitware's website; it doesn't require any special installation.
- #490 Switched from depending on `CUDA::nvToolsExt` to depending on `CUDA::nvtx`, for CUDA versions 10.0 and above.
Version 0.6.2: Stream callback mechanism change
Changes since v0.6.1:
The most significant change in this version regards the way callbacks/host functions are supported. This change is motivated mostly as preparation for the upcoming introduction of CUDA graph support (not in this version), which will impose some stricter constraints on callbacks - precluding the hack we have been using so far.
So far, a callback was any object invokable with a `cuda::stream_t` parameter. From now on, we support two kinds of callback:
- A plain function - not a closure - which may be invoked with a pointer to an arbitrary type: `cuda::stream_t::enqueue_t::host_function_call(Argument* user_data)`
- An object invokable with no parameters - a closure, to which one cannot provide any additional information: `cuda::stream_t::enqueue_t::host_invokable(Invokable& invokable)`
This lets us avoid the combination of heap allocation at enqueue and deallocation at launch - which works well enough for now, but will not be possible when the same callback needs to be invoked multiple times. It also contradicted our principle of not adding layers of abstraction over what CUDA itself provides.
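To make the two kinds concrete, here is a sketch; the enqueue call names follow the signatures quoted above, but the exact way the plain function and its user-data pointer are passed together is an assumption:

```cpp
#include <cuda/api.hpp>

struct my_data_t { int value; };

void my_plain_function(my_data_t* user_data)  // kind 1: a plain function taking a user-data pointer
{
    user_data->value += 1;
}

void callbacks_example(cuda::stream_t& stream, my_data_t& data)
{
    // Kind 1: a plain (non-closure) function, invoked with a pointer to arbitrary data
    stream.enqueue.host_function_call(my_plain_function, &data);  // argument arrangement assumed

    // Kind 2: an invokable taking no parameters, e.g. a lambda closure
    auto closure = [&data] { data.value *= 2; };
    stream.enqueue.host_invokable(closure);

    stream.synchronize();  // ensure both callbacks have run before `closure` goes out of scope
}
```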
Of course, the release also has the "usual" long list of minor fixes.
Changes to existing API
- #473 Redesign of host function / callback enqueue and launch mechanism, see above
- #459 `cuda::kernel::get()` now takes a device, not a context - since it can't really do anything useful for non-primary contexts (the primary context being where apriori-compiled kernels are available)
- #477 When creating a new program, we default to assuming it's CUDA C++, and do not require an explicit specification of that fact.
API additions
- #468 Added a non-CUDA memory type enum value; can now check the memory type of any pointer without throwing an error.
- #472 Can now pass `cuda::memory::region_t`'s when enqueueing copy operations on streams (and thus also `cuda::span<T>`'s)
- #466 Can now perform copies using `cuda::memory::copy_parameters_t<N>` (for N = 2 or 3), a wrapper of the CUDA driver's richest parameters structure, with multiple convenience functions, for maximum configurability of a copy operation. But - this structure is not currently "fool-proof", so use it with care and initialize all relevant fields.
- #463 Can now obtain a raw pointer's context and device without first wrapping it in a `cuda::pointer_t`
- #452 Support enqueuing a memory barrier on a stream (one of the "batch stream memory operations")
- Added a method to the launch configuration builder for indicating that no dynamic shared memory is used
Bug fixes
- #475 `device::get()` is no longer incorrectly marked as `noexcept`
- #467 The array-to-raw-memory copy function now determines the context for the target area, and a new variant of the function takes the context as a parameter.
- #455 Added the missing definition of `allocate_managed()` in `context.hpp`
- #453 Now actually setting the flags when enqueueing a `flush_remote_writes()` operation on a stream (this is one of the "batch stream memory operations")
- #450 Fixed an allocation-without-release in `cuda::memory::virtual_::set_access_mode`
- #449 `apriori_compiled_kernel_t::get_attribute()` was missing an `inline` decoration
- #448 `cuda::profiling::mark::range_start()` and `range_end()` were calling `create_attributions()` the wrong way
Cleanup and warning avoidance
- #443 Aligned member initialization order(s) in array_t with their declaration order.
Compatibility
- #462 Can now obtain a pointer's device in CUDA 9.x (not just 10.0 and later)
- #304 Some CUDA 9.x incompatibilities have been fixed
Other changes
- #471 Made a few more comparison operators `constexpr`