Releases · NVIDIA/cuDecomp

13 Aug 20:09

v0.5.1

b34e3f2

v0.5.1 Latest

What's Changed

This release adds two new features to cuDecomp, along with minor performance improvements and documentation fixes. First, packing kernels for the pipelined backends can now be captured in CUDA graphs, which gives lower latency in some cases (enabled via the new environment variable CUDECOMP_ENABLE_CUDA_GRAPHS; see #68 for more details). Second, a new performance reporting feature reports timing breakdowns of the communication and local processing time of transpose and halo operations launched in a user workload (enabled via the new environment variable CUDECOMP_ENABLE_PERFORMANCE_REPORT; see #75 for more details).
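
Both features are switched on through environment variables, so they are normally set in the job script. When that is inconvenient, they can also be set from the application itself. The following is a minimal sketch, not taken from the cuDecomp sources: it assumes that the value "1" enables each feature and that the variables must be set before the library is initialized (see #68 and #75 for the authoritative behavior). The helper names are hypothetical, and `setenv` is POSIX rather than standard C++.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: sets an environment variable only if the user has not
// already set it, so a value exported in the job script takes precedence.
bool set_env_if_unset(const char* name, const char* value) {
  assert(name != nullptr && value != nullptr);
  const int overwrite = 0;  // 0 = keep any existing value
  return setenv(name, value, overwrite) == 0;
}

// Hypothetical helper: requests both v0.5.1 opt-in features.
// Assumption: "1" means "enabled"; check #68 and #75 for the accepted values.
bool request_cudecomp_optin_features() {
  const bool graphs_ok = set_env_if_unset("CUDECOMP_ENABLE_CUDA_GRAPHS", "1");
  const bool report_ok = set_env_if_unset("CUDECOMP_ENABLE_PERFORMANCE_REPORT", "1");
  return graphs_ok && report_ok;
}
```

Under these assumptions, `request_cudecomp_optin_features()` would be called at the start of `main`, before any cuDecomp initialization call.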

Breaking changes

None.

Deprecations

None.

PRs included in this release

Improve packing kernel launch efficiency for pipelined backends using CUDA graphs. (#68)
Fix Fortran documentation of transpose_halo_extents and transpose_padding autotuning options. (#70)
Better guarding of calls to cudecompAlltoall and cudecompAlltoallPipelined in transpose implementation. (#71)
Upgrade C++ standard for builds to C++17 for libcu++ compatibility. (#72)
Improve initial NVSHMEM team synchronization for transpose ops. (#74)
Improvements to NCCL communicator management. (#73)
Fix non-nvshmem builds. (#76)
Use MPI_Alltoall instead of MPI_Alltoallv if able. (#77)
Fix undefined NVML_GPU_FABRIC_UUID_LEN usage for builds against older CUDA toolkits. (#78)
Add basic CI for compilation testing and code format checks. (#79)
Pin clang format version for CI. (#80)
Add performance reporting feature. (#75)
Fix C++ std::filesystem linking for older GCC toolchains. (#81)
Documentation build updates and automation. (#82)
Improve NVSHMEM backend CE scheduling. (#83)
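
As an illustration of the idea behind #77, the sketch below shows one plausible condition under which a variable-size exchange can be replaced by a fixed-size one: every peer gets the same element count and the blocks are laid out back to back. This is an assumption about the general technique, not the check cuDecomp actually performs, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns true when per-peer counts are all equal and the
// displacements describe tightly packed, in-order blocks. This is the layout
// that MPI_Alltoall assumes, so MPI_Alltoallv would be unnecessary.
bool is_uniform_contiguous(const std::vector<int>& counts,
                           const std::vector<int>& displs) {
  if (counts.size() != displs.size()) return false;
  if (counts.empty()) return true;  // nothing to exchange
  const int count = counts[0];
  assert(count >= 0);
  for (std::size_t i = 0; i < counts.size(); ++i) {
    if (counts[i] != count) return false;  // unequal message sizes
    if (static_cast<long long>(displs[i]) !=
        static_cast<long long>(i) * count) {
      return false;  // gaps or reordering between blocks
    }
  }
  return true;
}
```

In a real MPI program the check would have to hold for both the send and receive sides, and all ranks would need to agree on the outcome (for example via a reduction), since every rank must call the same collective.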

Full Changelog: v0.5.0...v0.5.1

08 Apr 22:36

romerojosh

v0.5.0

1083144

v0.5.0

What's Changed

This release includes a number of major updates to cuDecomp. It adds new features that make cuDecomp more flexible for users: more customizable memory orderings by pencil axis via the new transpose_mem_order configuration option, and support for input/output buffer padding in the transpose and halo update APIs. It also improves support for multi-node NVLink (MNNVL) equipped clusters with opt-in support for fabric-allocated cuDecomp workspace memory. Beyond this, the release includes expanded autotuning options and general improvements.

Breaking changes

#60 adds a new padding argument to several cuDecomp APIs: cudecompGetPencilInfo, cudecompTranspose*, and cudecompHaloUpdate* functions. This will require updates to existing C++ code and Fortran code (depending on usage). See #60 and documentation for more details.
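
To make the effect of a per-axis padding argument concrete, here is a rough sketch of how padded buffer extents could be computed. It is not the cuDecomp implementation: the helper names are hypothetical, and it assumes that the allocated extent along each axis is the interior size plus a halo on both sides plus the padding elements. Consult #60 and the documentation for the actual definition.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the cuDecomp API.
// Assumption: extent = interior + 2 * halo + padding along each axis.
std::array<int64_t, 3> padded_extents(const std::array<int32_t, 3>& interior,
                                      const std::array<int32_t, 3>& halo_extents,
                                      const std::array<int32_t, 3>& padding) {
  std::array<int64_t, 3> extents{};
  for (int i = 0; i < 3; ++i) {
    assert(interior[i] >= 0 && halo_extents[i] >= 0 && padding[i] >= 0);
    extents[i] = static_cast<int64_t>(interior[i]) +
                 2 * static_cast<int64_t>(halo_extents[i]) +
                 static_cast<int64_t>(padding[i]);
  }
  return extents;
}

// Total number of elements a buffer with these extents would need.
// The product is accumulated in 64 bits to avoid overflow on large pencils.
int64_t padded_element_count(const std::array<int64_t, 3>& extents) {
  return extents[0] * extents[1] * extents[2];
}
```

Existing code that passes no padding corresponds to a padding array of all zeros in this sketch.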

Deprecations

The Makefile-based build has been removed.

PRs included in this release

Made it possible to include library header from pure C program (#40)
Adding Fortran version of Taylor Green example (#41)
Fix integer overflow issue with C++ TG example for large problems. (#42)
Benchmark updates (#43)
Use unique ID based NVSHMEM initialization method for newer NVSHMEM versions (#44)
Removing Makefile build support and related files. (#45)
Add missing preprocessor guards to fix compilation without NVSHMEM enabled. (#46)
Add small MPI_Alltoall after autotuning to work around MPI memory registration delaying cudaFree. (#47)
Address narrowing conversion errors/warnings. (#48)
Add new transpose_mem_order configuration argument to enable more flexible pencil memory layouts. (#49)
Add opt-in support for fabric-registered workspace allocations via cuMem* APIs. (#50)
Dynamically load CUDA driver functions at runtime. (#51)
Increase buffer size used in post-autotuning MPI_Alltoall. (#52)
Fix integer overflow issue in Fortran poisson example. (#53)
Extend transpose shortcut handling to cases with halos. (#54)
Fix bug in handling of NVSHMEM halo backends from recent change. (#55)
Improve multi-node NVLink topology detection and communication ordering using NVML utilities. (#56)
Fix CUDART_VERSION guard for nvmlDeviceGetGpuFabricInfoV to restrict usage to CUDA >= 12.4. (#57)
Silence messages about NVML symbols failing to load. (#58)
Improve tests (#59)
Preserve original user transpose_mem_order settings after grid descriptor creation. (#61)
Add support for padded input/output buffers in transpose and halo communication routines (#60)
Improvements to batched memcpy kernel implementation. (#62)
Remove redundant axis-contiguous/transpose_mem_order configurations from halo tests. Update axis-contiguous test configurations to not supply transpose_mem_order argument. (#63)
Add Blackwell (cc100) support to default builds when using CUDA 12.8 or newer. (#64)
C++ Taylor Green example updates. (#65)
Add new autotuning options to set per operation halo extent and padding arguments. (#66)
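
The integer overflow fixes in the examples (#42, #53) concern large problem sizes. The snippet below illustrates the general class of bug rather than the exact change made in those PRs: multiplying three 32-bit grid dimensions overflows once the product exceeds the 32-bit range, so the operands are widened before multiplying. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of points in an nx * ny * nz grid.
// Casting the first operand to 64 bits makes the whole product 64-bit,
// whereas `nx * ny * nz` on plain ints would overflow for large grids.
int64_t global_element_count(int nx, int ny, int nz) {
  assert(nx >= 0 && ny >= 0 && nz >= 0);
  return static_cast<int64_t>(nx) * ny * nz;
}
```

For example, a 2048 x 2048 x 2048 grid has 2^33 points, which does not fit in a signed 32-bit integer.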

Full Changelog: v0.4.2...v0.5.0

30 Oct 17:52

romerojosh

v0.4.2

7703aa0

v0.4.2

What's Changed

This patch release fixes several build-related issues, including updating CMake include search paths for NVSHMEM 3.x support and correcting the improper naming of the single precision C2C benchmark executable. Other changes include small corrections to command line argument handling in the benchmark program and functionality updates to the Taylor Green example.

PRs included in this release

Update CMake NVSHMEM include search paths for NVSHMEM 3.x. (#34)
Fix integer conversion of skip_threshold in benchmark program. (#35)
Fix scaling overflow for large grids in R2C benchmark. Correct compilation defines for single precision C2C benchmark. (#37)
Taylor Green example updates. (#36)

Full Changelog: v0.4.1...v0.4.2

20 Apr 23:44

romerojosh

v0.4.1

38b6dea

v0.4.1

What's Changed

This patch release fixes a bug, introduced in v0.4.0, in the handling of processor grid dimensions during autotuning when a fixed process grid is supplied.

PRs included in this release

Fix transposed pdims during autotuning. (#29)
Make CMake library include directory handling more robust. (#30)

Full Changelog: v0.4.0...v0.4.1

14 Mar 19:30

romerojosh

v0.4.0

b8ffecc

v0.4.0

What's Changed

This release includes a new CMake build process, new and improved autotuning configuration options, and compilation fixes for newer NVHPC releases with CUTENSOR 2.0. This release also includes initial opt-in support for NCCL User Buffer registration.

Breaking changes

#21 changed the transpose_use_inplace_buffers attribute in cudecompGridDescAutotuneOptions_t from a single boolean value to an array of boolean values. This will require updates to C++ code using this autotuning option.
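
The shape of the required code change can be sketched with a stand-in struct. The real cudecompGridDescAutotuneOptions_t is not reproduced here. The struct below is hypothetical and assumes one flag per transpose operation, with four operations in total. The first helper reproduces the old single-value behavior by applying the same setting to every entry.

```cpp
#include <cassert>

// Assumption: four transpose operations, one flag each.
constexpr int kNumTransposeOps = 4;

// Hypothetical stand-in for the relevant field of the autotune options struct.
struct AutotuneOptionsSketch {
  bool transpose_use_inplace_buffers[kNumTransposeOps] = {false, false, false, false};
};

// Equivalent of the old scalar assignment: same setting for every operation.
void set_inplace_for_all_ops(AutotuneOptionsSketch& options, bool use_inplace) {
  for (int op = 0; op < kNumTransposeOps; ++op) {
    options.transpose_use_inplace_buffers[op] = use_inplace;
  }
}

// New capability: choose in-place usage for a single operation.
void set_inplace_for_op(AutotuneOptionsSketch& options, int op, bool use_inplace) {
  assert(op >= 0 && op < kNumTransposeOps);
  options.transpose_use_inplace_buffers[op] = use_inplace;
}
```

Code that previously assigned a single `true` or `false` would, under these assumptions, either fill every entry or pick individual entries per operation.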

PRs included in this release

Allow skipping certain transpose operations during autotuning. (#16)
Remove unneeded 4 GPU restriction on Fortran autotune example. (#17)
Add CMake build (#15)
Enable autotuner to skip slow configurations via new skip_threshold option (#18)
Lowering CMake build optimization level for host code (#19)
Add support for CUTENSOR 2.0. (#20)
Enable per operation setting for in-place usage when autotuning. (#21)
Enable applying weights to individual transpose operation timings during autotuning (#22)
Add support for NCCL user buffer registration (#23)
Add MPI_Barrier call in NCCL initialization code. (#24)
Make CMake detection of NVHPC compilers more robust (#26)
Move NVSHMEM kernels into separate file to limit application of -rdc=true. (#28)

Full Changelog: v0.3.1...v0.4.0

18 May 16:17

romerojosh

v0.3.1

3cbf456

v0.3.1 Pre-release

This patch release fixes the handling of large message sizes with the NVSHMEM backend.

Bugfixes:

Fixed handling of large message sizes in NVSHMEM backend. (#13)

24 Apr 17:01

romerojosh

v0.3.0

1e7c80e

v0.3.0 Pre-release

This release includes bug fixes in the handling of user-provided MPI communicators and processor grid configurations yielding empty pencils.

Bugfixes:

Fixed handling of user-provided MPI communicators. (#7)
Fixed handling of processor grid configurations yielding empty pencils. (#11, #12)

07 Sep 19:22

romerojosh

v0.2.0

e364ed5

v0.2.0 Pre-release

This release includes some minor bug fixes and quality of life improvements.

Changes:

Renaming of optional arguments in Fortran interface. (#2)

Bugfixes:

Fixed indexing bug in cudecompGetShiftedRank in Fortran interface. (#1)
Fixed bug with NCCL resource reclamation when using multiple grid descriptors. (#4)
