Releases: NVIDIA/cuDecomp

v0.5.1

13 Aug 20:09
b34e3f2

What's Changed

This release adds two new features to cuDecomp, along with minor performance improvements and documentation fixes. First, packing kernels for the pipelined backends can now be captured in CUDA graphs, which gives lower latency in some cases (enabled via the new environment variable CUDECOMP_ENABLE_CUDA_GRAPHS; see #68 for more details). Second, a new performance reporting feature reports timing breakdowns of the communication and local processing time of transpose and halo operations launched in a user workload (enabled via the new environment variable CUDECOMP_ENABLE_PERFORMANCE_REPORT; see #75 for more details).
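As a minimal sketch of opting in to both features from inside an application, the two environment variables can be set before the library is initialized. Two things here are assumptions not stated in these notes: that the value "1" enables a feature, and that cuDecomp reads the variables at initialization time. The helper name is hypothetical; consult #68, #75, and the documentation for the accepted values.

```cpp
#include <stdlib.h>  // setenv (POSIX), getenv

// Hypothetical helper, not part of cuDecomp: sets the two environment
// variables introduced in v0.5.1. ASSUMPTION: "1" means "enabled" and the
// library reads these variables when it is initialized, so this should be
// called before any cuDecomp initialization call.
// Returns true if both variables were set successfully.
bool enable_cudecomp_v051_features() {
  // The third argument (1) overwrites any value already in the environment.
  const int graphs_rc = setenv("CUDECOMP_ENABLE_CUDA_GRAPHS", "1", 1);
  const int report_rc = setenv("CUDECOMP_ENABLE_PERFORMANCE_REPORT", "1", 1);
  return graphs_rc == 0 && report_rc == 0;
}
```

Exporting the same variables in the job script or launcher environment has the same effect and avoids touching application code.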

Breaking changes

None.

Deprecations

None.

PRs included in this release

  • Improve packing kernel launch efficiency for pipelined backends using CUDA graphs. (#68)
  • Fix Fortran documentation of transpose_halo_extents and transpose_padding autotuning options. (#70)
  • Better guarding of calls to cudecompAlltoall and cudecompAlltoallPipelined in transpose implementation. (#71)
  • Upgrade C++ standard for builds to C++17 for libcu++ compatibility. (#72)
  • Improve initial NVSHMEM team synchronization for transpose ops. (#74)
  • Improvements to NCCL communicator management. (#73)
  • Fix non-nvshmem builds. (#76)
  • Use MPI_Alltoall instead of MPI_Alltoallv if able. (#77)
  • Fix undefined NVML_GPU_FABRIC_UUID_LEN usage for builds against older CUDA toolkits. (#78)
  • Add basic CI for compilation testing and code format checks. (#79)
  • Pin clang format version for CI. (#80)
  • Add performance reporting feature. (#75)
  • Fix C++ std::filesystem linking for older GCC toolchains. (#81)
  • Documentation build updates and automation. (#82)
  • Improve NVSHMEM backend CE scheduling. (#83)
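The idea behind #77 (preferring MPI_Alltoall over MPI_Alltoallv when possible) can be illustrated with a small check. This is a sketch under stated assumptions, not cuDecomp's implementation: the helper name is hypothetical, and it only tests whether one rank's counts and displacements describe equal-sized, contiguous blocks, which is the layout MPI_Alltoall expects.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from cuDecomp: returns true when the
// per-rank counts are all equal to n and block i starts at offset i * n,
// i.e. the layout matches what MPI_Alltoall assumes.
bool is_uniform_layout(const std::vector<int>& counts,
                       const std::vector<int>& displs) {
  if (counts.empty() || counts.size() != displs.size()) return false;
  const int n = counts.front();
  for (std::size_t i = 0; i < counts.size(); ++i) {
    // Any differing count or non-contiguous offset requires MPI_Alltoallv.
    if (counts[i] != n || displs[i] != static_cast<int>(i) * n) return false;
  }
  return true;
}
```

Because every rank in a communicator must call the same collective, a real implementation would also need all ranks to agree on the outcome (for both send and receive layouts) before switching to MPI_Alltoall.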

Full Changelog: v0.5.0...v0.5.1

v0.5.0

08 Apr 22:36
1083144

What's Changed

This release includes a number of major updates to cuDecomp. New features make cuDecomp more flexible for users: memory orderings can be customized by pencil axis via the new transpose_mem_order configuration option, and the transpose and halo update APIs now support input/output buffer padding. Support for multi-node NVLink (MNNVL) equipped clusters is also improved, with opt-in support for fabric-allocated cuDecomp workspace memory. Beyond this, the release includes expanded autotuning options and general improvements.

Breaking changes

  • #60 adds a new padding argument to several cuDecomp APIs: cudecompGetPencilInfo, cudecompTranspose*, and cudecompHaloUpdate* functions. This will require updates to existing C++ code and Fortran code (depending on usage). See #60 and documentation for more details.

Deprecations

  • The Makefile-based build has been removed.

PRs included in this release

  • Made it possible to include library header from pure C program (#40)
  • Adding Fortran version of Taylor Green example (#41)
  • Fix integer overflow issue with C++ TG example for large problems. (#42)
  • Benchmark updates (#43)
  • Use unique ID based NVSHMEM initialization method for newer NVSHMEM versions (#44)
  • Removing Makefile build support and related files. (#45)
  • Add missing preprocessor guards to fix compilation without NVSHMEM enabled. (#46)
  • Add small MPI_Alltoall after autotuning to work around MPI memory registration delaying cudaFree. (#47)
  • Address narrowing conversion errors/warnings. (#48)
  • Add new transpose_mem_order configuration argument to enable more flexible pencil memory layouts. (#49)
  • Add opt-in support for fabric-registered workspace allocations via cuMem* APIs. (#50)
  • Dynamically load CUDA driver functions at runtime. (#51)
  • Increase buffer size used in post-autotuning MPI_Alltoall. (#52)
  • Fix integer overflow issue in Fortran poisson example. (#53)
  • Extend transpose shortcut handling to cases with halos. (#54)
  • Fix bug in handling of NVSHMEM halo backends from recent change. (#55)
  • Improve multi-node NVLink topology detection and communication ordering using NVML utilities. (#56)
  • Fix CUDART_VERSION guard for nvmlDeviceGetGpuFabricInfoV to restrict usage to CUDA >= 12.4. (#57)
  • Silence messages about NVML symbols failing to load. (#58)
  • Improve tests (#59)
  • Preserve original user transpose_mem_order settings after grid descriptor creation. (#61)
  • Add support for padded input/output buffers in transpose and halo communication routines (#60)
  • Improvements to batched memcpy kernel implementation. (#62)
  • Remove redundant axis-contiguous/transpose_mem_order configurations from halo tests. Update axis-contiguous test configurations to not supply transpose_mem_order argument. (#63)
  • Add Blackwell (cc100) support to default builds when using CUDA 12.8 or newer. (#64)
  • C++ Taylor Green example updates. (#65)
  • Add new autotuning options to set per operation halo extent and padding arguments. (#66)

Full Changelog: v0.4.2...v0.5.0

v0.4.2

30 Oct 17:52
7703aa0

What's Changed

This patch release fixes several build-related issues: it updates the CMake include search paths for NVSHMEM 3.x support and corrects the naming of the single precision C2C benchmark executable. Other changes include small corrections to command line argument handling in the benchmark program and functionality updates to the Taylor Green example.

PRs included in this release

  • Update CMake NVSHMEM include search paths for NVSHMEM 3.x. (#34)
  • Fix integer conversion of skip_threshold in benchmark program. (#35)
  • Fix scaling overflow for large grids in R2C benchmark. Correct compilation defines for single precision C2C benchmark. (#37)
  • Taylor Green example updates. (#36)

Full Changelog: v0.4.1...v0.4.2

v0.4.1

20 Apr 23:44

What's Changed

This patch release fixes a bug, introduced in v0.4.0, in the handling of processor grid dimensions during autotuning when a fixed process grid is supplied.

PRs included in this release

  • Fix transposed pdims during autotuning. (#29)
  • Make CMake library include directory handling more robust. (#30)

Full Changelog: v0.4.0...v0.4.1

v0.4.0

14 Mar 19:30
b8ffecc

What's Changed

This release includes a new CMake build process, new and improved autotuning configuration options, and compilation fixes for newer NVHPC releases with CUTENSOR 2.0. This release also includes initial opt-in support for NCCL User Buffer registration.

Breaking changes

  • #21 changed the attribute transpose_use_inplace_buffers in cudecompGridDescAutotuneOptions_t from a single boolean value to an array of boolean values. This will require updates to C++ code using this autotuning option.
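The kind of update this requires can be sketched as follows. The struct below is a stand-in for illustration only, not the real cudecompGridDescAutotuneOptions_t (which is defined in the library header and has more fields). The array length of 4, one entry per transpose operation, is an assumption; the helper name is hypothetical. See #21 and the documentation for the actual definition.

```cpp
// Stand-in for illustration only. ASSUMPTION: one flag per transpose
// operation, four in total. The real struct lives in the cuDecomp header.
struct AutotuneOptionsSketch {
  bool transpose_use_inplace_buffers[4];
};

// Code written against the single-value form would have done something like
//   options.transpose_use_inplace_buffers = true;
// which no longer compiles once the attribute is an array. Hypothetical
// helper that reproduces the old "one setting for everything" behavior:
void use_inplace_for_all_transposes(AutotuneOptionsSketch& options, bool value) {
  for (bool& flag : options.transpose_use_inplace_buffers) {
    flag = value;  // apply the same choice to every transpose operation
  }
}
```

After this migration, individual entries can be set separately to choose in-place or out-of-place autotuning per operation, which is the flexibility #21 introduced.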

PRs included in this release

  • Allow to skip certain transpose operations during autotuning. (#16)
  • Remove unneeded 4 GPU restriction on Fortran autotune example. (#17)
  • Add CMake build (#15)
  • Enable autotuner to skip slow configurations via new skip_threshold option (#18)
  • Lowering CMake build optimization level for host code (#19)
  • Add support for CUTENSOR 2.0. (#20)
  • Enable per operation setting for in-place usage when autotuning. (#21)
  • Enable applying weights to individual transpose operation timings during autotuning (#22)
  • Add support for NCCL user buffer registration (#23)
  • Add MPI_Barrier call in NCCL initialization code. (#24)
  • Make CMake detection of NVHPC compilers more robust (#26)
  • Move NVSHMEM kernels into separate file to limit application of -rdc=true. (#28)

Full Changelog: v0.3.1...v0.4.0

v0.3.1

18 May 16:17
Pre-release

This patch release fixes the handling of large message sizes with the NVSHMEM backend.

Bugfixes:

  • Fixed handling of large message sizes in NVSHMEM backend. (#13)

v0.3.0

24 Apr 17:01
Pre-release

This release includes bug fixes in the handling of user-provided MPI communicators and processor grid configurations yielding empty pencils.

Bugfixes:

  • Fixed handling of user-provided MPI communicators. (#7)
  • Fixed handling of processor grid configurations yielding empty pencils. (#11, #12)

v0.2.0

07 Sep 19:22
Pre-release

This release includes some minor bug fixes and quality of life improvements.

Changes:

  • Renaming of optional arguments in Fortran interface. (#2)

Bugfixes:

  • Fixed indexing bug in cudecompGetShiftedRank in Fortran interface. (#1)
  • Fixed bug with NCCL resource reclamation when using multiple grid descriptors. (#4)