What's Changed
This release includes a couple of new features in cuDecomp, minor performance improvements, and documentation fixes. Packing kernels for the pipelined backends can now be captured in CUDA graphs, reducing launch latency in some cases; this is enabled via the new environment variable CUDECOMP_ENABLE_CUDA_GRAPHS (see #68 for more details). This release also adds a new performance reporting feature that reports timing breakdowns of the communication and local processing time of transpose and halo operations launched in a user workload; it is enabled via the new environment variable CUDECOMP_ENABLE_PERFORMANCE_REPORT (see #75 for more details).
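As a quick sketch, both features can be toggled through the environment before launching your application. The variable names come from this release; the value `1` and the application/launcher invocation below are illustrative assumptions, not taken from the cuDecomp documentation:

```shell
# Capture packing kernels for pipelined backends in CUDA graphs (see #68).
# Assumed convention: set to 1 to enable.
export CUDECOMP_ENABLE_CUDA_GRAPHS=1

# Report timing breakdowns of transpose/halo communication and local
# processing time (see #75). Assumed convention: set to 1 to enable.
export CUDECOMP_ENABLE_PERFORMANCE_REPORT=1

# Launch the application as usual (hypothetical binary name).
mpirun -n 4 ./my_cudecomp_app
```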
Breaking changes
None.
Deprecations
None.
PRs included in this release
- Improve packing kernel launch efficiency for pipelined backends using CUDA graphs. (#68)
- Fix Fortran documentation of transpose_halo_extents and transpose_padding autotuning options. (#70)
- Better guarding of calls to cudecompAlltoall and cudecompAlltoallPipelined in transpose implementation. (#71)
- Upgrade C++ standard for builds to C++17 for libcu++ compatibility. (#72)
- Improve initial NVSHMEM team synchronization for transpose ops. (#74)
- Improvements to NCCL communicator management. (#73)
- Fix non-nvshmem builds. (#76)
- Use MPI_Alltoall instead of MPI_Alltoallv if able. (#77)
- Fix undefined NVML_GPU_FABRIC_UUID_LEN usage for builds against older CUDA toolkits. (#78)
- Add basic CI for compilation testing and code format checks. (#79)
- Pin clang format version for CI. (#80)
- Add performance reporting feature. (#75)
- Fix C++ std::filesystem linking for older GCC toolchains. (#81)
- Documentation build updates and automation. (#82)
- Improve NVSHMEM backend CE scheduling. (#83)
Full Changelog: v0.5.0...v0.5.1