What's Changed
This release includes a couple of new features in cuDecomp, minor performance improvements, and documentation fixes. Packing kernels for the pipelined backends can now be captured in CUDA graphs, reducing launch latency in some cases; this is enabled via the new environment variable CUDECOMP_ENABLE_CUDA_GRAPHS (see #68 for more details). This release also adds a new performance reporting feature that reports timing breakdowns of the communication and local processing time of transpose and halo operations launched in a user workload; it is enabled via the new environment variable CUDECOMP_ENABLE_PERFORMANCE_REPORT (see #75 for more details).
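As a quick sketch, both features can be toggled through the environment before launching your application. The variable names come from this release; the value `1` and the application/launcher invocation below are illustrative assumptions, not taken from the cuDecomp documentation:

```shell
# Capture packing kernels for pipelined backends in CUDA graphs (see #68).
# Assumed convention: set to 1 to enable.
export CUDECOMP_ENABLE_CUDA_GRAPHS=1

# Report timing breakdowns of transpose/halo communication and local
# processing time (see #75). Assumed convention: set to 1 to enable.
export CUDECOMP_ENABLE_PERFORMANCE_REPORT=1

# Launch the application as usual (hypothetical binary name).
mpirun -n 4 ./my_cudecomp_app
```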
Breaking changes
None.
Deprecations
None.
PRs included in this release
- Improve packing kernel launch efficiency for pipelined backends using CUDA graphs. (#68)
- Fix Fortran documentation of transpose_halo_extents and transpose_padding autotuning options. (#70)
- Better guarding of calls to cudecompAlltoall and cudecompAlltoallPipelined in transpose implementation. (#71)
- Upgrade C++ standard for builds to C++17 for libcu++ compatibility. (#72)
- Improve initial NVSHMEM team synchronization for transpose ops. (#74)
- Improvements to NCCL communicator management. (#73)
- Fix non-nvshmem builds. (#76)
- Use MPI_Alltoall instead of MPI_Alltoallv if able. (#77)
- Fix undefined NVML_GPU_FABRIC_UUID_LEN usage for builds against older CUDA toolkits. (#78)
- Add basic CI for compilation testing and code format checks. (#79)
- Pin clang format version for CI. (#80)
- Add performance reporting feature. (#75)
- Fix C++ std::filesystem linking for older GCC toolchains. (#81)
- Documentation build updates and automation. (#82)
- Improve NVSHMEM backend CE scheduling. (#83)
Full Changelog: v0.5.0...v0.5.1