CUB 1.10.0 is the major release accompanying the NVIDIA HPC SDK 20.9 release. It drops support for C++03, GCC < 5, Clang < 6, and MSVC < 2017. It also overhauls CMake support. Finally, we now have a Code of Conduct for contributors: https://github.com/thrust/cub/blob/main/CODE_OF_CONDUCT.md
- C++03 is no longer supported.
- GCC < 5, Clang < 6, and MSVC < 2017 are no longer supported.
- C++11 is deprecated.
  Using this dialect will generate a compile-time warning.
  These warnings can be suppressed by defining `CUB_IGNORE_DEPRECATED_CPP_DIALECT` or `CUB_IGNORE_DEPRECATED_CPP_11`.
  Suppression is only a short term solution.
  We will be dropping support for C++11 in the near future.
- CMake < 3.15 is no longer supported.
- The default branch on GitHub is now called `main`.
- Added install targets to CMake builds.
- C++17 support.
- NVIDIA/thrust#1244: Check for macro collisions with system headers during header testing.
- thrust/thrust#1153: Switch to placement new instead of assignment to construct items in uninitialized memory. Thanks to Hugh Winkler for this contribution.
- thrust/cub#38: Fix `cub::DeviceHistogram` for `size_t` `OffsetT`s. Thanks to Leo Fang for this contribution.
- thrust/cub#35: Fix GCC-5 maybe-uninitialized warning. Thanks to Rong Ou for this contribution.
- thrust/cub#36: Qualify namespace for `va_printf` in `_CubLog`. Thanks to Andrei Tchouprakov for this contribution.
CUB 1.9.10-1 is the minor release accompanying the NVIDIA HPC SDK 20.7 release and the CUDA Toolkit 11.1 release.
- NVIDIA/thrust#1217: Move static local in cub::DeviceCount to a separate host-only function because NVC++ doesn't support static locals in host-device functions.
CUB 1.9.10 is the release accompanying the NVIDIA HPC SDK 20.5 release. It adds CMake `find_package` support. C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated. Starting with the upcoming 1.10.0 release, C++03 support will be dropped entirely.
- Thrust now checks that it is compatible with the version of CUB found in your include path, generating an error if it is not. If you are using your own version of CUB, it may be too old. It is recommended to simply delete your own version of CUB and use the version of CUB that comes with Thrust.
- C++03 and C++11 are deprecated.
  Using these dialects will generate a compile-time warning.
  These warnings can be suppressed by defining `CUB_IGNORE_DEPRECATED_CPP_DIALECT` (to suppress C++03 and C++11 deprecation warnings) or `CUB_IGNORE_DEPRECATED_CPP_11` (to suppress C++11 deprecation warnings).
  Suppression is only a short term solution.
  We will be dropping support for C++03 in the 1.10.0 release and C++11 in the near future.
- GCC < 5, Clang < 6, and MSVC < 2017 are deprecated.
  Using these compilers will generate a compile-time warning.
  These warnings can be suppressed by defining `CUB_IGNORE_DEPRECATED_COMPILER`.
  Suppression is only a short term solution.
  We will be dropping support for these compilers in the near future.
- CMake `find_package` support. Just point CMake at the `cmake` folder in your CUB include directory (ex: `cmake -DCUB_DIR=/usr/local/cuda/include/cub/cmake/ .`) and then you can add CUB to your CMake project with `find_package(CUB REQUIRED CONFIG)`.
CUB 1.9.9 is the release accompanying the CUDA Toolkit 11.0 release. It introduces CMake support, version macros, platform detection machinery, and support for NVC++, which uses Thrust (and thus CUB) to implement GPU-accelerated C++17 Parallel Algorithms. Additionally, the scan dispatch layer was refactored and modernized. C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated. Starting with the upcoming 1.10.0 release, C++03 support will be dropped entirely.
- Thrust now checks that it is compatible with the version of CUB found in your include path, generating an error if it is not. If you are using your own version of CUB, it may be too old. It is recommended to simply delete your own version of CUB and use the version of CUB that comes with Thrust.
- C++03 and C++11 are deprecated.
  Using these dialects will generate a compile-time warning.
  These warnings can be suppressed by defining `CUB_IGNORE_DEPRECATED_CPP_DIALECT` (to suppress C++03 and C++11 deprecation warnings) or `CUB_IGNORE_DEPRECATED_CPP_11` (to suppress C++11 deprecation warnings).
  Suppression is only a short term solution.
  We will be dropping support for C++03 in the 1.10.0 release and C++11 in the near future.
- GCC < 5, Clang < 6, and MSVC < 2017 are deprecated.
  Using these compilers will generate a compile-time warning.
  These warnings can be suppressed by defining `CUB_IGNORE_DEPRECATED_COMPILER`.
  Suppression is only a short term solution.
  We will be dropping support for these compilers in the near future.
- CMake support. Thanks to Francis Lemaire for this contribution.
- Refactored and modernized scan dispatch layer. Thanks to Francis Lemaire for this contribution.
- Policy hooks for device-wide reduce, scan, and radix sort facilities to simplify tuning and allow users to provide custom policies. Thanks to Francis Lemaire for this contribution.
- `<cub/version.cuh>`: `CUB_VERSION`, `CUB_VERSION_MAJOR`, `CUB_VERSION_MINOR`, `CUB_VERSION_SUBMINOR`, and `CUB_PATCH_NUMBER`.
- Platform detection machinery:
  - `<cub/util_cpp_dialect.cuh>`: Detects the C++ standard dialect.
  - `<cub/util_compiler.cuh>`: host and device compiler detection.
  - `<cub/util_deprecated.cuh>`: `CUB_DEPRECATED`.
  - `<cub/config.cuh>`: Includes `<cub/util_arch.cuh>`, `<cub/util_compiler.cuh>`, `<cub/util_cpp_dialect.cuh>`, `<cub/util_deprecated.cuh>`, `<cub/util_macro.cuh>`, `<cub/util_namespace.cuh>`.
- `cub::DeviceCount` and `cub::DeviceCountUncached`, caching abstractions for `cudaGetDeviceCount`.
- Lazily initialize the per-device CUDA attribute caches, because CUDA context creation is expensive and adds up with large CUDA binaries on machines with many GPUs. Thanks to the NVIDIA PyTorch team for bringing this to our attention.
- Make `cub::SwitchDevice` avoid setting/resetting the device if the current device is the same as the target device.
- Add explicit failure parameter to CAS in the CUB attribute cache to work around a GCC 4.8 bug.
- Revert a change in reductions that changed the signedness of the `lane_id` variable to suppress a warning, as this introduces a bug in optimized device code.
- Fix initialization in `cub::ExclusiveSum`. Thanks to Conor Hoekstra for this contribution.
- Fix initialization of the `std::array` in the CUB attribute cache.
- Fix `-Wsign-compare` warnings. Thanks to Elias Stehle for this contribution.
- Fix `test_block_reduce.cu` to build without parameters. Thanks to Francis Lemaire for this contribution.
- Add missing includes to `grid_even_share.cuh`. Thanks to Francis Lemaire for this contribution.
- Add missing includes to `thread_search.cuh`. Thanks to Francis Lemaire for this contribution.
- Add missing includes to `cub.cuh`. Thanks to Felix Kallenborn for this contribution.
CUB 1.9.8-1 is a variant of 1.9.8 accompanying the NVIDIA HPC SDK 20.3 release. It contains modifications necessary to serve as the implementation of NVC++'s GPU-accelerated C++17 Parallel Algorithms.
CUB 1.9.8 is the first release of CUB to be officially supported and included in the CUDA Toolkit. When compiling CUB in C++11 mode, CUB now caches calls to CUDA attribute query APIs, which improves performance of these queries by 20x to 50x when they are called concurrently by multiple host threads.
- (C++11 or later) Cache calls to `cudaFuncGetAttributes` and `cudaDeviceGetAttribute` within `cub::PtxVersion` and `cub::SmVersion`. These CUDA APIs acquire locks on the CUDA driver/runtime mutex and perform poorly under contention; with the caching, they are 20 to 50x faster when called concurrently. Thanks to Bilge Acun for bringing this issue to our attention.
- `DispatchReduce` now takes an `OutputT` template parameter so that users can specify the intermediate type explicitly.
- Radix sort tuning policy updates to fix performance issues for element types smaller than 4 bytes.
- Change initialization style from copy initialization to direct initialization (which is more permissive) in `AgentReduce` to allow a wider range of types to be used with it.
- Fix bad signed/unsigned comparisons in `WarpReduce`.
- Fix computation of valid lanes in warp-level reduction primitive to correctly handle the case where there are 0 input items per warp.
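
The attribute caching described above can be pictured with a small host-only sketch. It is not CUB's implementation: `expensive_query`, `cached_query`, `kMaxDevices`, and the returned values are all hypothetical stand-ins. It only shows the general idea of paying for a slow, lock-taking query once per device and serving later calls from a lock-free cache.

```cpp
#include <atomic>
#include <thread>

constexpr int kMaxDevices = 8;  // assumed upper bound, for illustration only

// Hypothetical stand-in for a slow, lock-taking attribute query.
// The counter makes the number of real queries observable.
static std::atomic<int> g_query_calls{0};

inline int expensive_query(int device) {
    g_query_calls.fetch_add(1, std::memory_order_relaxed);
    return 700 + device;  // made-up attribute value
}

struct CacheEntry {
    std::atomic<int> state{0};  // 0 = empty, 1 = being filled, 2 = ready
    int value = 0;
};

static CacheEntry g_cache[kMaxDevices];

// Returns the cached value for `device` (requires 0 <= device < kMaxDevices),
// running the expensive query at most once per device.
inline int cached_query(int device) {
    CacheEntry& entry = g_cache[device];
    if (entry.state.load(std::memory_order_acquire) != 2) {
        int expected = 0;
        // Exactly one thread wins the CAS and fills the entry.
        if (entry.state.compare_exchange_strong(expected, 1,
                                                std::memory_order_acq_rel,
                                                std::memory_order_acquire)) {
            entry.value = expensive_query(device);
            entry.state.store(2, std::memory_order_release);
        } else {
            // Another thread is filling (or has filled) the entry: wait for it.
            while (entry.state.load(std::memory_order_acquire) != 2) {
                std::this_thread::yield();
            }
        }
    }
    return entry.value;
}
```

Once an entry is ready, concurrent callers only perform an atomic load, so they no longer contend on a shared mutex.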
CUB 1.8.0 introduces changes to the `cub::Shuffle*` interfaces.
- The interfaces of `cub::ShuffleIndex`, `cub::ShuffleUp`, and `cub::ShuffleDown` have been changed to allow for better computation of the PTX SHFL control constant for logical warps smaller than 32 threads.
- #112: Fix `cub::WarpScan`'s broadcast of warp-wide aggregate for logical warps smaller than 32 threads.
CUB 1.7.5 adds support for radix sorting `__half` keys and improved sorting performance for 1 byte keys. It was incorporated into Thrust 1.9.2.
- Radix sort support for `__half` keys.
- Radix sort tuning policy updates to improve 1 byte key performance.
- Syntax tweaks to mollify Clang.
- #127: `cub::DeviceRunLengthEncode::Encode` returns incorrect results.
- #128: 7-bit sorting passes fail for SM61 with large values.
CUB 1.7.4 is a minor release that was incorporated into Thrust 1.9.1-2.
- #114: Can't pair non-trivially-constructible values in radix sort.
- #115: `cub::WarpReduce` segmented reduction is broken in CUDA 9 for logical warp sizes smaller than 32.
CUB 1.7.3 is a minor release.
- #110: `cub::DeviceHistogram` null-pointer exception bug for iterator inputs.
CUB 1.7.2 is a minor release.
- #104: Device-wide reduction is now "run-to-run" deterministic for pseudo-associative reduction operators (like floating point addition).
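
To see why a "pseudo-associative" operator needs a fixed evaluation order to be run-to-run deterministic, consider the host-only sketch below. It is only an illustration of floating point behaviour, not CUB's reduction; the function names are hypothetical. Summing the same three floats in two different orders yields two different results.

```cpp
#include <cstddef>
#include <vector>

// Sum the values from the first element to the last.
inline float sum_forward(const std::vector<float>& values) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < values.size(); ++i) {
        sum += values[i];
    }
    return sum;
}

// Sum the same values from the last element to the first.
inline float sum_backward(const std::vector<float>& values) {
    float sum = 0.0f;
    for (std::size_t i = values.size(); i > 0; --i) {
        sum += values[i - 1];
    }
    return sum;
}

// For {1e20f, -1e20f, 1.0f}:
//   forward:  (1e20 + -1e20) + 1 keeps the 1
//   backward: (1 + -1e20) + 1e20 loses the 1, because 1 is far below
//             the rounding granularity of a float near 1e20
```

A parallel reduction whose combination order varies between runs can therefore return different sums for identical input; fixing the order removes that variation.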
CUB 1.7.1 delivers improved radix sort performance on SM7x (Volta) GPUs and a number of bug fixes.
- Radix sort tuning policies updated for SM7x (Volta).
- #104: `uint64_t` `cub::WarpReduce` broken for CUB 1.7.0 on CUDA 8 and older.
- #103: Can't mix Thrust from CUDA 9.0 and CUB.
- #102: CUB pulls in `windows.h` which defines `min`/`max` macros that conflict with `std::min`/`std::max`.
- #99: Radix sorting crashes NVCC on Windows 10 for SM52.
- #98: cuda-memcheck: --tool initcheck failed with lineOfSight.
- #94: Git clone size.
- #93: Accept iterators for segment offsets.
- #87: CUB uses anonymous unions which is not valid C++.
- #44: Check for C++11 is incorrect for Visual Studio 2013.
CUB 1.7.0 brings support for CUDA 9.0 and SM7x (Volta) GPUs. It is compatible with independent thread scheduling. It was incorporated into Thrust 1.9.0-5.
- Remove `cub::WarpAll` and `cub::WarpAny`. These functions served to emulate `__all` and `__any` functionality for SM1x devices, which did not have those operations. However, SM1x devices are now deprecated in CUDA, and the interfaces of these two functions are now lacking the lane-mask needed for collectives to run on SM7x and newer GPUs which have independent thread scheduling.
- Remove any assumptions of implicit warp synchronization to be compatible with SM7x's (Volta) independent thread scheduling.
- #86: Incorrect results with reduce-by-key.
CUB 1.6.4 improves radix sorting performance for SM5x (Maxwell) and SM6x (Pascal) GPUs.
- Radix sort tuning policies updated for SM5x (Maxwell) and SM6x (Pascal) - 3.5B and 3.4B 32 byte keys/s on TitanX and GTX 1080, respectively.
- Restore fence work-around for scan (reduce-by-key, etc.) hangs in CUDA 8.5.
- #65: `cub::DeviceSegmentedRadixSort` should allow inputs to have pointer-to-const type.
- Mollify Clang device-side warnings.
- Remove outdated MSVC project files.
CUB 1.6.3 improves support for Windows, changes the `cub::BlockLoad`/`cub::BlockStore` interface to take the local data type, and enhances radix sort performance for SM6x (Pascal) GPUs.
- `cub::BlockLoad` and `cub::BlockStore` are now templated by the local data type, instead of the `Iterator` type. This allows for output iterators having `void` as their `value_type` (e.g. discard iterators).
- Radix sort tuning policies updated for SM6x (Pascal) GPUs - 6.2B 4 byte keys/s on GP100.
- Improved support for Windows (warnings, alignment, etc).
- #74: `cub::WarpReduce` executes reduction operator for out-of-bounds items.
- #72: `cub::InequalityWrapper::operator` should be non-const.
- #71: `cub::KeyValuePair` won't work if `Key` has non-trivial constructor.
- #69: `cub::BlockStore::Store` doesn't compile if `OutputIteratorT::value_type` isn't `T`.
- #68: `cub::TilePrefixCallbackOp::WarpReduce` doesn't permit PTX arch specialization.
CUB 1.6.2 (previously 1.5.5) improves radix sort performance for SM6x (Pascal) GPUs.
- Radix sort tuning policies updated for SM6x (Pascal) GPUs.
- Fix AArch64 compilation of `cub::CachingDeviceAllocator`.
CUB 1.6.1 (previously 1.5.4) is a minor release.
- Fix radix sorting bug introduced by scan refactorization.
CUB 1.6.0 changes the scan and reduce interfaces. Exclusive scans now accept an "initial value" instead of an "identity value". Scans and reductions now support differing input and output sequence types. Additionally, many bugs have been fixed.
- Device/block/warp-wide exclusive scans have been revised to now accept an "initial value" (instead of an "identity value") for seeding the computation with an arbitrary prefix.
- Device-wide reductions and scans can now have input sequence types that are different from output sequence types (as long as they are convertible).
- Reduce repository size by moving the doxygen binary to doc repository.
- Minor reduction in `cub::BlockScan` instruction counts.
- Issue #55: Warning in `cub/device/dispatch/dispatch_reduce_by_key.cuh`.
- Issue #59: `cub::DeviceScan::ExclusiveSum` can't prefix sum of float into double.
- Issue #58: Infinite loop in `cub::CachingDeviceAllocator::NearestPowerOf`.
- Issue #47: `cub::CachingDeviceAllocator` needs to clean up CUDA global error state upon successful retry.
- Issue #46: Very high amount of needed memory from the `cub::DeviceHistogram::HistogramEven`.
- Issue #45: `cub::CachingDeviceAllocator` fails with debug output enabled.
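
The two interface changes in this release (an "initial value" seeding exclusive scans, and input and output sequence types that may differ) can be illustrated with a sequential host-only sketch. This is not CUB code: `exclusive_sum_with_init` is a hypothetical helper that only shows the semantics, here summing `float` inputs into `double` outputs.

```cpp
#include <cstddef>
#include <vector>

// Sequential sketch of an exclusive prefix sum seeded with `init`.
// out[i] = init + in[0] + ... + in[i - 1], accumulated in the output type.
template <typename OutT, typename InT>
std::vector<OutT> exclusive_sum_with_init(const std::vector<InT>& in, OutT init) {
    std::vector<OutT> out(in.size());
    OutT running = init;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;                     // excludes in[i] itself
        running += static_cast<OutT>(in[i]);  // input converted to output type
    }
    return out;
}

// Example: in = {1, 2, 3, 4} (float), init = 10.0 (double)
// gives out = {10, 11, 13, 16} (double).
```

Unlike an identity value, the initial value does not have to be neutral for the operator; it simply becomes the first output and a prefix of every later one.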
CUB 1.5.2 enhances `cub::CachingDeviceAllocator` and improves scan performance for SM5x (Maxwell).
- Improved medium-size scan performance on SM5x (Maxwell).
- Refactored `cub::CachingDeviceAllocator`:
  - Now spends less time locked.
  - Uses C++11's `std::mutex` when available.
  - Failure to allocate a block from the runtime will retry once after freeing cached allocations.
  - Now respects max-bin, fixing an issue where blocks in excess of max-bin were still being retained in the free cache.
- Fix for generic-type reduce-by-key `cub::WarpScan` for SM3x and newer GPUs.
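
The "retry once after freeing cached allocations" behaviour noted for the allocator above follows a common pattern, sketched here in host-only form. `allocate_with_retry` and its two callbacks are hypothetical; the real allocator's internals are not shown in these notes.

```cpp
#include <functional>

// Try to allocate; on failure, release cached blocks and try exactly once more.
// `try_allocate` returns true on success; `free_cached` releases cached memory.
// Both callbacks are placeholders for whatever the real allocator does.
inline bool allocate_with_retry(const std::function<bool()>& try_allocate,
                                const std::function<void()>& free_cached) {
    if (try_allocate()) {
        return true;  // first attempt succeeded, cache left untouched
    }
    free_cached();          // give memory back to the runtime
    return try_allocate();  // single retry; a second failure is reported
}
```

Limiting the retry to one attempt keeps the failure path bounded: if the allocation still fails with an empty cache, the caller sees the error.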
CUB 1.5.1 is a minor release.
- Fix for incorrect `cub::DeviceRadixSort` output for some small problems on SM52 (Maxwell) GPUs.
- Fix for macro redefinition warnings when compiling `thrust::sort`.
CUB 1.5.0 introduces segmented sort and reduction primitives.
- Segmented device-wide operations for device-wide sort and reduction primitives.
- #36: `cub::ThreadLoad` generates compiler errors when loading from pointer-to-const.
- #29: `cub::DeviceRadixSort::SortKeys<bool>` yields compiler errors.
- #26: Misaligned address after `cub::DeviceRadixSort::SortKeys`.
- #25: Fix for incorrect results and crashes when radix sorting 0-length problems.
- Fix CUDA 7.5 issues on SM52 GPUs with SHFL-based warp-scan and warp-reduction on non-primitive data types (e.g. user-defined structs).
- Fix small radix sorting problems where 0 temporary bytes were required and user code was invoking `malloc(0)` on some systems where that returns `NULL`. CUB assumed the user was asking for the size again and not running the sort.
CUB 1.4.1 is a minor release.
- Allow `cub::DeviceRadixSort` and `cub::BlockRadixSort` on bool types.
- Fix minor CUDA 7.0 performance regressions in `cub::DeviceScan` and `cub::DeviceReduceByKey`.
- Remove requirement for callers to define the `CUB_CDP` macro when invoking CUB device-wide routines using CUDA dynamic parallelism.
- Fix headers not being included in the proper order (or missing includes) for some block-wide functions.
CUB 1.4.0 adds `cub::DeviceSpmv`, `cub::DeviceRunLengthEncode::NonTrivialRuns`, improves `cub::DeviceHistogram`, and introduces support for SM5x (Maxwell) GPUs.
- `cub::DeviceSpmv` methods for multiplying sparse matrices by dense vectors, load-balanced using a merge-based parallel decomposition.
- `cub::DeviceRadixSort` sorting entry-points that always return the sorted output into the specified buffer, as opposed to the `cub::DoubleBuffer` in which it could end up in either buffer.
- `cub::DeviceRunLengthEncode::NonTrivialRuns` for finding the starting offsets and lengths of all non-trivial runs (i.e., length > 1) of keys in a given sequence. Useful for top-down partitioning algorithms like MSD sorting of very-large keys.
- Support and performance tuning for SM5x (Maxwell) GPUs.
- Updated cub::DeviceHistogram implementation that provides the same "histogram-even" and "histogram-range" functionality as IPP/NPP. Provides extremely fast and, perhaps more importantly, very uniform performance response across diverse real-world datasets, including pathological (homogeneous) sample distributions.
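
What "non-trivial runs" means in the feature list above can be shown with a sequential host-only sketch. It is not the device-wide algorithm and does not use CUB's interface; `Runs` and `non_trivial_runs` are hypothetical names. It reports the starting offset and length of every run of equal keys longer than one element.

```cpp
#include <cstddef>
#include <vector>

// Offsets and lengths of the runs found, stored in matching positions.
struct Runs {
    std::vector<std::size_t> offsets;
    std::vector<std::size_t> lengths;
};

// Sequential reference: scan the keys once and keep runs with length > 1.
template <typename T>
Runs non_trivial_runs(const std::vector<T>& keys) {
    Runs result;
    std::size_t begin = 0;
    while (begin < keys.size()) {
        std::size_t end = begin + 1;
        while (end < keys.size() && keys[end] == keys[begin]) {
            ++end;  // extend the current run
        }
        if (end - begin > 1) {  // skip runs of a single key
            result.offsets.push_back(begin);
            result.lengths.push_back(end - begin);
        }
        begin = end;
    }
    return result;
}

// Example: keys = {0, 2, 2, 9, 5, 5, 5, 8}
// gives offsets = {1, 4} and lengths = {2, 3}.
```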
CUB 1.3.2 is a minor release.
- Fix `cub::DeviceReduce` where reductions of small problems (small enough to only dispatch a single thread block) would run in the default stream (stream zero) regardless of whether an alternate stream was specified.
CUB 1.3.1 is a minor release.
- Workaround for a benign WAW race warning reported by cuda-memcheck in `cub::BlockScan` specialized for `BLOCK_SCAN_WARP_SCANS` algorithm.
- Fix bug in `cub::DeviceRadixSort` where the algorithm may sort more key bits than the caller specified (up to the nearest radix digit).
- Fix for ~3% `cub::DeviceRadixSort` performance regression on SM2x (Fermi) and SM3x (Kepler) GPUs.
CUB 1.3.0 improves how thread blocks are expressed in block- and warp-wide primitives and adds an enhanced version of `cub::WarpScan`.
- CUB's collective (block-wide, warp-wide) primitives underwent a minor interface refactoring:
  - To provide the appropriate support for multidimensional thread blocks, the interfaces for collective classes are now template-parameterized by X, Y, and Z block dimensions (with `BLOCK_DIM_Y` and `BLOCK_DIM_Z` being optional, and `BLOCK_DIM_X` replacing `BLOCK_THREADS`). Furthermore, the constructors that accept remapped linear thread-identifiers have been removed: all primitives now assume a row-major thread-ranking for multidimensional thread blocks.
  - To allow the host program (compiled by the host-pass) to accurately determine the device-specific storage requirements for a given collective (compiled for each device-pass), the interfaces for collective classes are now (optionally) template-parameterized by the desired PTX compute capability. This is useful when aliasing collective storage to shared memory that has been allocated dynamically by the host at the kernel call site.
  - Most CUB programs having typical 1D usage should not require any changes to accommodate these updates.
- Added "combination" `cub::WarpScan` methods for efficiently computing both inclusive and exclusive prefix scans (and sums).
- Fix for bug in `cub::WarpScan` (which affected `cub::BlockScan` and `cub::DeviceScan`) where incorrect results (e.g., NAN) would often be returned when parameterized for floating-point types (fp32, fp64).
- Workaround for ptxas error when compiling with the -G flag on Linux (for debug instrumentation).
- Fixes for certain scan scenarios using custom scan operators where code compiled for SM1x is run on newer GPUs of higher compute-capability: the compiler could not tell which memory space was being used for collective operations and was mistakenly using global ops instead of shared ops.
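
The row-major thread-ranking mentioned in the interface refactoring above can be written out as a small formula. The host-only sketch below uses a hypothetical helper, `row_major_rank`, to show how a 3D thread index maps to a linear rank when X varies fastest, then Y, then Z.

```cpp
// Linear rank of thread (x, y, z) in a block with extents dim_x and dim_y
// (the Z extent does not enter the formula): X varies fastest, then Y, then Z.
constexpr int row_major_rank(int x, int y, int z, int dim_x, int dim_y) {
    return (z * dim_y + y) * dim_x + x;
}

// In a 4 x 2 x 2 block:
static_assert(row_major_rank(0, 0, 0, 4, 2) == 0, "first thread");
static_assert(row_major_rank(3, 0, 0, 4, 2) == 3, "end of first row");
static_assert(row_major_rank(0, 1, 0, 4, 2) == 4, "start of second row");
static_assert(row_major_rank(0, 0, 1, 4, 2) == 8, "start of second plane");
```

For a 1D block (`dim_y` of 1, with `y` and `z` both 0) the rank reduces to `x`, which is consistent with typical 1D usage needing no changes.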
CUB 1.2.3 is a minor release.
- Fixed access violation bug in `cub::DeviceReduce::ReduceByKey` for non-primitive value types.
- Fixed code-snippet bug in `ArgIndexInputIteratorT` documentation.
CUB 1.2.2 adds a new variant of `cub::BlockReduce` and MSVC project solutions for examples.
- MSVC project solutions for device-wide and block-wide examples
- New algorithmic variant of cub::BlockReduce for improved performance when using commutative operators (e.g., numeric addition).
- Inclusion of Thrust headers in a certain order prevented CUB device-wide primitives from working properly.
CUB 1.2.0 adds `cub::DeviceReduce::ReduceByKey` and `cub::DeviceReduce::RunLengthEncode` and support for CUDA 6.0.
- `cub::DeviceReduce::ReduceByKey`.
- `cub::DeviceReduce::RunLengthEncode`.
- Improved `cub::DeviceScan`, `cub::DeviceSelect`, `cub::DevicePartition` performance.
- Documentation and testing:
  - Added performance-portability plots for many device-wide primitives.
  - Explain iterator (in)compatibilities with CUDA 5.0 (and older) and Thrust 1.6 (and older).
- Revised the operation of temporary tile status bookkeeping for `cub::DeviceScan` (and similar) to be safe for current code run on future platforms (now uses proper fences).
- Fix `cub::DeviceScan` bug where Windows alignment disagreements between host and device regarding user-defined data types would corrupt tile status.
- Fix `cub::BlockScan` bug where certain exclusive scans on custom data types for the `BLOCK_SCAN_WARP_SCANS` variant would return incorrect results for the first thread in the block.
- Added workaround to make `cub::TexRefInputIteratorT` work with CUDA 6.0.
CUB 1.1.1 introduces texture and cache modifier iterators, descending sorting, `cub::DeviceSelect`, `cub::DevicePartition`, `cub::Shuffle*`, and `cub::MaxSmOccupancy`. Additionally, scan and sort performance for older GPUs has been improved and many bugs have been fixed.
- Refactored block-wide I/O (`cub::BlockLoad` and `cub::BlockStore`), removing cache-modifiers from their interfaces. `cub::CacheModifiedInputIterator` and `cub::CacheModifiedOutputIterator` should now be used with `cub::BlockLoad` and `cub::BlockStore` to effect that behavior.
- `cub::TexObjInputIterator`, `cub::TexRefInputIterator`, `cub::CacheModifiedInputIterator`, and `cub::CacheModifiedOutputIterator` types for loading & storing arbitrary types through the cache hierarchy. They are compatible with Thrust.
- Descending sorting for `cub::DeviceRadixSort` and `cub::BlockRadixSort`.
- Min, max, arg-min, and arg-max operators for `cub::DeviceReduce`.
- `cub::DeviceSelect` (select-unique, select-if, and select-flagged).
- `cub::DevicePartition` (partition-if, partition-flagged).
- Generic `cub::ShuffleUp`, `cub::ShuffleDown`, and `cub::ShuffleIndex` for warp-wide communication of arbitrary data types (SM3x and up).
- `cub::MaxSmOccupancy` for accurately determining SM occupancy for any given kernel function pointer.
- Improved `cub::DeviceScan` and `cub::DeviceRadixSort` performance for older GPUs (SM1x to SM3x).
- Renamed device-wide `stream_synchronous` param to `debug_synchronous` to avoid confusion about usage.
- Documentation improvements:
  - Added simple examples of device-wide methods.
  - Improved doxygen documentation and example snippets.
- Improved test coverage to include up to 21,000 kernel variants and 851,000 unit tests (per architecture, per platform).
- Fix misc `cub::DeviceScan`, `BlockScan`, `DeviceReduce`, and `BlockReduce` bugs when operating on non-primitive types for older architectures SM1x.
- SHFL-based scans and reductions produced incorrect results for multi-word types (size > 4B) on Linux.
- For `cub::WarpScan`-based scans, not all threads in the first warp were entering the prefix callback functor.
- `cub::DeviceRadixSort` had a race condition with key-value pairs for pre-SM35 architectures.
- `cub::DeviceRadixSort` bitfield-extract behavior with long keys on 64-bit Linux was incorrect.
- `cub::BlockDiscontinuity` failed to compile for types other than `int32_t`/`uint32_t`.
- CUDA Dynamic Parallelism (CDP, e.g. device-callable) versions of device-wide methods now report the same temporary storage allocation size requirement as their host-callable counterparts.
CUB 1.0.2 is a minor release.
- Corrections to code snippet examples for `cub::BlockLoad`, `cub::BlockStore`, and `cub::BlockDiscontinuity`.
- Cleaned up unnecessary/missing header includes. You can now safely include a specific .cuh (instead of `cub.cuh`).
- Bug/compilation fixes for `cub::BlockHistogram`.
CUB 1.0.1 adds `cub::DeviceRadixSort` and `cub::DeviceScan`. Numerous other performance and correctness fixes are included.
- New collective interface idiom (specialize/construct/invoke).
- `cub::DeviceRadixSort`. Implements short-circuiting for homogenous digit passes.
- `cub::DeviceScan`. Implements single-pass "adaptive-lookback" strategy.
- Significantly improved documentation (with example code snippets).
- More extensive regression test suite for aggressively testing collective variants.
- Allow non-trivially-constructed types (previously unions had prevented aliasing temporary storage of those types).
- Improved support for SM3x SHFL (collective ops now use SHFL for types larger than 32 bits).
- Better code generation for 64-bit addressing within `cub::BlockLoad`/`cub::BlockStore`.
- `cub::DeviceHistogram` now supports histograms of arbitrary bins.
- Updates to accommodate CUDA 5.5 dynamic parallelism.
- Workarounds for SM10 codegen issues in uncommonly-used `cub::WarpScan`/`cub::WarpReduce` specializations.
CUB 0.9.4 is a minor release.
- Various documentation updates and corrections.
- Fixed compilation errors for SM1x.
- Fixed compilation errors for some WarpScan entrypoints on SM3x and up.
CUB 0.9.3 adds histogram algorithms and work management utility descriptors.
- `cub::DeviceHistogram256`.
- `cub::BlockHistogram256`.
- `cub::BlockScan` algorithm variant `BLOCK_SCAN_RAKING_MEMOIZE`, which trades more register consumption for less shared memory I/O.
- `cub::GridQueue`, `cub::GridEvenShare`, work management utility descriptors.
- Updates to `cub::BlockRadixRank` to use `cub::BlockScan`, which improves performance on SM3x by using SHFL.
- Allow types other than builtin types to be used in `cub::WarpScan::*Sum` methods if they only have `operator+` overloaded. Previously they were also required to support assignment from `int(0)`.
- Update `cub::BlockReduce`'s `BLOCK_REDUCE_WARP_REDUCTIONS` algorithm to work even when block size is not an even multiple of warp size.
- Refactoring of `cub::DeviceAllocator` interface and `cub::CachingDeviceAllocator` implementation.
CUB 0.9.2 adds `cub::WarpReduce`.
- `cub::WarpReduce`, which uses the SHFL instruction when applicable. `cub::BlockReduce` now uses this `cub::WarpReduce` instead of implementing its own.
- Documentation updates and corrections.
- Fixes for 64-bit Linux compilation warnings and errors.
CUB 0.9.1 is a minor release.
- Fix for ambiguity in `cub::BlockScan::Reduce` between generic reduction and summation. Summation entrypoints are now called `::Sum()`, similar to the convention in `cub::BlockScan`.
- Small edits to documentation and download tracking.
Initial preview release. CUB is the first durable, high-performance library of cooperative block-level, warp-level, and thread-level primitives for CUDA kernel programming.