Conversation
miscco
left a comment
This is not using SIMD on the host, is there any reason for that?
First, because this is the first PR. Second, because we care more about GPU than CPU. Third, the feature is also experimental in other standard libraries.
mhoemmen
left a comment
Per offline discussion, Federico will first update to the latest draft N5032 ( https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/n5032.pdf ).
```cpp
#include <cuda/std/__cccl/prologue.h>

namespace cuda::experimental::datapar
```
WG21 adopted P3691R1 at the June 2025 Sofia meeting. This renamed the namespace to `simd` and renamed the data types `basic_mask`/`mask` and `basic_vec`/`vec`.
```cpp
namespace cuda::experimental::datapar
{
namespace simd_abi
```
[simd] does not declare any public namespaces other than std::simd.
```cpp
template <int _Np>
using fixed_size = __fixed_size<_Np>;

template <typename>
using compatible = fixed_size<1>;

template <typename>
using native = fixed_size<1>;
```
native-abi is exposition-only (see [simd.expos.abi]). The others don't appear to be named at all in [simd].
```cpp
template <typename _Tp, typename _Abi>
class basic_simd;

template <typename _Tp, int _Np>
using simd = basic_simd<_Tp, simd_abi::fixed_size<_Np>>;
```
The synopsis declares `basic_vec` like this, where @_X_@ indicates italic code-font X (meaning that it's an exposition-only name):

```cpp
template<class T, class Abi = @_native-abi_@<T>> class basic_vec;
template<class T, @_simd-size-type_@ N = @_simd-size-v_@<T, @_native-abi_@<T>>>
  using vec = basic_vec<T, @_deduce-abi-t_@<T, N>>;
```
😬 CI Workflow Results
🟥 Finished in 16m 54s: Pass: 12%/48 | Total: 5h 17m | Max: 16m 27s | Hits: 99%/1356
Motivations
Modern GPU architectures increasingly expose fine-grained, single-thread SIMD capabilities to maximize throughput within individual CUDA threads. While the GPU programming model strongly focuses on the SIMT model, newer hardware relies on specialized SIMD operations to saturate execution units. Some examples include:

- `int16_t` SIMD instructions DPX.
- `FADDx2`, `FMULx2`, `FMAx2`, `Bfloat16x2` and `Halfx2` intrinsics.
- `IADD3`.
- `__dp4a`.
- `vabsdiff4`.

C++26 `std::simd` provides a standardized abstraction to write vectorized code. This is a great opportunity to unify the customized code that handles all these variants and to reduce CUDA software fragmentation. By adopting a `std::simd`-like API, developers can write a single vectorized kernel that compiles to the optimal instructions for any GPU architecture.

PR Goals and Non-Goals
The PR aims to provide a basic implementation of `std::simd` and the foundation for future optimizations and extensions.

Advanced math and bit operations, e.g. `std::abs`, `std::pow`, `std::popcount`, etc., as well as `std::complex` binding, are outside the scope of the first PR.

Non-Goals:

- `std::simd`.

Implementation Notes
The implementation is based on the LLVM code in experimental/__simd, extended to support the related C++ proposals:

Some optimizations are already exploited in the CCCL code, for example thread_simd.h and thread_reduce.h. They will be gradually added to the implementation.
Partially addresses #30