-
Notifications
You must be signed in to change notification settings - Fork 283
[Backport 3.1]: [CUB] Replace several direct uses of __clz
(#6099)
#6202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* Replace `__clz` in `warp_scan_shfl.cuh`. * Replace `__clz` in `block_radix_rank.cuh` * Replace `__clz` in `warp_reduce_shfl.cuh` * Replace `__clz` in `warp_reduce_smem.cuh` * Replace thrust's `clz` with `cuda::std::countl` * Fully qualify with `::cuda` * Fixup types or copy paste mistakes * Address review comments, `countr_zero` instead of `countl(brev())` * Use __bit_log2 for warp ballot index. * Use `__bit_log2` for block leader in ComputeRanksItem * Ensure that we static cast in `__clz` to int in case we deal with ARM * Rename variable to not conflict with builtin * Use `__bit_log2` * Fix incorrect transformation * Drop internal `clz` function in favor of `countl_zero` * Drop unneeded include * Fix return type of `__ballot_sync` to unsigned * fix typo * Be super safe about unsigned integers * Fix argument type in radix_rank --------- Co-authored-by: Michael Schellenberger Costa <[email protected]>
#6099, the source of this backport, omitted a performance analysis of the change. Let's wait with merging this until we know that it does not regress. |
🟨 CI finished in 2h 22m: Pass: 99%/205 | Total: 3d 14h | Avg: 25m 11s | Max: 1h 57m | Hits: 75%/338630
|
Project | |
---|---|
CCCL Infrastructure | |
CCCL Packaging | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
stdpar | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | CCCL Packaging |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | stdpar |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 205)
# | Runner |
---|---|
128 | linux-amd64-cpu16 |
23 | windows-amd64-cpu16 |
14 | linux-amd64-gpu-h100-latest-1 |
14 | linux-amd64-gpu-rtxa6000-latest-1 |
12 | linux-arm64-cpu16 |
11 | linux-amd64-gpu-rtx2080-latest-1 |
3 | linux-amd64-gpu-rtx4090-latest-1 |
Integer num_bits = 8 * sizeof(Integer); | ||
Integer num_bits_minus_one = num_bits - 1; | ||
|
||
return num_bits_minus_one - clz(x); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thrust::detail::log2
will be deleted in #6188
#include <cuda/std/__algorithm/clamp.h> | ||
#include <cuda/std/__bit/has_single_bit.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these two headers are never used if I'm not wrong
#include <cuda/std/__algorithm/max.h> | ||
#include <cuda/std/__bit/integral.h> | ||
#include <cuda/std/__functional/operations.h> | ||
#include <cuda/std/__type_traits/conditional.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks unused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use cuda::std::_If
AFAIK
We shouldn't be addressing review comments in the backport unless there are material conflicts with the old branch. We should fix these issues in main first and either cherry-pick the changes to the backport or do another backport. That way we aren't cross-pollinating changes between branches. |
🟩 CI finished in 4h 47m: Pass: 100%/205 | Total: 4d 14h | Avg: 32m 13s | Max: 3h 06m | Hits: 75%/339883
|
Project | |
---|---|
CCCL Infrastructure | |
CCCL Packaging | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
stdpar | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | CCCL Packaging |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | stdpar |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 205)
# | Runner |
---|---|
128 | linux-amd64-cpu16 |
23 | windows-amd64-cpu16 |
14 | linux-amd64-gpu-h100-latest-1 |
14 | linux-amd64-gpu-rtxa6000-latest-1 |
12 | linux-arm64-cpu16 |
11 | linux-amd64-gpu-rtx2080-latest-1 |
3 | linux-amd64-gpu-rtx4090-latest-1 |
Replace
__clz
inwarp_scan_shfl.cuh
.Replace
__clz
inblock_radix_rank.cuh
Replace
__clz
inwarp_reduce_shfl.cuh
Replace
__clz
inwarp_reduce_smem.cuh
Replace thrust's
clz
withcuda::std::countl
Fully qualify with
::cuda
Fixup types or copy paste mistakes
Address review comments,
countr_zero
instead ofcountl(brev())
Use __bit_log2 for warp ballot index.
Use
__bit_log2
for block leader in ComputeRanksItemEnsure that we static cast in
__clz
to int in case we deal with ARMRename variable to not conflict with builtin
Use
__bit_log2
Fix incorrect transformation
Drop internal
clz
function in favor ofcountl_zero
Drop unneeded include
Fix return type of
__ballot_sync
to unsignedfix typo
Be super safe about unsigned integers
Fix argument type in radix_rank