Skip to content

Implement parallel cuda::std::stable_partition#8238

Open
miscco wants to merge 2 commits intoNVIDIA:mainfrom
miscco:parallel_stable_partition
Open

Implement parallel cuda::std::stable_partition#8238
miscco wants to merge 2 commits intoNVIDIA:mainfrom
miscco:parallel_stable_partition

Conversation

@miscco
Copy link
Copy Markdown
Contributor

@miscco miscco commented Mar 31, 2026

This implements the stable partition algorithm for the cuda backend.

It provides tests and benchmarks similar to Thrust and some boilerplate for libcu++

The functionality is publicly available yet and implemented in a private internal header

Fixes #8228

@github-project-automation github-project-automation bot moved this to Todo in CCCL Mar 31, 2026
@miscco miscco requested review from a team as code owners March 31, 2026 14:59
@miscco miscco requested review from elstehle and pciolkosz March 31, 2026 14:59
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Mar 31, 2026
@miscco
Copy link
Copy Markdown
Contributor Author

miscco commented Mar 31, 2026

Performance looks good, except for small arrays:

['thrust_stable_partition.json', 'pstl_stable_partition.json']
# base

## [1] NVIDIA RTX A6000

|  T{ct}  |  Elements  |  Entropy  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|------------|-----------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |    2^16    |     1     |  30.179 us |       4.76% |  29.136 us |       3.93% |  -1.043 us |  -3.46% |   SAME   |
|   I8    |    2^20    |     1     |  36.844 us |       4.72% |  32.081 us |      13.19% |  -4.763 us | -12.93% |   FAST   |
|   I8    |    2^24    |     1     | 152.954 us |       0.78% | 153.300 us |       1.66% |   0.346 us |   0.23% |   SAME   |
|   I8    |    2^28    |     1     |   1.963 ms |       1.11% |   2.104 ms |       0.34% | 140.343 us |   7.15% |   SLOW   |
|   I8    |    2^16    |   0.544   |  30.298 us |       5.31% |  29.278 us |       5.05% |  -1.020 us |  -3.37% |   SAME   |
|   I8    |    2^20    |   0.544   |  36.571 us |       3.90% |  31.378 us |      10.85% |  -5.193 us | -14.20% |   FAST   |
|   I8    |    2^24    |   0.544   | 172.664 us |       0.66% | 175.749 us |       1.66% |   3.085 us |   1.79% |   SLOW   |
|   I8    |    2^28    |   0.544   |   2.511 ms |       0.73% |   2.659 ms |       0.48% | 148.345 us |   5.91% |   SLOW   |
|   I8    |    2^16    |     0     |  30.130 us |       3.65% |  29.193 us |       6.98% |  -0.937 us |  -3.11% |   SAME   |
|   I8    |    2^20    |     0     |  37.782 us |       3.99% |  32.574 us |      11.30% |  -5.207 us | -13.78% |   FAST   |
|   I8    |    2^24    |     0     | 204.069 us |       0.86% | 208.020 us |       1.38% |   3.951 us |   1.94% |   SLOW   |
|   I8    |    2^28    |     0     |   2.996 ms |       0.70% |   3.169 ms |       0.52% | 173.603 us |   5.80% |   SLOW   |
|   I16   |    2^16    |     1     |  23.812 us |       2.61% |  22.819 us |       4.78% |  -0.993 us |  -4.17% |   FAST   |
|   I16   |    2^20    |     1     |  39.898 us |       4.19% |  37.668 us |       5.10% |  -2.230 us |  -5.59% |   FAST   |
|   I16   |    2^24    |     1     | 233.253 us |       3.25% | 230.850 us |       1.54% |  -2.403 us |  -1.03% |   SAME   |
|   I16   |    2^28    |     1     |   3.311 ms |       0.90% |   3.356 ms |       0.60% |  44.972 us |   1.36% |   SLOW   |
|   I16   |    2^16    |   0.544   |  30.447 us |       5.87% |  29.085 us |       6.86% |  -1.362 us |  -4.47% |   SAME   |
|   I16   |    2^20    |   0.544   |  40.506 us |       1.98% |  38.793 us |       5.41% |  -1.713 us |  -4.23% |   FAST   |
|   I16   |    2^24    |   0.544   | 279.278 us |       0.74% | 278.312 us |       1.27% |  -0.966 us |  -0.35% |   SAME   |
|   I16   |    2^28    |   0.544   |   4.186 ms |       0.75% |   4.235 ms |       0.54% |  48.690 us |   1.16% |   SLOW   |
|   I16   |    2^16    |     0     |  30.059 us |       2.64% |  29.251 us |       5.49% |  -0.808 us |  -2.69% |   FAST   |
|   I16   |    2^20    |     0     |  41.284 us |       3.57% |  39.863 us |       5.06% |  -1.421 us |  -3.44% |   SAME   |
|   I16   |    2^24    |     0     | 342.282 us |       0.74% | 341.046 us |       1.13% |  -1.237 us |  -0.36% |   SAME   |
|   I16   |    2^28    |     0     |   5.165 ms |       0.63% |   5.211 ms |       0.54% |  45.747 us |   0.89% |   SLOW   |
|   I32   |    2^16    |     1     |  24.073 us |       5.62% |  22.907 us |       4.82% |  -1.167 us |  -4.85% |   FAST   |
|   I32   |    2^20    |     1     |  41.817 us |       1.84% |  35.707 us |       3.13% |  -6.110 us | -14.61% |   FAST   |
|   I32   |    2^24    |     1     | 416.785 us |       0.28% | 413.820 us |       0.64% |  -2.964 us |  -0.71% |   FAST   |
|   I32   |    2^28    |     1     |   6.334 ms |       0.52% |   6.354 ms |       0.10% |  20.521 us |   0.32% |   SLOW   |
|   I32   |    2^16    |   0.544   |  30.664 us |       5.28% |  29.494 us |       3.76% |  -1.170 us |  -3.81% |   FAST   |
|   I32   |    2^20    |   0.544   |  49.686 us |       1.95% |  43.188 us |       2.76% |  -6.498 us | -13.08% |   FAST   |
|   I32   |    2^24    |   0.544   | 507.327 us |       0.27% | 504.586 us |       0.55% |  -2.741 us |  -0.54% |   FAST   |
|   I32   |    2^28    |   0.544   |   7.887 ms |       0.52% |   7.905 ms |       0.08% |  17.372 us |   0.22% |   SLOW   |
|   I32   |    2^16    |     0     |  30.682 us |       2.64% |  29.760 us |       5.21% |  -0.922 us |  -3.01% |   FAST   |
|   I32   |    2^20    |     0     |  51.045 us |       2.36% |  45.657 us |       4.13% |  -5.388 us | -10.55% |   FAST   |
|   I32   |    2^24    |     0     | 615.270 us |       0.28% | 612.040 us |       0.49% |  -3.230 us |  -0.53% |   FAST   |
|   I32   |    2^28    |     0     |   9.506 ms |       0.62% |   9.528 ms |       0.08% |  21.585 us |   0.23% |   SLOW   |
|   I64   |    2^16    |     1     |  25.723 us |       4.89% |  23.506 us |       5.13% |  -2.217 us |  -8.62% |   FAST   |
|   I64   |    2^20    |     1     |  71.080 us |       1.27% |  65.907 us |       3.15% |  -5.174 us |  -7.28% |   FAST   |
|   I64   |    2^24    |     1     | 812.423 us |       0.29% | 807.194 us |       0.34% |  -5.229 us |  -0.64% |   FAST   |
|   I64   |    2^28    |     1     |  12.649 ms |       0.33% |  12.642 ms |       0.05% |  -6.832 us |  -0.05% |   FAST   |
|   I64   |    2^16    |   0.544   |  32.716 us |       5.12% |  30.384 us |       2.90% |  -2.332 us |  -7.13% |   FAST   |
|   I64   |    2^20    |   0.544   |  82.335 us |       1.29% |  75.231 us |       2.64% |  -7.104 us |  -8.63% |   FAST   |
|   I64   |    2^24    |   0.544   | 992.800 us |       0.17% | 989.138 us |       2.36% |  -3.662 us |  -0.37% |   FAST   |
|   I64   |    2^28    |   0.544   |  15.517 ms |       0.37% |  15.513 ms |       0.05% |  -3.684 us |  -0.02% |   SAME   |
|   I64   |    2^16    |     0     |  32.778 us |       4.66% |  30.362 us |       2.90% |  -2.416 us |  -7.37% |   FAST   |
|   I64   |    2^20    |     0     |  98.944 us |       1.11% |  91.952 us |       2.50% |  -6.992 us |  -7.07% |   FAST   |
|   I64   |    2^24    |     0     |   1.208 ms |       0.15% |   1.203 ms |       0.24% |  -4.716 us |  -0.39% |   FAST   |
|   I64   |    2^28    |     0     |  18.956 ms |       0.28% |  18.955 ms |       0.11% |  -0.489 us |  -0.00% |   SAME   |
|  I128   |    2^16    |     1     |  28.586 us |       3.99% |  23.575 us |       5.07% |  -5.011 us | -17.53% |   FAST   |
|  I128   |    2^20    |     1     | 122.405 us |       2.29% | 115.386 us |       1.65% |  -7.018 us |  -5.73% |   FAST   |
|  I128   |    2^24    |     1     |   1.605 ms |       0.17% |   1.603 ms |       0.24% |  -2.180 us |  -0.14% |   SAME   |
|  I128   |    2^28    |     1     |  25.365 ms |       0.19% |  25.387 ms |       0.05% |  21.911 us |   0.09% |   SLOW   |
|  I128   |    2^16    |   0.544   |  35.593 us |       2.31% |  30.330 us |       3.44% |  -5.264 us | -14.79% |   FAST   |
|  I128   |    2^20    |   0.544   | 149.305 us |       0.95% | 141.124 us |      15.91% |  -8.181 us |  -5.48% |   FAST   |
|  I128   |    2^24    |   0.544   |   1.989 ms |       1.80% |   1.975 ms |       0.20% | -14.218 us |  -0.71% |   FAST   |
|  I128   |    2^28    |   0.544   |  31.338 ms |       0.11% |  31.356 ms |       0.04% |  18.070 us |   0.06% |   SLOW   |
|  I128   |    2^16    |     0     |  36.027 us |       2.74% |  30.034 us |       3.14% |  -5.993 us | -16.64% |   FAST   |
|  I128   |    2^20    |     0     | 179.388 us |       0.64% | 170.127 us |       2.39% |  -9.260 us |  -5.16% |   FAST   |
|  I128   |    2^24    |     0     |   2.428 ms |       0.87% |   2.423 ms |       0.17% |  -4.314 us |  -0.18% |   FAST   |
|  I128   |    2^28    |     0     |  38.433 ms |       0.05% |  38.486 ms |       0.03% |  52.482 us |   0.14% |   SLOW   |
|   F32   |    2^16    |     1     |  24.539 us |       3.85% |  22.561 us |       4.74% |  -1.978 us |  -8.06% |   FAST   |
|   F32   |    2^20    |     1     |  42.586 us |       1.84% |  36.096 us |       2.81% |  -6.490 us | -15.24% |   FAST   |
|   F32   |    2^24    |     1     | 426.293 us |       2.75% | 414.904 us |       0.98% | -11.389 us |  -2.67% |   FAST   |
|   F32   |    2^28    |     1     |   6.366 ms |       0.61% |   6.364 ms |       0.10% |  -1.972 us |  -0.03% |   SAME   |
|   F32   |    2^16    |   0.544   |  31.225 us |       4.68% |  30.282 us |       4.21% |  -0.944 us |  -3.02% |   SAME   |
|   F32   |    2^20    |   0.544   |  51.748 us |       2.23% |  45.713 us |       5.47% |  -6.035 us | -11.66% |   FAST   |
|   F32   |    2^24    |   0.544   | 611.373 us |       0.30% | 607.623 us |       0.46% |  -3.749 us |  -0.61% |   FAST   |
|   F32   |    2^28    |   0.544   |   9.369 ms |       0.53% |   9.365 ms |       0.07% |  -4.185 us |  -0.04% |   SAME   |
|   F32   |    2^16    |     0     |  31.584 us |       5.23% |  29.694 us |       4.41% |  -1.889 us |  -5.98% |   FAST   |
|   F32   |    2^20    |     0     |  51.791 us |       2.39% |  45.221 us |       2.24% |  -6.570 us | -12.69% |   FAST   |
|   F32   |    2^24    |     0     | 616.578 us |       0.27% | 613.071 us |       0.48% |  -3.507 us |  -0.57% |   FAST   |
|   F32   |    2^28    |     0     |   9.527 ms |       0.48% |   9.530 ms |       0.07% |   3.694 us |   0.04% |   SAME   |
|   F64   |    2^16    |     1     |  26.844 us |       7.06% |  23.035 us |       4.23% |  -3.809 us | -14.19% |   FAST   |
|   F64   |    2^20    |     1     |  71.937 us |       1.81% |  65.336 us |       3.80% |  -6.601 us |  -9.18% |   FAST   |
|   F64   |    2^24    |     1     | 812.408 us |       0.23% | 807.173 us |       0.34% |  -5.235 us |  -0.64% |   FAST   |
|   F64   |    2^28    |     1     |  12.658 ms |       0.33% |  12.657 ms |       0.05% |  -0.973 us |  -0.01% |   SAME   |
|   F64   |    2^16    |   0.544   |  33.476 us |       5.00% |  30.172 us |       3.47% |  -3.303 us |  -9.87% |   FAST   |
|   F64   |    2^20    |   0.544   |  97.826 us |       0.95% |  88.960 us |       2.66% |  -8.866 us |  -9.06% |   FAST   |
|   F64   |    2^24    |   0.544   |   1.187 ms |       0.23% |   1.183 ms |       0.29% |  -4.390 us |  -0.37% |   FAST   |
|   F64   |    2^28    |   0.544   |  18.462 ms |       0.34% |  18.454 ms |       0.05% |  -8.047 us |  -0.04% |   SAME   |
|   F64   |    2^16    |     0     |  33.291 us |       3.02% |  30.010 us |       4.02% |  -3.281 us |  -9.86% |   FAST   |
|   F64   |    2^20    |     0     |  99.584 us |       1.16% |  91.254 us |       2.79% |  -8.330 us |  -8.36% |   FAST   |
|   F64   |    2^24    |     0     |   1.208 ms |       0.46% |   1.204 ms |       0.27% |  -4.783 us |  -0.40% |   FAST   |
|   F64   |    2^28    |     0     |  18.965 ms |       0.30% |  18.968 ms |       0.04% |   3.226 us |   0.02% |   SAME   |

@github-actions

This comment has been minimized.

miscco added 2 commits April 1, 2026 08:40
This implements the partition algorithms for the cuda backend.

* `std::partition` see https://en.cppreference.com/w/cpp/algorithm/partition.html
* `std::partition_copy` see https://en.cppreference.com/w/cpp/algorithm/partition_copy.html

It provides tests and benchmarks similar to Thrust and some boilerplate for libcu++

The functionality is publicly available yet and implemented in a private internal header

Fixes NVIDIA#7745
This implements the stable partition algorithm for the cuda backend.

* `std::stable_partition` see https://en.cppreference.com/w/cpp/algorithm/stable_partition.html

It provides tests and benchmarks similar to Thrust and some boilerplate for libcu++

The functionality is publicly available yet and implemented in a private internal header

Fixes NVIDIA#8228
@miscco miscco force-pushed the parallel_stable_partition branch from 53663fa to ac0aaa4 Compare April 1, 2026 06:40
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

🥳 CI Workflow Results

🟩 Finished in 1h 33m: Pass: 100%/99 | Total: 1d 05h | Max: 55m 06s | Hits: 96%/258450

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[FEA]: Implement CUDA backend for parallel cuda::std::stable_partition

1 participant