Avoid passing uninitialized values to scan_op #8184

Open
bernhardmgruber wants to merge 14 commits into NVIDIA:main from bernhardmgruber:warpspeed_fix_oob_scan_ob

Conversation

@bernhardmgruber
Contributor

@bernhardmgruber bernhardmgruber commented Mar 26, 2026

This PR adds the necessary checks to not pass uninitialized data to the scan operator in warpspeed scan.

There are two approaches:

  1. Initializing the variables to be scanned with a valid value, then partially overwriting them with the input data, and scanning the whole tile regardless of how many items in it are valid. This may lose performance when initializing the variables, but generates good code for the scanning part (no branching based on the number of valid items).
  2. Guarding the scanning part with branches in the last tile, so that only the items holding actual values are scanned. This has zero extra initialization cost but adds several branches to the scanning part.

Approach 1 is proposed in #8134. This PR initially aimed for approach 2, but is now a mix of both.
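
To make the trade-off between the two approaches concrete, here is a minimal sequential sketch in plain C++. It is not the warpspeed kernel: `sketch_tile_size`, `scan_prefilled`, and `scan_guarded` are hypothetical names, and the tile is a small array standing in for per-thread registers.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical tile size; the real tile shape is not shown in this thread.
constexpr std::size_t sketch_tile_size = 8;

// Approach 1: prefill every slot with a valid value, overwrite the valid
// prefix with the input, then scan the whole tile without a per-item branch.
// Precondition: in.size() <= sketch_tile_size.
template <class T, class Op>
std::vector<T> scan_prefilled(const std::vector<T>& in, Op op, T fill)
{
  std::array<T, sketch_tile_size> tile;
  tile.fill(fill); // extra initialization cost
  for (std::size_t i = 0; i < in.size(); ++i)
  {
    tile[i] = in[i]; // partial overwrite with real data
  }
  for (std::size_t i = 1; i < sketch_tile_size; ++i)
  {
    tile[i] = op(tile[i - 1], tile[i]); // loop does not depend on in.size()
  }
  // Results computed in the padding slots are discarded.
  return std::vector<T>(tile.begin(), tile.begin() + in.size());
}

// Approach 2: guard every scan step so that op only sees real input.
// Precondition: in.size() <= sketch_tile_size.
template <class T, class Op>
std::vector<T> scan_guarded(const std::vector<T>& in, Op op)
{
  // Only the valid items are materialized; the missing slots stand in for
  // uninitialized padding that must never reach op.
  std::vector<T> tile(in);
  for (std::size_t i = 1; i < sketch_tile_size; ++i)
  {
    if (i < tile.size()) // the extra branch in the scanning part
    {
      tile[i] = op(tile[i - 1], tile[i]);
    }
  }
  return tile;
}
```

Both functions return the same inclusive scan of the valid items. The first pays for the fill, the second pays for a comparison in every step.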

I tried guarding the individual scan steps, but that led to significant regressions:

Details
## [0] NVIDIA B200

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |      I32      |      2^16      |   9.237 us |       0.77% |   9.237 us |       2.49% |   0.000 us |   0.00% |   SAME   |
|   I8    |      I32      |      2^20      |  12.260 us |       8.53% |  12.385 us |       8.01% |   0.125 us |   1.02% |   SAME   |
|   I8    |      I32      |      2^24      |  31.257 us |       2.76% |  34.686 us |       3.16% |   3.430 us |  10.97% |   SLOW   |
|   I8    |      I32      |      2^28      | 261.211 us |       0.43% | 308.538 us |       0.44% |  47.327 us |  18.12% |   SLOW   |
|   I8    |      I64      |      2^16      |   9.136 us |       2.83% |   9.142 us |       3.17% |   0.006 us |   0.07% |   SAME   |
|   I8    |      I64      |      2^20      |  12.534 us |       7.93% |  12.678 us |       8.87% |   0.143 us |   1.14% |   SAME   |
|   I8    |      I64      |      2^24      |  31.425 us |       2.57% |  34.577 us |       3.19% |   3.152 us |  10.03% |   SLOW   |
|   I8    |      I64      |      2^28      | 262.144 us |       0.45% | 307.878 us |       0.49% |  45.734 us |  17.45% |   SLOW   |
|   I8    |      I64      |      2^32      |   3.967 ms |       0.03% |   4.744 ms |       0.04% | 777.203 us |  19.59% |   SLOW   |
|   I16   |      I32      |      2^16      |  11.034 us |       0.96% |  11.288 us |       2.54% |   0.254 us |   2.30% |   SLOW   |
|   I16   |      I32      |      2^20      |  13.676 us |       6.82% |  14.178 us |       7.12% |   0.503 us |   3.68% |   SAME   |
|   I16   |      I32      |      2^24      |  31.050 us |       4.19% |  33.261 us |       3.86% |   2.211 us |   7.12% |   SLOW   |
|   I16   |      I32      |      2^28      | 250.131 us |       0.52% | 290.603 us |       0.43% |  40.472 us |  16.18% |   SLOW   |
|   I16   |      I64      |      2^16      |  11.016 us |       0.86% |  11.190 us |       1.63% |   0.174 us |   1.58% |   SLOW   |
|   I16   |      I64      |      2^20      |  13.737 us |       6.95% |  14.285 us |       7.11% |   0.548 us |   3.99% |   SAME   |
|   I16   |      I64      |      2^24      |  30.760 us |       4.72% |  33.362 us |       4.11% |   2.601 us |   8.46% |   SLOW   |
|   I16   |      I64      |      2^28      | 249.678 us |       0.58% | 290.062 us |       0.46% |  40.384 us |  16.17% |   SLOW   |
|   I16   |      I64      |      2^32      |   3.745 ms |       0.11% |   4.398 ms |       0.04% | 652.685 us |  17.43% |   SLOW   |
|   I32   |      I32      |      2^16      |   9.750 us |       9.31% |  10.190 us |      10.03% |   0.440 us |   4.51% |   SAME   |
|   I32   |      I32      |      2^20      |  13.781 us |       6.50% |  14.185 us |       7.27% |   0.404 us |   2.93% |   SAME   |
|   I32   |      I32      |      2^24      |  37.643 us |       4.15% |  39.682 us |       4.02% |   2.039 us |   5.42% |   SLOW   |
|   I32   |      I32      |      2^28      | 320.656 us |       0.58% | 335.418 us |       0.51% |  14.762 us |   4.60% |   SLOW   |
|   I32   |      I64      |      2^16      |   9.723 us |       9.47% |  10.423 us |       9.79% |   0.700 us |   7.20% |   SAME   |
|   I32   |      I64      |      2^20      |  13.815 us |       6.45% |  13.950 us |       6.99% |   0.135 us |   0.98% |   SAME   |
|   I32   |      I64      |      2^24      |  37.615 us |       4.08% |  39.549 us |       4.22% |   1.934 us |   5.14% |   SLOW   |
|   I32   |      I64      |      2^28      | 320.498 us |       0.52% | 335.587 us |       0.47% |  15.089 us |   4.71% |   SLOW   |
|   I32   |      I64      |      2^32      |   4.995 ms |       1.05% |   5.338 ms |       1.57% | 343.183 us |   6.87% |   SLOW   |
|   I64   |      I32      |      2^16      |  11.463 us |       6.02% |  11.381 us |       4.72% |  -0.083 us |  -0.72% |   SAME   |
|   I64   |      I32      |      2^20      |  15.746 us |       5.95% |  15.483 us |       3.94% |  -0.263 us |  -1.67% |   SAME   |
|   I64   |      I32      |      2^24      |  59.992 us |       1.93% |  60.882 us |       1.89% |   0.890 us |   1.48% |   SAME   |
|   I64   |      I32      |      2^28      | 695.400 us |       0.16% | 718.731 us |       0.17% |  23.331 us |   3.36% |   SLOW   |
|   I64   |      I64      |      2^16      |  11.267 us |       0.61% |  11.261 us |       0.60% |  -0.006 us |  -0.05% |   SAME   |
|   I64   |      I64      |      2^20      |  15.295 us |       1.64% |  15.354 us |       1.23% |   0.058 us |   0.38% |   SAME   |
|   I64   |      I64      |      2^24      |  59.879 us |       1.88% |  61.239 us |       1.99% |   1.360 us |   2.27% |   SLOW   |
|   I64   |      I64      |      2^28      | 695.808 us |       0.17% | 719.205 us |       0.17% |  23.397 us |   3.36% |   SLOW   |
|   I64   |      I64      |      2^32      |  11.036 ms |       0.60% |  11.249 ms |       0.02% | 212.231 us |   1.92% |   SLOW   |
|  I128   |      I32      |      2^16      |  13.209 us |       2.26% |  12.723 us |       7.31% |  -0.485 us |  -3.68% |   FAST   |
|  I128   |      I32      |      2^20      |  23.454 us |       1.24% |  23.584 us |       1.44% |   0.130 us |   0.56% |   SAME   |
|  I128   |      I32      |      2^24      | 170.889 us |       0.62% | 171.804 us |       0.64% |   0.915 us |   0.54% |   SAME   |
|  I128   |      I32      |      2^28      |   2.514 ms |       0.10% |   2.520 ms |       0.10% |   5.988 us |   0.24% |   SLOW   |
|  I128   |      I64      |      2^16      |  12.472 us |       7.29% |  12.638 us |       7.42% |   0.166 us |   1.33% |   SAME   |
|  I128   |      I64      |      2^20      |  23.264 us |       1.80% |  23.583 us |       1.57% |   0.319 us |   1.37% |   SAME   |
|  I128   |      I64      |      2^24      | 170.597 us |       0.61% | 171.632 us |       0.68% |   1.036 us |   0.61% |   SAME   |
|  I128   |      I64      |      2^28      |   2.515 ms |       0.11% |   2.520 ms |       0.10% |   5.078 us |   0.20% |   SLOW   |
|  I128   |      I64      |      2^32      |  40.059 ms |       0.02% |  40.137 ms |       0.02% |  77.492 us |   0.19% |   SLOW   |
|   F32   |      I32      |      2^16      |  10.762 us |       6.60% |  10.974 us |       5.52% |   0.211 us |   1.96% |   SAME   |
|   F32   |      I32      |      2^20      |  13.998 us |       7.26% |  14.335 us |       7.14% |   0.336 us |   2.40% |   SAME   |
|   F32   |      I32      |      2^24      |  40.269 us |       3.80% |  40.885 us |       4.07% |   0.617 us |   1.53% |   SAME   |
|   F32   |      I32      |      2^28      | 346.253 us |       0.47% | 354.709 us |       0.46% |   8.456 us |   2.44% |   SLOW   |
|   F32   |      I64      |      2^16      |  10.657 us |       8.21% |  11.055 us |       5.65% |   0.399 us |   3.74% |   SAME   |
|   F32   |      I64      |      2^20      |  13.963 us |       7.04% |  14.366 us |       7.24% |   0.403 us |   2.89% |   SAME   |
|   F32   |      I64      |      2^24      |  41.014 us |       4.26% |  40.930 us |       3.84% |  -0.085 us |  -0.21% |   SAME   |
|   F32   |      I64      |      2^28      | 346.076 us |       0.42% | 354.647 us |       0.45% |   8.571 us |   2.48% |   SLOW   |
|   F32   |      I64      |      2^32      |   5.239 ms |       0.40% |   5.376 ms |       0.36% | 137.578 us |   2.63% |   SLOW   |
|   F64   |      I32      |      2^16      |  11.160 us |       1.18% |  11.306 us |       2.93% |   0.146 us |   1.31% |   SLOW   |
|   F64   |      I32      |      2^20      |  15.438 us |       2.55% |  15.791 us |       5.27% |   0.354 us |   2.29% |   SAME   |
|   F64   |      I32      |      2^24      |  64.858 us |       1.47% |  66.396 us |       1.48% |   1.539 us |   2.37% |   SLOW   |
|   F64   |      I32      |      2^28      | 785.357 us |       0.16% | 810.879 us |       0.14% |  25.522 us |   3.25% |   SLOW   |
|   F64   |      I64      |      2^16      |  11.058 us |       1.35% |  11.067 us |       1.75% |   0.009 us |   0.08% |   SAME   |
|   F64   |      I64      |      2^20      |  15.400 us |       1.41% |  15.653 us |       4.54% |   0.253 us |   1.64% |   SLOW   |
|   F64   |      I64      |      2^24      |  64.822 us |       1.53% |  66.169 us |       1.51% |   1.347 us |   2.08% |   SLOW   |
|   F64   |      I64      |      2^28      | 785.231 us |       0.13% | 811.046 us |       0.13% |  25.816 us |   3.29% |   SLOW   |
|   F64   |      I64      |      2^32      |  12.340 ms |       0.01% |  12.740 ms |       0.01% | 400.013 us |   3.24% |   SLOW   |
|   C32   |      I32      |      2^16      |  10.994 us |       6.24% |  11.290 us |       1.97% |   0.296 us |   2.69% |   SLOW   |
|   C32   |      I32      |      2^20      |  15.454 us |       4.01% |  15.416 us |       2.42% |  -0.038 us |  -0.25% |   SAME   |
|   C32   |      I32      |      2^24      |  60.107 us |       1.98% |  61.190 us |       2.04% |   1.083 us |   1.80% |   SAME   |
|   C32   |      I32      |      2^28      | 682.906 us |       0.18% | 721.835 us |       0.15% |  38.929 us |   5.70% |   SLOW   |
|   C32   |      I64      |      2^16      |  10.257 us |      10.03% |  11.320 us |       3.26% |   1.063 us |  10.36% |   SLOW   |
|   C32   |      I64      |      2^20      |  15.546 us |       4.81% |  15.421 us |       2.55% |  -0.124 us |  -0.80% |   SAME   |
|   C32   |      I64      |      2^24      |  60.094 us |       1.78% |  61.198 us |       2.05% |   1.105 us |   1.84% |   SLOW   |
|   C32   |      I64      |      2^28      | 683.419 us |       0.20% | 722.465 us |       0.16% |  39.046 us |   5.71% |   SLOW   |
|   C32   |      I64      |      2^32      |  10.660 ms |       0.02% |  11.303 ms |       0.02% | 642.811 us |   6.03% |   SLOW   |

So I switched to an approach that branches earlier, based on whether we are handling the last tile or not. This costs about 1% or less for I8, I16, and I128, which I think is acceptable. The only way left to avoid this regression is to allow the scan operator to run on garbage data, which is formally UB, though we could argue that we know what our hardware is doing.

## [0] NVIDIA B200

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |      I32      |      2^16      |   9.204 us |       1.36% |   9.203 us |       1.33% |   -0.000 us |  -0.00% |   SAME   |
|   I8    |      I32      |      2^20      |  12.544 us |       7.68% |  12.484 us |       8.30% |   -0.061 us |  -0.49% |   SAME   |
|   I8    |      I32      |      2^24      |  30.458 us |       3.35% |  30.709 us |       3.35% |    0.252 us |   0.83% |   SAME   |
|   I8    |      I32      |      2^28      | 260.386 us |       0.44% | 263.420 us |       0.42% |    3.033 us |   1.16% |   SLOW   |
|   I8    |      I64      |      2^16      |  10.325 us |       9.84% |  10.575 us |       8.92% |    0.250 us |   2.42% |   SAME   |
|   I8    |      I64      |      2^20      |  12.665 us |       8.06% |  12.619 us |       8.42% |   -0.046 us |  -0.37% |   SAME   |
|   I8    |      I64      |      2^24      |  30.490 us |       3.29% |  30.666 us |       3.34% |    0.176 us |   0.58% |   SAME   |
|   I8    |      I64      |      2^28      | 261.445 us |       0.39% | 263.159 us |       0.42% |    1.714 us |   0.66% |   SLOW   |
|   I8    |      I64      |      2^32      |   3.967 ms |       0.03% |   3.998 ms |       0.03% |   30.912 us |   0.78% |   SLOW   |
|   I16   |      I32      |      2^16      |  11.254 us |       0.61% |  11.243 us |       0.82% |   -0.011 us |  -0.10% |   SAME   |
|   I16   |      I32      |      2^20      |  13.497 us |       4.85% |  13.536 us |       4.93% |    0.039 us |   0.29% |   SAME   |
|   I16   |      I32      |      2^24      |  31.348 us |       3.50% |  31.636 us |       3.45% |    0.288 us |   0.92% |   SAME   |
|   I16   |      I32      |      2^28      | 249.893 us |       0.48% | 252.240 us |       0.42% |    2.347 us |   0.94% |   SLOW   |
|   I16   |      I64      |      2^16      |  11.295 us |       3.47% |  11.272 us |       3.38% |   -0.023 us |  -0.20% |   SAME   |
|   I16   |      I64      |      2^20      |  13.394 us |       4.61% |  13.478 us |       5.31% |    0.085 us |   0.63% |   SAME   |
|   I16   |      I64      |      2^24      |  31.376 us |       3.55% |  31.444 us |       3.59% |    0.068 us |   0.22% |   SAME   |
|   I16   |      I64      |      2^28      | 250.058 us |       0.48% | 252.060 us |       0.46% |    2.002 us |   0.80% |   SLOW   |
|   I16   |      I64      |      2^32      |   3.745 ms |       0.10% |   3.781 ms |       0.10% |   35.866 us |   0.96% |   SLOW   |
|   I32   |      I32      |      2^16      |  11.256 us |       1.35% |  11.193 us |       3.40% |   -0.063 us |  -0.56% |   SAME   |
|   I32   |      I32      |      2^20      |  13.381 us |       5.33% |  13.512 us |       5.11% |    0.131 us |   0.98% |   SAME   |
|   I32   |      I32      |      2^24      |  37.631 us |       3.73% |  37.465 us |       3.76% |   -0.166 us |  -0.44% |   SAME   |
|   I32   |      I32      |      2^28      | 321.212 us |       0.50% | 321.215 us |       0.57% |    0.003 us |   0.00% |   SAME   |
|   I32   |      I64      |      2^16      |  11.207 us |       3.54% |  11.002 us |       4.50% |   -0.204 us |  -1.82% |   SAME   |
|   I32   |      I64      |      2^20      |  13.523 us |       5.47% |  13.440 us |       4.77% |   -0.083 us |  -0.61% |   SAME   |
|   I32   |      I64      |      2^24      |  37.718 us |       3.52% |  37.433 us |       3.83% |   -0.285 us |  -0.76% |   SAME   |
|   I32   |      I64      |      2^28      | 321.162 us |       0.51% | 320.992 us |       0.52% |   -0.171 us |  -0.05% |   SAME   |
|   I32   |      I64      |      2^32      |   5.082 ms |       1.23% |   5.077 ms |       1.18% |   -4.632 us |  -0.09% |   SAME   |
|   I64   |      I32      |      2^16      |  11.385 us |       5.20% |  11.564 us |       6.53% |    0.179 us |   1.57% |   SAME   |
|   I64   |      I32      |      2^20      |  15.421 us |       3.41% |  15.313 us |       2.99% |   -0.108 us |  -0.70% |   SAME   |
|   I64   |      I32      |      2^24      |  59.339 us |       1.91% |  58.512 us |       1.81% |   -0.827 us |  -1.39% |   SAME   |
|   I64   |      I32      |      2^28      | 694.100 us |       0.16% | 684.983 us |       0.17% |   -9.117 us |  -1.31% |   FAST   |
|   I64   |      I64      |      2^16      |  11.265 us |       0.69% |  11.262 us |       1.12% |   -0.003 us |  -0.03% |   SAME   |
|   I64   |      I64      |      2^20      |  15.360 us |       1.12% |  15.351 us |       0.88% |   -0.009 us |  -0.06% |   SAME   |
|   I64   |      I64      |      2^24      |  59.324 us |       1.85% |  58.559 us |       1.90% |   -0.765 us |  -1.29% |   SAME   |
|   I64   |      I64      |      2^28      | 694.754 us |       0.16% | 685.358 us |       0.14% |   -9.396 us |  -1.35% |   FAST   |
|   I64   |      I64      |      2^32      |  11.267 ms |       0.82% |  11.275 ms |       0.94% |    8.527 us |   0.08% |   SAME   |
|  I128   |      I32      |      2^16      |  13.249 us |       2.52% |  13.206 us |       3.29% |   -0.043 us |  -0.32% |   SAME   |
|  I128   |      I32      |      2^20      |  23.562 us |       1.41% |  23.578 us |       1.26% |    0.016 us |   0.07% |   SAME   |
|  I128   |      I32      |      2^24      | 170.798 us |       0.68% | 171.408 us |       0.69% |    0.610 us |   0.36% |   SAME   |
|  I128   |      I32      |      2^28      |   2.512 ms |       0.09% |   2.522 ms |       0.09% |    9.229 us |   0.37% |   SLOW   |
|  I128   |      I64      |      2^16      |  13.056 us |       5.10% |  13.040 us |       5.22% |   -0.016 us |  -0.12% |   SAME   |
|  I128   |      I64      |      2^20      |  23.472 us |       1.65% |  23.567 us |       1.40% |    0.094 us |   0.40% |   SAME   |
|  I128   |      I64      |      2^24      | 170.661 us |       0.65% | 171.610 us |       0.64% |    0.949 us |   0.56% |   SAME   |
|  I128   |      I64      |      2^28      |   2.513 ms |       0.10% |   2.523 ms |       0.09% |    9.124 us |   0.36% |   SLOW   |
|  I128   |      I64      |      2^32      |  40.041 ms |       0.02% |  40.188 ms |       0.02% |  146.920 us |   0.37% |   SLOW   |
|   F32   |      I32      |      2^16      |  10.843 us |       7.13% |  10.761 us |       6.08% |   -0.081 us |  -0.75% |   SAME   |
|   F32   |      I32      |      2^20      |  14.438 us |       6.97% |  14.251 us |       7.21% |   -0.188 us |  -1.30% |   SAME   |
|   F32   |      I32      |      2^24      |  38.979 us |       3.17% |  38.871 us |       3.18% |   -0.107 us |  -0.28% |   SAME   |
|   F32   |      I32      |      2^28      | 343.942 us |       0.34% | 342.615 us |       0.36% |   -1.327 us |  -0.39% |   FAST   |
|   F32   |      I64      |      2^16      |  10.642 us |       9.27% |  10.420 us |       9.23% |   -0.222 us |  -2.09% |   SAME   |
|   F32   |      I64      |      2^20      |  14.258 us |       7.21% |  14.170 us |       7.01% |   -0.088 us |  -0.62% |   SAME   |
|   F32   |      I64      |      2^24      |  38.869 us |       3.48% |  38.497 us |       3.13% |   -0.372 us |  -0.96% |   SAME   |
|   F32   |      I64      |      2^28      | 344.123 us |       0.34% | 342.464 us |       0.38% |   -1.659 us |  -0.48% |   FAST   |
|   F32   |      I64      |      2^32      |   5.234 ms |       0.37% |   5.213 ms |       0.31% |  -21.012 us |  -0.40% |   FAST   |
|   F64   |      I32      |      2^16      |  11.258 us |       0.60% |  11.124 us |       0.59% |   -0.134 us |  -1.19% |   FAST   |
|   F64   |      I32      |      2^20      |  15.383 us |       1.76% |  15.356 us |       1.34% |   -0.028 us |  -0.18% |   SAME   |
|   F64   |      I32      |      2^24      |  64.225 us |       1.45% |  63.659 us |       1.65% |   -0.566 us |  -0.88% |   SAME   |
|   F64   |      I32      |      2^28      | 785.037 us |       0.13% | 777.901 us |       0.14% |   -7.136 us |  -0.91% |   FAST   |
|   F64   |      I64      |      2^16      |  11.261 us |       0.96% |  11.026 us |       1.22% |   -0.235 us |  -2.09% |   FAST   |
|   F64   |      I64      |      2^20      |  15.365 us |       1.64% |  15.332 us |       0.48% |   -0.033 us |  -0.21% |   SAME   |
|   F64   |      I64      |      2^24      |  64.143 us |       1.46% |  63.736 us |       1.61% |   -0.407 us |  -0.64% |   SAME   |
|   F64   |      I64      |      2^28      | 784.877 us |       0.14% | 777.648 us |       0.13% |   -7.230 us |  -0.92% |   FAST   |
|   F64   |      I64      |      2^32      |  12.337 ms |       0.01% |  12.224 ms |       0.01% | -112.673 us |  -0.91% |   FAST   |
|   C32   |      I32      |      2^16      |  11.246 us |       1.68% |  11.257 us |       0.54% |    0.010 us |   0.09% |   SAME   |
|   C32   |      I32      |      2^20      |  15.467 us |       3.18% |  15.256 us |       1.15% |   -0.211 us |  -1.37% |   FAST   |
|   C32   |      I32      |      2^24      |  60.611 us |       1.91% |  58.837 us |       2.02% |   -1.775 us |  -2.93% |   FAST   |
|   C32   |      I32      |      2^28      | 683.020 us |       0.17% | 682.186 us |       0.17% |   -0.834 us |  -0.12% |   SAME   |
|   C32   |      I64      |      2^16      |  10.657 us |       8.73% |  11.013 us |       5.80% |    0.357 us |   3.35% |   SAME   |
|   C32   |      I64      |      2^20      |  15.474 us |       3.36% |  15.206 us |       1.75% |   -0.269 us |  -1.74% |   SAME   |
|   C32   |      I64      |      2^24      |  60.791 us |       2.15% |  58.853 us |       2.09% |   -1.938 us |  -3.19% |   FAST   |
|   C32   |      I64      |      2^28      | 683.336 us |       0.16% | 682.301 us |       0.18% |   -1.035 us |  -0.15% |   SAME   |
|   C32   |      I64      |      2^32      |  10.654 ms |       0.02% |  10.667 ms |       0.02% |   13.108 us |   0.12% |   SLOW   |
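
The early branch measured above can be pictured with the following sequential sketch. It assumes a simple left-to-right tiled scan, and all names (`tiled_tile_size`, `scan_full_tile`, `scan_partial_tile`, `inclusive_scan_tiled`) are hypothetical rather than taken from the diff. The point is that the full-tile decision is made once per tile, so the common path carries no per-step guard.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical tile size for illustration only.
constexpr std::size_t tiled_tile_size = 4;

// Branch-free path: every slot of the tile holds input data.
template <class T, class Op>
void scan_full_tile(T* tile, Op op)
{
  for (std::size_t i = 1; i < tiled_tile_size; ++i)
  {
    tile[i] = op(tile[i - 1], tile[i]);
  }
}

// Guarded path: only the first valid_items slots may be passed to op.
template <class T, class Op>
void scan_partial_tile(T* tile, std::size_t valid_items, Op op)
{
  for (std::size_t i = 1; i < valid_items; ++i)
  {
    tile[i] = op(tile[i - 1], tile[i]);
  }
}

template <class T, class Op>
std::vector<T> inclusive_scan_tiled(std::vector<T> data, Op op)
{
  for (std::size_t begin = 0; begin < data.size(); begin += tiled_tile_size)
  {
    const std::size_t valid_items = std::min(tiled_tile_size, data.size() - begin);
    T* tile = data.data() + begin;
    if (begin != 0)
    {
      tile[0] = op(data[begin - 1], tile[0]); // carry in the previous tile's total
    }
    // One branch per tile instead of one branch per scan step.
    if (valid_items == tiled_tile_size)
    {
      scan_full_tile(tile, op);
    }
    else
    {
      scan_partial_tile(tile, valid_items, op);
    }
  }
  return data;
}
```

Only the last tile can be partial, so in this sketch the guarded loop runs at most once per call.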

I wonder whether we should also branch early in the reduction squad.

Fixes: #8136

@bernhardmgruber bernhardmgruber requested a review from a team as a code owner March 26, 2026 12:48
@bernhardmgruber bernhardmgruber requested a review from miscco March 26, 2026 12:48
@github-project-automation github-project-automation bot moved this to Todo in CCCL Mar 26, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Mar 26, 2026
@bernhardmgruber bernhardmgruber force-pushed the warpspeed_fix_oob_scan_ob branch from 4058405 to c47bd23 Compare March 26, 2026 12:56
@bernhardmgruber bernhardmgruber requested a review from a team as a code owner March 26, 2026 12:56
@copy-pr-bot
Contributor

copy-pr-bot bot commented Mar 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

@miscco miscco left a comment


This is a ridiculous amount of complexity. I am praying that just initializing the values will work.

@bernhardmgruber
Contributor Author

> This is a ridiculous amount of complexity. I am praying that just initializing the values will work.

The diff appears larger than it is. I mostly introduce an early branch on whether the tile is full or partial, and then handle the partial tiles at the various scan steps.

@miscco
Contributor

miscco commented Mar 30, 2026

> > This is a ridiculous amount of complexity. I am praying that just initializing the values will work.
>
> The diff appears larger than it is. I mostly introduce an early branch on whether the tile is full or partial, and then handle the partial tiles at the various scan steps.

True, and in line with the old lookback implementations

@bernhardmgruber
Contributor Author

I discussed this briefly with @gevtushenko, and he suggested letting the scan operator run on invalid data for known operators and data types, so that we don't need to eat any regression.

@miscco
Contributor

miscco commented Mar 31, 2026

I am pretty sure I had a branch for that on GitLab.

@bernhardmgruber bernhardmgruber force-pushed the warpspeed_fix_oob_scan_ob branch from dac0091 to cf730c0 Compare March 31, 2026 07:17

Comment on lines +265 to +266
// if we have an identity, just fill the out-of-bounds items with it and use the full warp scan, since it's faster
if constexpr (cuda::has_identity_element_v<ScanOpT, Tp>)
Contributor


Question: This only works for those scan operations that have an identity element.

Would it be possible to store the final valid item and use that as the fill value for all the scan operations that do not have an identity element?

Contributor Author


I figured that if the scan operator does not have an identity, it is probably user-provided and could be complex. The code path where we prefill with the identity is about 1% faster than the one where we just guard the call to the scan operator. I would just take the guarded path for any operator we don't know.

Also, which value do you mean by "final valid item"?
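
The dispatch discussed in this thread can be sketched as follows. The trait `known_identity` is a hypothetical stand-in for the `cuda::has_identity_element_v` check quoted from the diff, and `padded_size` and `scan_last_tile` are illustrative names. Operators with a known identity take the prefill path, and everything else takes the guarded path.

```cpp
#include <cstddef>
#include <functional>
#include <type_traits>
#include <vector>

// Hypothetical trait: operators have no known identity unless specialized.
template <class Op, class T>
struct known_identity : std::false_type
{};

// std::plus has the identity T{} (zero for arithmetic types).
template <class T>
struct known_identity<std::plus<T>, T> : std::true_type
{
  static constexpr T element = T{};
};

// Hypothetical tile size for illustration only.
constexpr std::size_t padded_size = 8;

// Precondition: in.size() <= padded_size.
template <class T, class Op>
std::vector<T> scan_last_tile(const std::vector<T>& in, Op op)
{
  if constexpr (known_identity<Op, T>::value)
  {
    // Known operator: fill the out-of-bounds slots with the identity and
    // run the full, unguarded scan.
    std::vector<T> tile(padded_size, known_identity<Op, T>::element);
    for (std::size_t i = 0; i < in.size(); ++i)
    {
      tile[i] = in[i];
    }
    for (std::size_t i = 1; i < padded_size; ++i)
    {
      tile[i] = op(tile[i - 1], tile[i]);
    }
    tile.resize(in.size());
    return tile;
  }
  else
  {
    // Unknown operator: never call op on a slot without input data.
    std::vector<T> tile(in);
    for (std::size_t i = 1; i < padded_size; ++i)
    {
      if (i < tile.size())
      {
        tile[i] = op(tile[i - 1], tile[i]);
      }
    }
    return tile;
  }
}
```

Calling `scan_last_tile(values, std::plus<int>{})` selects the prefill path at compile time, while a lambda or any other unrecognized functor falls back to the guarded path.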

Contributor


I mean `data[valid_items - 1]`. We are already loading that anyhow, so we should be able to just use it.


@bernhardmgruber bernhardmgruber force-pushed the warpspeed_fix_oob_scan_ob branch from ee4b7f0 to cd6c9ef Compare April 1, 2026 20:04
@github-actions
Contributor

github-actions bot commented Apr 1, 2026

😬 CI Workflow Results

🟥 Finished in 3h 21m: Pass: 99%/300 | Total: 12d 00h | Max: 3h 02m | Hits: 67%/267664

See results here.



Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[BUG] warpspeed scan causes OOB reads in some Thrust tests

2 participants