Avoid passing uninitialized values to scan_op#8184
Avoid passing uninitialized values to scan_op#8184bernhardmgruber wants to merge 14 commits intoNVIDIA:mainfrom
Conversation
4058405 to
c47bd23
Compare
This comment has been minimized.
This comment has been minimized.
miscco
left a comment
There was a problem hiding this comment.
This is an ridiculous complexity. I am praying that just initializing the values will work
The diff appears larger than it is. I mostly introduce an early branch if the tile is full or partial and then handle the partial tiles at the various scan steps. |
True, and in line with the old lookback implementations |
|
I discussed this with @gevtushenko briefly and he suggested to let the scan operator have invalid data for known operators and data types, so we don't need to eat any regression. |
|
I am pretty sure I had a branch for that in gitlab |
dac0091 to
cf730c0
Compare
This comment has been minimized.
This comment has been minimized.
| // if we have an identity, just fill the out-of-bounds items with it and use the full warp scan, since it's faster | ||
| if constexpr (cuda::has_identity_element_v<ScanOpT, Tp>) |
There was a problem hiding this comment.
Question: This only works for those scan operations that have an identity element.
Would it be possible to store the final valid item and use that for all the scan operations that do not have a valid item?
There was a problem hiding this comment.
I figured if the scan operator does not have an identity, then it's maybe something user provided and could be complex. The code path where we prefill with the identity is about 1% faster than when we just guard calling the scan operator. I would just take the guarded path for any operator we don't know.
Also, which value do you mean with final valid item?
There was a problem hiding this comment.
I mean data[valid_items -1] We are already loading that anyhow, so we should be able to just use that
This comment has been minimized.
This comment has been minimized.
ee4b7f0 to
cd6c9ef
Compare
😬 CI Workflow Results🟥 Finished in 3h 21m: Pass: 99%/300 | Total: 12d 00h | Max: 3h 02m | Hits: 67%/267664See results here. |
This PR adds the necessary checks to not pass uninitialized data to the scan operator in warpspeed scan.
There are two approaches:
Approach 1. is proposed by #8134, while this PR initially aimed for approach 2, but is now a mix of both.
I tried guarding the the individual scan steps, but it led to significant regressions:
Details
So I switched to an approach that branches earlier based on whether we are handling the last tile or not. This takes a <1% hit for I8, I16, and I128. I think that's ok. The only possible way to avoid this regression now is to allow the scan operator to be executed on garbage data, which is formally UB, but we could argue that we know what our hardware is doing.
I wonder whether we should also branch early in the reduction squad.
Fixes: #8136