[DOC] Add temp_storage_bytes usage guide #6208
@@ -11,6 +11,47 @@ Device-Wide Primitives

../api/device

Determining Temporary Storage Requirements
++++++++++++++++++++++++++++++++++++++++++++++++++

**Two-Phase API** (Traditional)

Most CUB device-wide algorithms follow a two-phase usage pattern:

1. **Query Phase**: Call the algorithm with ``d_temp_storage = nullptr`` to determine the required temporary storage size.
2. **Execution Phase**: Allocate the storage and call the algorithm again to perform the actual operation.

**What arguments are needed during the query phase?**

* **Required**: Data types (via template parameters and iterator types) and problem size (``num_items``)
* **Can be nullptr/uninitialized**: All input/output pointers (``d_in``, ``d_out``, etc.)
* **Note**: The algorithm does not access input data during the query phase
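The two-phase pattern described above can be sketched as follows, using ``cub::DeviceReduce::Sum`` as a representative algorithm (a minimal sketch; error checking and input initialization are omitted for brevity):

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstddef>

int main()
{
  int const num_items = 1024;
  int *d_in{};
  int *d_out{};
  cudaMalloc(&d_in, num_items * sizeof(int));
  cudaMalloc(&d_out, sizeof(int));

  // Phase 1 (query): d_temp_storage is nullptr, so no work is done;
  // the required allocation size is written to temp_storage_bytes.
  void *d_temp_storage = nullptr;
  std::size_t temp_storage_bytes = 0;
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                         d_in, d_out, num_items);

  // Phase 2 (execution): allocate the storage and run the algorithm again
  // with the same arguments to perform the actual reduction.
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                         d_in, d_out, num_items);

  cudaFree(d_temp_storage);
  cudaFree(d_in);
  cudaFree(d_out);
}
```

Note that ``d_in`` and ``d_out`` could just as well be null during the query call, since no input data is accessed in that phase.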
Suggested Doxygen alias:

    ALIASES += "devicestorage=When ``d_temp_storage`` is ``nullptr``, no work is done and the required allocation size is returned in ``temp_storage_bytes``."

This way, a link to this section will appear on each algorithm that uses @devicestorage. If you'd like to keep the PR scope smaller, feel free to file an issue so this can be addressed later; just make sure to link the issue here.
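If such an alias were added, an algorithm's documentation comment could reference it like this (a hypothetical sketch of a doc comment, not an existing one in the codebase):

```cpp
//! @brief Computes a device-wide sum.
//!
//! @devicestorage  // expands to the standard note that a nullptr
//!                 // d_temp_storage triggers the query phase
```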
Current GPU

Another underspecified aspect is that the current GPU probably shouldn't change between the phases. For example, in reduce, the temporary storage size depends on occupancy, which might vary between GPUs. For radix sort, some architectures might use the onesweep approach, while others would use the legacy scheme. I think it's safer to require by default that the same current GPU be used in both phases, and relax that on a per-algorithm basis when needed. @elstehle, @bernhardmgruber what do you think?
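The hazard described here can be sketched as follows (a hypothetical misuse, assuming ``cub::DeviceReduce::Sum`` and two visible devices; error checking omitted):

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstddef>

// Sketch of the hazard: the queried size is only guaranteed to be
// valid for the device that was current during the query phase.
void two_phase_across_devices(int *d_in, int *d_out, int num_items)
{
  std::size_t temp_storage_bytes = 0;

  cudaSetDevice(0); // query phase runs with GPU 0 current
  cub::DeviceReduce::Sum(nullptr, temp_storage_bytes,
                         d_in, d_out, num_items);

  cudaSetDevice(1); // different GPU: occupancy or algorithm selection
                    // may differ, so temp_storage_bytes may be invalid
  void *d_temp_storage = nullptr;
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                         d_in, d_out, num_items);
  cudaFree(d_temp_storage);
}
```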
It would be nice to also mention the single-phase API.