41 changes: 41 additions & 0 deletions docs/cub/api_docs/device_wide.rst
@@ -11,6 +11,47 @@ Device-Wide Primitives
   ../api/device


Determining Temporary Storage Requirements
++++++++++++++++++++++++++++++++++++++++++++++++++

**Contributor:**

Would be nice to also mention the single-phase API.

**Two-Phase API** (Traditional)

Most CUB device-wide algorithms follow a two-phase usage pattern:

1. **Query Phase**: Call the algorithm with ``d_temp_storage = nullptr`` to determine the required temporary storage size
2. **Execution Phase**: Allocate the temporary storage and call the algorithm again to perform the actual operation

**What arguments are needed during the query phase?**

* **Required**: Data types (via template parameters and iterator types) and problem size (``num_items``)
* **Can be nullptr/uninitialized**: All input/output pointers (``d_in``, ``d_out``, etc.)
* **Note**: The algorithm does not access input data during the query phase
**Contributor:**
Are we actually providing this guarantee? Can you point at a place from which you derive this fact?

Hypothetical scenario: we could determine the temporary storage size based on the alignment of another input pointer. AFAIK we don't do that, but currently we could.

However, since we seem to be vague about what's required of all parameters that don't take part in the query phase, maybe we should just define what's being suggested here. But that would require broader approval and probably a review of all existing APIs.

@gevtushenko what do you think?

**Contributor Author:**

@bernhardmgruber Thanks for flagging this great point. I re-audited the device-wide dispatch layer to make sure we're not overpromising. Every dispatcher we ship (`dispatch_reduce*.cuh`, `dispatch_scan*.cuh`, `dispatch_select_if.cuh`, `dispatch_histogram.cuh`, `dispatch_radix_sort.cuh`, `dispatch_merge*.cuh`, `dispatch_rle.cuh`, `dispatch_unique_by_key.cuh`, `dispatch_three_way_partition.cuh`, `dispatch_topk.cuh`, `dispatch_adjacent_difference.cuh`, `dispatch_batch_memcpy.cuh`) exits immediately when `d_temp_storage == nullptr`: no kernels launch and no user pointers are dereferenced. I've updated the "What arguments are needed during the query phase?" bullets to call that out explicitly and list the audited dispatchers. Please let me know if you'd like me to add anything else or tighten the wording further.

**Contributor:**

I don't think we should list the specific implementations, but rather provide a general guarantee.

@gevtushenko can we agree that any arguments, except for the temporary storage pointer and size reference, are not inspected during a size query call of a CUB device API?

**Contributor Author:**

@bernhardmgruber Thanks! I’ve trimmed the doc to state the general guarantee. Happy to adjust further once we hear back from @gevtushenko.

**Collaborator:**

@Aminsed thank you for the contribution!

It's long overdue on our end to provide an overview of the two-phase interface. We do receive questions regarding it, and I'm glad there'll be a single source of truth going forward. Access to the iterators during the temp storage query stage is one of those frequent questions, so we should document it.

**Iterator Access**

> Are we actually providing this guarantee? Can you point at a place from which you derive this fact?

Each device-level algorithm says something like "the required allocation size is written to `temp_storage_bytes` and no work is done". The only reasonable way to define "work" in the context of CUB is something involving the driver: kernel invocation, data movement, memory space queries, etc.

Users frequently have to estimate the temporary storage size before they have data, so it's common and valid to estimate temp storage with `int *d_in;` (which can be uninitialized, `nullptr`, etc. in user code), and that would break if we tried accessing iterators. So even if accessing these iterators made sense, we don't have the freedom to start doing that for existing algorithms. If we ever discover a use case requiring access to iterators at the query stage, I'm confident we'd have to go through a guarantees API for the user to explicitly tell us it's fine to do so.
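For illustration, a minimal sketch of that pattern (`max_num_items` is a hypothetical upper bound supplied by the user):

```cpp
// Budget temporary storage before any device data exists.
int*   d_in  = nullptr;   // carries type information only; never dereferenced
int*   d_out = nullptr;
void*  d_temp_storage     = nullptr;
size_t temp_storage_bytes = 0;

cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                       d_in, d_out, max_num_items);
// temp_storage_bytes can now drive an allocator or memory-pool reservation.
```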

So the only comment I'd have is that the current phrasing is a bit limiting: "The algorithm does not access input data during the query phase". It shouldn't be specific to input data, since we won't access output data either. Maybe we can rephrase it along the lines of driver use / work, or at least relax the phrasing to "does not access iterators".

**Suggestions**

**Doxygen Alias**

To make this section more easily discoverable, consider adding a reference to it in the doxygen aliases:

    ALIASES += "devicestorage=When ``d_temp_storage`` is ``nullptr``, no work is done and the required allocation size is returned in ``temp_storage_bytes``."

This way, a link to this section will appear on each algorithm using `@devicestorage`. If you'd like to keep the PR scope smaller, feel free to file an issue so that this can be addressed later; just make sure to link the issue here.

**Current GPU**

Another underspecified aspect is that the current GPU probably shouldn't change between the phases. Say, in reduce, the temp storage size depends on occupancy, which might vary between GPUs. For radix sort, some architectures might use the onesweep approach, while others would use the legacy scheme. I think it's safer to require that the same current GPU is used between phases by default, and to relax this on a per-algorithm basis when needed. @elstehle, @bernhardmgruber what do you think?
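To illustrate the concern, a hedged sketch of the failure mode (assume `d_in`, `d_out`, and `num_items` describe a real problem; whether this is allowed is exactly what's under discussion):

```cpp
// Query on GPU 0: the returned size may reflect GPU 0's architecture
// and occupancy characteristics.
cudaSetDevice(0);
void* d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// Questionable: switching devices between the phases. temp_storage_bytes
// was computed for GPU 0, but the algorithm now runs on GPU 1.
cudaSetDevice(1);
cudaMalloc(&d_temp_storage, temp_storage_bytes);
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
```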


Example pattern:

.. code-block:: c++

    // Phase 1: query the required temporary storage size.
    // Typed null pointers suffice: only the iterator types and num_items
    // are inspected; the data they point to is never accessed.
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;
    int*   d_in               = nullptr;  // real device data comes later
    int*   d_out              = nullptr;

    cub::DeviceReduce::Sum(
        d_temp_storage, temp_storage_bytes,
        d_in, d_out, num_items);

    // Allocate the temporary storage.
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    // ... allocate and fill d_in, allocate d_out ...

    // Phase 2: run the actual algorithm with real pointers.
    cub::DeviceReduce::Sum(
        d_temp_storage, temp_storage_bytes,
        d_in, d_out, num_items);

**Single-Phase API** (Environment-Based)

Some algorithms provide environment-based overloads that eliminate the two-phase call pattern.
These APIs accept an execution environment parameter. See the individual algorithm documentation for availability.
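For orientation, a hedged sketch of what a single-phase call can look like. The exact overload set and the environment types accepted vary by algorithm and CCCL version, so treat the signature below as an assumption and consult the algorithm's reference page:

.. code-block:: c++

    // Hedged sketch, not a verbatim API: assumes this algorithm offers a
    // single-phase overload of the form Sum(d_in, d_out, num_items, env).
    #include <cub/device/device_reduce.cuh>
    #include <cuda/stream_ref>

    void sum_single_phase(const int* d_in, int* d_out, int num_items, cudaStream_t s)
    {
        // The environment here is just a stream; richer environments (e.g.
        // memory resources or determinism requirements) may also be accepted.
        cub::DeviceReduce::Sum(d_in, d_out, num_items, cuda::stream_ref{s});
    }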

CUB device-level single-problem parallel algorithms:

* :cpp:struct:`cub::DeviceAdjacentDifference` computes the difference between adjacent elements residing within device-accessible memory