[FEA]: cccl.c and cuda.parallel should support indirect_iterator_t which can be advanced on both host and device to support streaming algorithms #4148
Labels
feature request
Area
cuda.parallel (Python)
Is your feature request related to a problem? Please describe.
To attain optimal performance, kernels for some algorithms must use 32-bit types to store problem-size arguments. Supporting these algorithms for problem sizes in excess of `INT_MAX` can be done with a streaming approach, with the streaming logic encoded in the algorithm's dispatcher. The dispatcher needs to increment iterators on the host. This is presently not supported by `cccl.c.parallel`, since `indirect_arg_t` does not implement an increment operator.
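For context, a minimal sketch of the streaming pattern the dispatcher would need, assuming a caller-supplied launch callable (`launch_partial` and `dispatch_streaming` are illustrative names, not part of cccl.c):

```cpp
#include <algorithm>
#include <climits>
#include <cstdint>

// Hypothetical streaming dispatch loop: each kernel launch sees a
// 32-bit partition size, while the host walks the full 64-bit range,
// advancing the iterators between launches.
template <class InputIt, class OutputIt, class LaunchFn>
void dispatch_streaming(InputIt in, OutputIt out, std::uint64_t num_items, LaunchFn launch_partial)
{
  constexpr std::uint64_t max_partition = INT_MAX; // largest size a 32-bit kernel accepts
  for (std::uint64_t offset = 0; offset < num_items; offset += max_partition)
  {
    auto partition = static_cast<std::int32_t>(std::min(max_partition, num_items - offset));
    launch_partial(in, out, partition); // 32-bit kernel launch, supplied by the caller
    in  += partition; // host-side advance: the operation this issue asks cccl.c to support
    out += partition;
  }
}
```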
Since `indirect_arg_t` is used to represent `cccl_value_t`, `cccl_operation_t`, and `cccl_iterator_t`, and incrementing only makes sense for iterators, a dedicated type `indirect_iterator_t` must be introduced, which may implement `operator+=`.
If the entirety of the iterator state is user-defined, `cuda.parallel` must provide a host function pointer to increment the iterator's state, obtained by compiling the `advance` function for the host.
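A minimal sketch of what that type could look like; the layout and the `host_advance` member are assumptions for illustration, not the existing cccl.c ABI:

```cpp
#include <cstdint>

// Hypothetical indirect_iterator_t: wraps type-erased, user-defined
// iterator state plus a host-compiled `advance` function, so the
// dispatcher can increment the iterator between kernel launches.
struct indirect_iterator_t
{
  void* state;                                        // opaque user-defined iterator state
  void (*host_advance)(void* state, std::uint64_t n); // user's `advance`, compiled for the host

  indirect_iterator_t& operator+=(std::uint64_t n)
  {
    host_advance(state, n); // delegate to the user-supplied host function
    return *this;
  }
};
```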
Alternatively, if we define the state as a struct that contains a `size_t linear_id` in addition to the user-defined state, we could get rid of the user-defined `advance` function altogether, but we would need to provide access to `linear_id` to the `dereference` function.
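A sketch of that alternative, where the wrapper owns the `linear_id` so no user-compiled host `advance` is needed; the struct name and the `dereference` signature shown are assumptions:

```cpp
#include <cstddef>

// Hypothetical iterator state for the second approach: the wrapper
// tracks linear_id itself, so host-side advance is trivial and no
// user-defined host `advance` function is required.
struct indirect_iterator_state_t
{
  std::size_t linear_id; // maintained by the library, incremented on the host
  void* user_state;      // opaque user-defined state, only touched on device

  indirect_iterator_state_t& operator+=(std::size_t n)
  {
    linear_id += n; // no user code needed to advance on the host
    return *this;
  }
};

// The device-side dereference would then need linear_id as an input,
// e.g. (illustrative signature): value = dereference(user_state, linear_id);
```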
Both approaches need to be prototyped and compared.
Describe the solution you'd like
The solution should unblock #3764
Additional context
#3764 (comment)