[KHR] Add sycl_khr_max_num_work_groups extension #712
base: main
Conversation
= sycl_khr_max_num_work_groups

This extension allows developers to query the iteration bounds in each dimension for an ND-range or basic range kernel.
The implementation guarantees the execution of an ND-range kernel if its global size is less than or equal to `max_num_work_groups_nd_range<N>` in each dimension. The same condition applies to basic range kernels with `max_num_work_groups_range<N>`.
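The guarantee above is a plain per-dimension comparison. A minimal sketch, where `fits_launch_bounds` and the maxima are hypothetical stand-ins for the values the extension's queries would return (not part of the extension itself):

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper mirroring the extension's guarantee: a launch is
// valid iff the global size is <= the queried maximum in every
// dimension. In practice the maxima would come from the proposed
// device query rather than being passed in by hand.
template <std::size_t N>
bool fits_launch_bounds(const std::array<std::size_t, N>& global_size,
                        const std::array<std::size_t, N>& max_size) {
    for (std::size_t i = 0; i < N; ++i)
        if (global_size[i] > max_size[i])
            return false;
    return true;
}
```

For example, a `{1024, 1024}` global size fits under `{65535, 65535}` limits, while any dimension exceeding its limit makes the launch invalid.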
I don't think it makes sense to expose this query separately for ND-range and basic (range) kernels.
According to Section 3.7.2, SYCL kernel execution mode, work-groups only exist for ND-range kernels. A basic kernel is launched only with a number of work-items -- there is no way to specify the number of work-groups to use, nor to query the number of work-groups used by the implementation.
This is why the DPC++ extension only defines `max_work_groups` for ND-range kernels. I think we should align with that design if we're going to pursue this feature as a KHR.
Indeed, it doesn't make sense to talk about "work-groups" for a basic kernel.
Still, the user cannot check the bounds of a basic range kernel before submission, and I think we should be able to query that, so the assertion still remains valid.
Maybe a renaming: `max_num_work_groups_range<N>` --> `max_global_size_range<N>` or `max_basic_range<N>`? Or maybe this value already exists in the specification?
There is not a device-level query, but users can already query the maximum range for a specific kernel using `info::kernel_device_specific::global_work_size`, which is defined in Table 135.
I updated the PR by removing `max_num_work_groups_range`.
I also renamed `max_num_work_groups_nd_range<N>` --> `max_num_work_groups<N>`, since it's implicit that it is the maximum for an ND-range kernel (what about hierarchical kernels, should it be the same?).
Let me know if there should be other changes.
> There is not a device-level query, but users can already query the maximum range for a specific kernel using `info::kernel_device_specific::global_work_size`

I totally forgot about this query. Sorry, @Aympab! Looking at the table, it always returns a `range<3>`. So if I want to submit a `range<1>`, I don't know the math to linearize it. Should we then add new queries to `kernel_device_specific`?
But no user code seems to be using it, and I don't know what that means. Either kernel bundles scare people, people don't know the API, or people don't need it at all (i.e. they either never use `range`, or all implementations support a large enough iteration space by default, or they have a different way of computing it).
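To make the linearization question concrete, here is a sketch of two plausible readings, with both helpers being hypothetical illustrations rather than any SYCL API: given the `range<3>` maximum from `info::kernel_device_specific::global_work_size`, a 1-D bound could be the first dimension alone (safe if a 1-D launch maps onto a single backend dimension) or the product of all three (only valid if the implementation folds a 1-D launch across dimensions).

```cpp
#include <array>
#include <cstdint>
#include <limits>

// Conservative reading: a 1-D launch only uses the first dimension.
std::uint64_t bound_first_dim(const std::array<std::uint64_t, 3>& max3) {
    return max3[0];
}

// Permissive reading: the implementation may fold a 1-D launch across
// all three dimensions, so the product is the limit. Saturates instead
// of overflowing, since real per-dimension limits can be large.
std::uint64_t bound_product(const std::array<std::uint64_t, 3>& max3) {
    std::uint64_t p = 1;
    for (auto d : max3) {
        if (d != 0 && p > std::numeric_limits<std::uint64_t>::max() / d)
            return std::numeric_limits<std::uint64_t>::max();
        p *= d;
    }
    return p;
}
```

The two readings can differ by orders of magnitude, which is exactly why the spec would need to say which mapping an implementation uses.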
This PR proposes to add to the specification a device descriptor for the maximum number of work-groups that can be submitted to a `range` or `nd_range` parallel for: `max_num_work_groups_nd_range<N>` and `max_num_work_groups_range<N>` for N = 1, 2 or 3. The device query returns a tuple `id<N>` with the boundaries in each dimension.

Justification
In the current revision of the spec, there is no limit on the iteration size that can be submitted to a parallel for, supposedly meaning that any number should be valid. But when actually running with large sizes, backend-related issues emerge.
Users rely on these values to check kernel boundaries and often have to hard-code them, for example Kokkos developers with the SYCL backend (see this PR) or this implementation of blocking/streaming kernels.
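The blocking/streaming pattern such projects use can be sketched in a backend-agnostic way. `split_range` below is a hypothetical helper, not code from Kokkos or the linked implementations: it cuts a large 1-D iteration space into chunks no larger than some known maximum (today hard-coded; with this extension, queried), each of which would then be submitted as its own parallel for with an offset.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical blocking helper: split `total` iterations into
// (offset, size) chunks, each at most `max_per_launch` iterations, so
// every chunk stays inside the device's known launch bound.
std::vector<std::pair<std::uint64_t, std::uint64_t>>
split_range(std::uint64_t total, std::uint64_t max_per_launch) {
    std::vector<std::pair<std::uint64_t, std::uint64_t>> chunks;
    for (std::uint64_t off = 0; off < total; off += max_per_launch)
        chunks.push_back({off, std::min(max_per_launch, total - off)});
    return chunks;
}
```

For example, splitting 10 iterations with a maximum of 4 yields chunks (0,4), (4,4), (8,2).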
The query is already available for all GPU backends:

- CUDA: `CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_[X,Y,Z]`
- HIP: `hipDeviceAttributeMaxGridDim[X,Y,Z]`
- Level Zero: `ze_device_properties_t::maxGroupCount[X,Y,Z]`

DPC++ already implements something similar as an extension.
Notes
There is a distinction for
nd_range
orrange
for multiple reasons:For N=3, the mapping is straightforward, as it directly queries the backend functions with
X, Y, Z
. The mapping for N=1 and N=2 is unclear, implementers could choose the minimum ofX,Y,Z
or compute a product of other dimensions, for example.Although this PR is largely inspired by DPC++ extension, they initially proposed
max_global_work_groups
which is not actually queryable with CUDA/HIP which is why it is not proposed here.
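The N=1 and N=2 ambiguity mentioned above can be made concrete. The two helpers below are illustrative only, with hypothetical names and policies (the extension would have to pick and specify one mapping): they derive the N=1 and N=2 descriptor values from the backend's three per-dimension grid limits X, Y, Z.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// One possible N=1 mapping: the minimum of X, Y, Z, which is valid no
// matter which backend dimension a 1-D launch lands on.
std::uint64_t max_groups_1d(std::uint64_t x, std::uint64_t y,
                            std::uint64_t z) {
    return std::min({x, y, z});
}

// One possible N=2 mapping: keep the first backend dimension for dim 0
// and be conservative for dim 1.
std::array<std::uint64_t, 2>
max_groups_2d(std::uint64_t x, std::uint64_t y, std::uint64_t z) {
    return {x, std::min(y, z)};
}
```

Different choices give usefully different guarantees (the minimum is always safe but pessimistic), so whichever mapping is chosen needs to be stated normatively in the extension text.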