
[KHR] Add sycl_khr_max_num_work_groups extension #712

Open · wants to merge 5 commits into main

Conversation

@Aympab Aympab commented Feb 11, 2025

This PR proposes adding to the specification device descriptors for the maximum number of work-groups that can be submitted to a range or nd_range parallel_for: max_num_work_groups_nd_range<N> and max_num_work_groups_range<N> for N = 1, 2, or 3. The device query returns an id<N> containing the bound in each dimension.
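As a sketch of how a user might consume the proposed query, the following plain C++ helper models the intended bounds check: a launch is valid only if its work-group count in every dimension stays within the queried limit. The function name is hypothetical, and the limits would come from the proposed descriptor rather than being passed in directly:

```cpp
#include <array>
#include <cstddef>

// Hypothetical bounds check modelling the proposed query: an nd_range
// launch is assumed valid only if the number of work-groups it needs in
// every dimension is within the id<N> the device query would return.
template <std::size_t N>
bool fits_device_limits(const std::array<std::size_t, N>& global_size,
                        const std::array<std::size_t, N>& local_size,
                        const std::array<std::size_t, N>& max_groups) {
    for (std::size_t d = 0; d < N; ++d) {
        // Work-groups needed in dimension d (rounded up).
        std::size_t groups =
            (global_size[d] + local_size[d] - 1) / local_size[d];
        if (groups > max_groups[d]) return false;
    }
    return true;
}
```

In real code, `max_groups` would be filled from `device.get_info<...max_num_work_groups_nd_range<N>>()` before deciding whether to submit the kernel as-is or fall back to a blocked launch.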

Justification

  • In the current revision of the spec, there is no limit on the iteration space size submitted to a parallel_for, implying that any size should be valid. In practice, launching with large sizes triggers backend-related failures.

  • Users rely on these values to check kernel bounds and often have to hard-code them; for example, Kokkos developers with the SYCL backend (see this PR) or this implementation of blocking/streaming kernels.

  • The query is already available for all GPU backends:

    • CUDA: CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_[X,Y,Z]
    • HIP: hipDeviceAttributeMaxGridDim[X,Y,Z]
    • Level Zero: ze_device_properties_t::maxGroupCount[X,Y,Z]
  • DPC++ already implements something similar as an extension

Notes

  • There are separate queries for nd_range and range for multiple reasons:

    • The semantics of range and nd_range differ, which might affect the maximum size of the iteration space
    • Implementers can map basic range kernels through higher-level optimization strategies, or map them directly onto the lower-level backend limits
  • For N=3, the mapping is straightforward, as it directly queries the backend functions with X, Y, Z. The mapping for N=1 and N=2 is less clear: implementers could choose the minimum of X, Y, Z, or compute a product of the dimensions, for example.

  • Although this PR is largely inspired by the DPC++ extension, that extension initially proposed max_global_work_groups, which is not actually queryable with CUDA/HIP; that is why it is not proposed here.

    • As a user, the main concern is ensuring that the kernel’s iteration range does not exceed the maximum values for each dimension. The rest should be implementation-defined (e.g., whether the mapping is direct or if blocking/streaming is used).
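The N=1 mapping ambiguity in the notes above can be made concrete. Both candidate reductions of the 3-D backend limits to a single 1-D limit are trivial to compute; the function names are illustrative, not part of the proposal:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Two candidate ways an implementation could derive a 1-D limit from the
// per-dimension 3-D backend limits (e.g. CUDA's max grid dimensions).

// Most conservative: bound by the most restrictive dimension.
std::uint64_t limit_1d_min(const std::array<std::uint64_t, 3>& max3) {
    return *std::min_element(max3.begin(), max3.end());
}

// Most permissive: assume the implementation can fold a 1-D iteration
// space across all three hardware dimensions.
std::uint64_t limit_1d_product(const std::array<std::uint64_t, 3>& max3) {
    return max3[0] * max3[1] * max3[2];
}
```

With typical CUDA-like limits (a very large X and much smaller Y and Z), the two choices differ by orders of magnitude, which is why the PR leaves the N=1 and N=2 mappings implementation-defined.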


CLAassistant commented Feb 11, 2025

CLA assistant check
All committers have signed the CLA.

= sycl_khr_max_num_work_groups

This extension allows developers to query the iteration bounds in each dimension for an ND-range or basic range kernel.
The implementation guarantees execution of an ND-range kernel if its global size is less than or equal to `max_num_work_groups_nd_range<N>` in each dimension. The same condition applies to basic range kernels with `max_num_work_groups_range<N>`.
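For illustration only (this sketch is not part of the extension text): when a range exceeds the queried limit, a user can fall back to the blocking/streaming approach mentioned in the PR justification by splitting the 1-D iteration space into chunks that each fit. The helper name and chunking strategy are hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split a 1-D iteration space [0, total) into consecutive
// (offset, length) chunks of at most max_size items, so each chunk
// can be submitted as a separate launch within the device limit.
std::vector<std::pair<std::size_t, std::size_t>>
split_into_chunks(std::size_t total, std::size_t max_size) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    for (std::size_t offset = 0; offset < total; offset += max_size) {
        chunks.emplace_back(offset, std::min(max_size, total - offset));
    }
    return chunks;
}
```

Each chunk would then be submitted as its own parallel_for, with the kernel adding the chunk offset to its index.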
Contributor

I don't think it makes sense to expose this query separately for ND-range and basic (range) kernels.

According to Section 3.7.2, SYCL kernel execution mode, work-groups only exist for ND-range kernels. A basic kernel is launched only with a number of work-items -- there is no way to specify the number of work-groups to use, nor to query the number of work-groups used by the implementation.

This is why the DPC++ extension only defines max_work_groups for ND-range kernels. I think we should align with that design if we're going to pursue this feature as a KHR.

Author

Indeed it doesn't make sense to talk about "work-groups" for basic kernels.
Still, the user cannot check the bounds of a basic range kernel before submission, and I think we should be able to query that, so the point still stands.

Maybe a renaming max_num_work_groups_range<N> --> max_global_size_range<N> or max_basic_range<N>? Or maybe this value already exists in the specification?

Contributor

There is not a device-level query, but users can already query the maximum range for a specific kernel using info::kernel_device_specific::global_work_size, which is defined in Table 135.

Author

I updated the PR by removing max_num_work_groups_range.
I also renamed max_num_work_groups_nd_range<N> --> max_num_work_groups<N> since it's implicit that it is the maximum for an ND-Range kernel (what about hierarchical kernels, should it be the same?)

Let me know if there should be other changes

Contributor

@TApplencourt commented Feb 12, 2025

There is not a device-level query, but users can already query the maximum range for a specific kernel using info::kernel_device_specific::global_work_size

I totally forgot about this query. Sorry, @Aympab! Looking at the table, it always returns range<3>. So if I want to submit a range<1>, I don't know the math to linearize it. Should we then add new queries to `kernel_device_specific`?

But no user code seems to be using it; I don't know what that means. Either kernel bundles scare people, people don't know the API, or people don't need it at all (i.e., they never use range, every implementation supports a large enough iteration space by default, or they have a different way of computing it).
