
[KHR] Add sycl_khr_max_num_work_groups extension #712

Open · wants to merge 5 commits into main

Conversation

@Aympab Aympab commented Feb 11, 2025

This PR proposes adding to the specification device descriptors for the maximum number of work-groups that can be submitted to a range or nd_range parallel_for: max_num_work_groups_nd_range<N> and max_num_work_groups_range<N> for N = 1, 2, or 3. The device query returns an id<N> containing the bound in each dimension.
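As a sketch of how a user might consume the proposed query, the following plain C++ helper models the intended bounds check: a launch is valid only if its work-group count in every dimension stays within the queried limit. The function name is hypothetical, and the limits would come from the proposed descriptor rather than being passed in directly:

```cpp
#include <array>
#include <cstddef>

// Hypothetical bounds check modelling the proposed query: an nd_range
// launch is assumed valid only if the number of work-groups it needs in
// every dimension is within the id<N> the device query would return.
template <std::size_t N>
bool fits_device_limits(const std::array<std::size_t, N>& global_size,
                        const std::array<std::size_t, N>& local_size,
                        const std::array<std::size_t, N>& max_groups) {
    for (std::size_t d = 0; d < N; ++d) {
        // Work-groups needed in dimension d (rounded up).
        std::size_t groups =
            (global_size[d] + local_size[d] - 1) / local_size[d];
        if (groups > max_groups[d]) return false;
    }
    return true;
}
```

In real code, `max_groups` would be filled from `device.get_info<...max_num_work_groups_nd_range<N>>()` before deciding whether to submit the kernel as-is or fall back to a blocked launch.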

Justification

  • In the current revision of the spec, there is no limit on the iteration space size submitted to a parallel_for, implying that any size should be valid. In practice, launching with large sizes triggers backend-related failures.

  • Users rely on these values to check kernel bounds and often have to hard-code them; for example, Kokkos developers with the SYCL backend (see this PR) or this implementation of blocking/streaming kernels.

  • The query is already available for all GPU backends:

    • CUDA: CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_[X,Y,Z]
    • HIP: hipDeviceAttributeMaxGridDim[X,Y,Z]
    • Level Zero: ze_device_properties_t::maxGroupCount[X,Y,Z]
  • DPC++ already implements something similar as an extension

Notes

  • There are separate queries for nd_range and range for multiple reasons:

    • The semantics of range and nd_range differ, which might affect the maximum size of the iteration space
    • Implementers can map basic range kernels through higher-level optimization strategies, or map them directly onto the lower-level backend limits
  • For N=3, the mapping is straightforward, as it directly queries the backend functions with X, Y, Z. The mapping for N=1 and N=2 is less clear: implementers could choose the minimum of X, Y, Z, or compute a product of the dimensions, for example.

  • Although this PR is largely inspired by the DPC++ extension, that extension initially proposed max_global_work_groups, which is not actually queryable with CUDA/HIP; that is why it is not proposed here.

    • As a user, the main concern is ensuring that the kernel’s iteration range does not exceed the maximum values for each dimension. The rest should be implementation-defined (e.g., whether the mapping is direct or if blocking/streaming is used).
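The N=1 mapping ambiguity in the notes above can be made concrete. Both candidate reductions of the 3-D backend limits to a single 1-D limit are trivial to compute; the function names are illustrative, not part of the proposal:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Two candidate ways an implementation could derive a 1-D limit from the
// per-dimension 3-D backend limits (e.g. CUDA's max grid dimensions).

// Most conservative: bound by the most restrictive dimension.
std::uint64_t limit_1d_min(const std::array<std::uint64_t, 3>& max3) {
    return *std::min_element(max3.begin(), max3.end());
}

// Most permissive: assume the implementation can fold a 1-D iteration
// space across all three hardware dimensions.
std::uint64_t limit_1d_product(const std::array<std::uint64_t, 3>& max3) {
    return max3[0] * max3[1] * max3[2];
}
```

With typical CUDA-like limits (a very large X and much smaller Y and Z), the two choices differ by orders of magnitude, which is why the PR leaves the N=1 and N=2 mappings implementation-defined.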


CLAassistant commented Feb 11, 2025

CLA assistant check
All committers have signed the CLA.

= sycl_khr_max_num_work_groups

This extension allows developers to query the iteration bounds in each dimension for an ND-range or basic range kernel.
The implementation guarantees execution of an ND-range kernel if its global size is less than or equal to `max_num_work_groups_nd_range<N>` in each dimension. The same condition applies to basic range kernels with `max_num_work_groups_range<N>`.
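For illustration only (this sketch is not part of the extension text): when a range exceeds the queried limit, a user can fall back to the blocking/streaming approach mentioned in the PR justification by splitting the 1-D iteration space into chunks that each fit. The helper name and chunking strategy are hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split a 1-D iteration space [0, total) into consecutive
// (offset, length) chunks of at most max_size items, so each chunk
// can be submitted as a separate launch within the device limit.
std::vector<std::pair<std::size_t, std::size_t>>
split_into_chunks(std::size_t total, std::size_t max_size) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    for (std::size_t offset = 0; offset < total; offset += max_size) {
        chunks.emplace_back(offset, std::min(max_size, total - offset));
    }
    return chunks;
}
```

Each chunk would then be submitted as its own parallel_for, with the kernel adding the chunk offset to its index.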
Contributor

I don't think it makes sense to expose this query separately for ND-range and basic (range) kernels.

According to Section 3.7.2, SYCL kernel execution mode, work-groups only exist for ND-range kernels. A basic kernel is launched only with a number of work-items -- there is no way to specify the number of work-groups to use, nor to query the number of work-groups used by the implementation.

This is why the DPC++ extension only defines max_work_groups for ND-range kernels. I think we should align with that design if we're going to pursue this feature as a KHR.

Author

Indeed it doesn't make sense to talk about "work-groups" for basic kernels.
Still, the user cannot check the bounds of a basic range kernel before submission, and I think we should be able to query that, so the point still stands.

Maybe a renaming max_num_work_groups_range<N> --> max_global_size_range<N> or max_basic_range<N>? Or maybe this value already exists in the specification?

Contributor

There is not a device-level query, but users can already query the maximum range for a specific kernel using info::kernel_device_specific::global_work_size, which is defined in Table 135.

Author

I updated the PR by removing max_num_work_groups_range.
I also renamed max_num_work_groups_nd_range<N> --> max_num_work_groups<N> since it's implicit that it is the maximum for an ND-Range kernel (what about hierarchical kernels, should it be the same?)

Let me know if there should be other changes

Contributor

@TApplencourt commented Feb 12, 2025

There is not a device-level query, but users can already query the maximum range for a specific kernel using info::kernel_device_specific::global_work_size

I totally forgot about this query. Sorry, @Aympab! Looking at the table, it always returns range<3>. So if I want to submit a range<1>, I don't know the math to linearize it. Should we then add new queries to `kernel_device_specific`?

But no user code seems to be using it; I don't know what that means. Either kernel bundles scare people, people don't know the API, or people don't need it at all (i.e., they never use range, every implementation supports a large enough iteration space by default, or they have a different way of computing it).
