[SYCL] Implement free function kernel enqueue functions #20698

lbushi25 · 2025-11-20T05:55:40Z

Implement the new enqueue functions for free function kernels that were added in #19995.

sycl/include/sycl/ext/oneapi/experimental/enqueue_functions.hpp

aelovikov-intel · 2025-11-25T20:32:53Z

sycl/include/sycl/ext/oneapi/experimental/enqueue_functions.hpp

+  submit(Q, [&](handler &CGH) {
+    single_task(CGH, KernelFunc, std::forward<ArgsT>(Args)...);
+  });


Do we have a submit_direct* version of this? Please sync with @slawekptak to implement it properly from the start rather than create more future work for him.

No, there is no submit_direct* version of this in the spec.

We can have it in detail:: still. Also, queue::* itself can act as submit_direct.

I kept it this way, rather than calling something like queue::single_task, so that it stays consistent with the other functions in this file, where the queue version of a function delegates to the handler version.
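For readers following the thread, here is a minimal sketch of the delegation pattern being described. It uses stand-in types only: `fake_queue`, `fake_handler`, and this `submit` are hypothetical and are not the SYCL classes or the code in this PR.

```cpp
#include <cassert>
#include <utility>

// Stand-ins for illustration only; NOT sycl::handler or sycl::queue.
struct fake_handler { int launches = 0; };
struct fake_queue { int submissions = 0; };

// Stand-in for submit(): builds a handler, runs the command group, counts it.
template <typename CGF> void submit(fake_queue &Q, CGF &&Cgf) {
  fake_handler CGH;
  Cgf(CGH);
  Q.submissions += 1;
}

// Handler version: does the actual work.
template <typename KernelT, typename... ArgsT>
void single_task(fake_handler &CGH, KernelT KernelFunc, ArgsT &&...Args) {
  KernelFunc(std::forward<ArgsT>(Args)...);
  ++CGH.launches;
}

// Queue version: only wraps the handler version in a submit call.
template <typename KernelT, typename... ArgsT>
void single_task(fake_queue &Q, KernelT KernelFunc, ArgsT &&...Args) {
  submit(Q, [&](fake_handler &CGH) {
    single_task(CGH, KernelFunc, std::forward<ArgsT>(Args)...);
  });
}
```

Calling `single_task(Q, kernel, args...)` on a `fake_queue` goes through `submit` exactly once, which is the per-launch handler cost the reviewers are asking about.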

Correct. But it means that we need to create this infrastructure. My initial understanding is that the free function kernel enqueue functions should be implemented using the handler-less path (submit_kernel_direct_*). Do we have examples where we really need a handler in this case?

Tagging @gmlueck. If need be, I will revamp this PR to implement the handler-less path for free function kernels, but I'd like to consult the spec writers first.

What is the question? Are you asking if it is important to optimize the "submit" functions that take free-function kernels? The answer to that is "yes". In fact, the team asking for the free-function kernels is the same team that wants to reduce the launch overhead. Therefore, I'm certain that they will also care about the launch overhead when using free function kernels.

If the question is specifically about single_task, then the answer is less clear. I doubt that team will use single_task. However, they will definitely use nd_launch, so we should optimize that case. If you add the optimized code for nd_launch, will it be easy to do the same thing for single_task? If so, it seems like you may as well optimize them both.

It is a question of the implementation, not about the spec. Did I miss something?

The spec team can give insight into the intentions of the end users who motivated the design of the spec, which can potentially affect implementation decisions. Greg's response cleared this up for me.

aelovikov-intel · 2025-11-25T20:44:43Z

sycl/include/sycl/ext/oneapi/experimental/enqueue_functions.hpp

+  queue Q = CGH.getQueue();
+  sycl::kernel_bundle Bndl =
+      get_kernel_bundle<Func, sycl::bundle_state::executable>(Q.get_context());


This creates and destroys two std::shared_ptrs for almost no reason. IMO, we should fix this getQueue() hack while we're in an ABI breaking window. Maybe by changing handler_impl to store a reference to the sycl::queue it was created with? handler_impl::MQueueOrGraph isn't used directly outside a few getters, so the change should be very simple.

Can you elaborate a bit please?
What shared pointers are you referring to?

llvm/sycl/include/sycl/queue.hpp

Lines 3744 to 3745 in bb941bd

std::shared_ptr<detail::queue_impl> impl;

queue(std::shared_ptr<detail::queue_impl> impl) : impl(impl) {}

and similar for sycl::context.
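As a rough illustration of the cost being pointed at, the sketch below uses a stand-in class with the same constructor shape as the quoted lines. `queue_impl` and `queue_like` here are hypothetical and are not the SYCL types. Constructing the wrapper copies the `std::shared_ptr`, so the owner count goes up while the temporary is alive and drops again when it is destroyed.

```cpp
#include <cassert>
#include <memory>

// Stand-ins for illustration only; NOT the real SYCL classes.
struct queue_impl {};

class queue_like {
  std::shared_ptr<queue_impl> impl;

public:
  // Same shape as the quoted constructor: by-value parameter copied into
  // the member, so each construction touches the reference count.
  explicit queue_like(std::shared_ptr<queue_impl> impl) : impl(impl) {}
  long owners() const { return impl.use_count(); }
};

// Returns the owner count observed while a temporary wrapper is alive.
long owners_while_wrapped(const std::shared_ptr<queue_impl> &stored) {
  queue_like tmp(stored); // copies the shared_ptr
  return tmp.owners();    // 'stored' plus the copy held by 'tmp'
}
```

With a single existing owner, `owners_while_wrapped` reports two owners, and the count returns to one after the call. Storing a reference instead, as suggested in the comment above, would avoid these copies.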

Based on the discussion in #20764, and because these remarks are in my opinion out of scope for this PR, I suggest handling them separately. The getQueue and getContext functions used in my code here should then automatically reap the benefits of that refactoring without requiring many changes (hopefully none).

aelovikov-intel · 2025-11-25T20:50:10Z

sycl/include/sycl/handler.hpp

+template <auto *> struct kernel_function_s;
+template <auto *Func, typename... Args>
+void single_task(handler &, kernel_function_s<Func>, Args &&...);
+template <auto *Func, int Dimensions, typename... Args>
+void nd_launch(handler &, nd_range<Dimensions>, kernel_function_s<Func>,
+               Args &&...);
+template <auto *Func, int Dimensions, typename Properties, typename... Args>
+void nd_launch(handler &, launch_config<nd_range<Dimensions>, Properties>,
+               kernel_function_s<Func>, Args &&...);


Is all of that just for handler::getQueue()? Can you extend

llvm/sycl/include/sycl/handler.hpp

Lines 3226 to 3280 in 259c433

namespace detail {
class HandlerAccess {
public:
  static void internalProfilingTagImpl(handler &Handler) {
    Handler.internalProfilingTagImpl();
  }

  template <typename RangeT, typename PropertiesT>
  static void parallelForImpl(handler &Handler, RangeT Range, PropertiesT Props,
                              kernel Kernel) {
    Handler.parallel_for_impl(Range, Props, Kernel);
  }

  static void swap(handler &LHS, handler &RHS) {
    std::swap(LHS.implOwner, RHS.implOwner);
    std::swap(LHS.impl, RHS.impl);
    std::swap(LHS.MLocalAccStorage, RHS.MLocalAccStorage);
    std::swap(LHS.MStreamStorage, RHS.MStreamStorage);
    std::swap(LHS.MKernelName, RHS.MKernelName);
    std::swap(LHS.MKernel, RHS.MKernel);
    std::swap(LHS.MSrcPtr, RHS.MSrcPtr);
    std::swap(LHS.MDstPtr, RHS.MDstPtr);
    std::swap(LHS.MLength, RHS.MLength);
    std::swap(LHS.MPattern, RHS.MPattern);
    std::swap(LHS.MHostKernel, RHS.MHostKernel);
    std::swap(LHS.MCodeLoc, RHS.MCodeLoc);
  }

  // pre/postProcess are used only for reductions right now, but the
  // abstractions they provide aren't reduction-specific. The main problem they
  // solve is
  //
  //   # User code
  //   q.submit([&](handler &cgh) {
  //     set_dependencies(cgh);
  //     enqueue_whatever(cgh);
  //   }); // single submission
  //
  // that needs to be implemented as multiple enqueues involving
  // pre-/post-processing internally. SYCL prohibits recursive submits from
  // inside control group function object (lambda above) so we need some
  // internal interface to implement that.
  __SYCL_EXPORT static void preProcess(handler &CGH, type_erased_cgfo_ty F);
  __SYCL_EXPORT static void postProcess(handler &CGH, type_erased_cgfo_ty F);

  template <class FunctorTy>
  static void preProcess(handler &CGH, FunctorTy &Func) {
    preProcess(CGH, type_erased_cgfo_ty{Func});
  }
  template <class FunctorTy>
  static void postProcess(handler &CGH, FunctorTy &Func) {
    postProcess(CGH, type_erased_cgfo_ty{Func});
  }
};
} // namespace detail

instead?

Yes, that has been added to access getQueue. Now that you've brought HandlerAccess to my attention, it seems like a better solution so I'll try to migrate it over there instead.

I've added the getQueue function to HandlerAccess that just dispatches to the getQueue function of the handler.
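A minimal sketch of that accessor pattern, assuming stand-in types: `fake_handler` and `fake_queue` are hypothetical, and only the name `HandlerAccess` and the idea of a forwarding `getQueue` come from the thread. One friend class exposes the private getter, so the free functions need no friend declarations of their own.

```cpp
#include <cassert>

// Hypothetical stand-ins; NOT the real sycl::handler or sycl::queue.
struct fake_queue { int id; };

namespace detail {
class HandlerAccess;
}

class fake_handler {
  fake_queue MQueue;
  // Private getter, not part of the public interface.
  fake_queue getQueue() const { return MQueue; }
  // A single friend replaces per-function friend declarations.
  friend class detail::HandlerAccess;

public:
  explicit fake_handler(fake_queue Q) : MQueue(Q) {}
};

namespace detail {
class HandlerAccess {
public:
  // Simply dispatches to the handler's private getter.
  static fake_queue getQueue(const fake_handler &H) { return H.getQueue(); }
};
} // namespace detail

// A free function that would otherwise need to be declared a friend.
int queue_id_of(const fake_handler &H) {
  return detail::HandlerAccess::getQueue(H).id;
}
```

With this shape, forward declarations of `single_task` and `nd_launch` inside the handler header are no longer needed just to grant access.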

lbushi25 · 2025-12-11T01:30:12Z

@vinser52 ping for review. I went with an approach where I wrap the free function in a lambda in order to exploit the already existing infrastructure for direct submission of lambda kernels. The wrapper lambda itself is very lightweight.
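The general idea can be sketched as follows. This is an illustration under assumptions, not the code in this PR: `wrap_free_function`, `run_callable`, and `scale` are hypothetical names, and `run_callable` merely stands in for infrastructure that accepts any callable.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: wraps a function known at compile time in a lambda.
// Func is a template parameter, so the lambda captures nothing.
template <auto *Func> auto wrap_free_function() {
  return [](auto &&...Args) { Func(std::forward<decltype(Args)>(Args)...); };
}

// Stand-in for a free function kernel.
void scale(int *Data, int Factor) { *Data *= Factor; }

// Stand-in for existing infrastructure that only knows about callables.
template <typename KernelT, typename... ArgsT>
void run_callable(KernelT Kernel, ArgsT &&...Args) {
  Kernel(std::forward<ArgsT>(Args)...);
}
```

For example, `run_callable(wrap_free_function<scale>(), &x, 4)` multiplies `x` by 4 through the wrapper. Because the lambda has no captures, it carries no state of its own, which is one reason such a wrapper can be cheap.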

Implement free function kernel enqueue functions

b5ac52a

lbushi25 requested a review from a team as a code owner November 20, 2025 05:55

lbushi25 requested a review from cperkinsintel November 20, 2025 05:55

Remove unused code

63d860c

Improve comments

00e0f0d

Fix LIT command typo

cd92d0c

Fix compilation error

4621ff6

Fix unused argument error

76e0f8b

Fix unit-tests failures

ce2a16b

Fix formatting

e88b0f9

Add XFAIL for native CPU

b1b3ce9

Add more tests

8b685ea

aelovikov-intel requested changes Nov 25, 2025

gmlueck mentioned this pull request Dec 1, 2025

[sycl free func] The example in sycl_ext_oneapi_free_function_kernels.asciidoc can't be compiled successfully #20751

Open

lbushi25 added 3 commits December 1, 2025 21:28

Apply requested changes

0b6a0ac

Merge branch 'enqueue_free_functions' of https://github.com/lbushi25/…

a7c592e

…llvm into enqueue_free_functions

Some more refactoring

c65ffbc

lbushi25 requested a review from aelovikov-intel December 2, 2025 06:23

lbushi25 requested review from slawekptak and vinser52 December 2, 2025 16:57

Apply feedback

5b7c7de

Remove dead code

356b55a

Remove more dead code

9830ba4

Add more tests

34b0b17

