-
Notifications
You must be signed in to change notification settings - Fork 801
[SYCL] Implement free function kernel enqueue functions #20698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: sycl
Are you sure you want to change the base?
Conversation
sycl/include/sycl/ext/oneapi/experimental/enqueue_functions.hpp
Outdated
Show resolved
Hide resolved
| submit(Q, [&](handler &CGH) { | ||
| single_task(CGH, KernelFunc, std::forward<ArgsT>(Args)...); | ||
| }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a submit_direct* version of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a
submit_direct*version of this? Please sync with @slawekptak to implement it properly from the start rather than create more future work for him.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, there is no submit_direct* version of this in the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can have it in detail:: still. Also, queue::* itself can act as submit_direct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose to keep it like this rather than use something like queue::single_task in order to not deviate from the implementation of the other functions in this file where the queue version of the function delegates to the handler version of the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. But it means that we need to create this infrastructure. My initial understanding is that the implementation of the free function kernel enqueue should be implemented using handler-less path (submit_kernel_direct_*). Do we have examples where we really need a handler in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tagging @gmlueck. If need be, I will revamp this PR to implement handler-less path for free function kernels but I'd like to consult the spec writers first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the question? Are you asking if it is important to optimize the "submit" functions that take free-function kernels? The answer to that is "yes". In fact, the team asking for the free-function kernels is the same team that wants to reduce the launch overhead. Therefore, I'm certain that they will also care about the launch overhead when using free function kernels.
If the question is specifically about single_task, then the answer is less clear. I doubt that team will use single_task. However, they will definitely use nd_launch, so we should optimize that case. If you add the optimized code for nd_launch, will it be easy to do the same thing for single_task? If so, it seems like you may as well optimize them both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a question of the implementation, not about the spec. Did I miss something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spec team can give insight into the intention of the end-user client that motivated the design of the spec which can potentially impact implementation decisions. Greg's response cleared this up for me.
| queue Q = CGH.getQueue(); | ||
| sycl::kernel_bundle Bndl = | ||
| get_kernel_bundle<Func, sycl::bundle_state::executable>(Q.get_context()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This creates and destroys two std::shared_ptrs for almost no reason. IMO, we should fix this getQueue() hack while we're in an ABI breaking window. Maybe by changing handler_impl to store a reference to the sycl::queue it was created with? handler_impl::MQueueOrGraph isn't used directly outside a few getters, so the change should be very simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate a bit please?
What shared pointers are you referring to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
llvm/sycl/include/sycl/queue.hpp
Lines 3744 to 3745 in bb941bd
| std::shared_ptr<detail::queue_impl> impl; | |
| queue(std::shared_ptr<detail::queue_impl> impl) : impl(impl) {} |
and similar for
sycl::context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, based on the discussion in #20764 and the fact that these remarks are in my opinion out of scope for this PR, I suggest these be handled separately. The getQueue and getContext functions used in my code here should then automatically reap the benefits of that refactoring without requiring a lot of changes(hopefully none).
sycl/include/sycl/handler.hpp
Outdated
| template <auto *> struct kernel_function_s; | ||
| template <auto *Func, typename... Args> | ||
| void single_task(handler &, kernel_function_s<Func>, Args &&...); | ||
| template <auto *Func, int Dimensions, typename... Args> | ||
| void nd_launch(handler &, nd_range<Dimensions>, kernel_function_s<Func>, | ||
| Args &&...); | ||
| template <auto *Func, int Dimensions, typename Properties, typename... Args> | ||
| void nd_launch(handler &, launch_config<nd_range<Dimensions>, Properties>, | ||
| kernel_function_s<Func>, Args &&...); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is all of that just for handler::getQueue()? Can you extend
llvm/sycl/include/sycl/handler.hpp
Lines 3226 to 3280 in 259c433
| namespace detail { | |
| class HandlerAccess { | |
| public: | |
| static void internalProfilingTagImpl(handler &Handler) { | |
| Handler.internalProfilingTagImpl(); | |
| } | |
| template <typename RangeT, typename PropertiesT> | |
| static void parallelForImpl(handler &Handler, RangeT Range, PropertiesT Props, | |
| kernel Kernel) { | |
| Handler.parallel_for_impl(Range, Props, Kernel); | |
| } | |
| static void swap(handler &LHS, handler &RHS) { | |
| std::swap(LHS.implOwner, RHS.implOwner); | |
| std::swap(LHS.impl, RHS.impl); | |
| std::swap(LHS.MLocalAccStorage, RHS.MLocalAccStorage); | |
| std::swap(LHS.MStreamStorage, RHS.MStreamStorage); | |
| std::swap(LHS.MKernelName, RHS.MKernelName); | |
| std::swap(LHS.MKernel, RHS.MKernel); | |
| std::swap(LHS.MSrcPtr, RHS.MSrcPtr); | |
| std::swap(LHS.MDstPtr, RHS.MDstPtr); | |
| std::swap(LHS.MLength, RHS.MLength); | |
| std::swap(LHS.MPattern, RHS.MPattern); | |
| std::swap(LHS.MHostKernel, RHS.MHostKernel); | |
| std::swap(LHS.MCodeLoc, RHS.MCodeLoc); | |
| } | |
| // pre/postProcess are used only for reductions right now, but the | |
| // abstractions they provide aren't reduction-specific. The main problem they | |
| // solve is | |
| // | |
| // # User code | |
| // q.submit([&](handler &cgh) { | |
| // set_dependencies(cgh); | |
| // enqueue_whatever(cgh); | |
| // }); // single submission | |
| // | |
| // that needs to be implemented as multiple enqueues involving | |
| // pre-/post-processing internally. SYCL prohibits recursive submits from | |
| // inside control group function object (lambda above) so we need some | |
| // internal interface to implement that. | |
| __SYCL_EXPORT static void preProcess(handler &CGH, type_erased_cgfo_ty F); | |
| __SYCL_EXPORT static void postProcess(handler &CGH, type_erased_cgfo_ty F); | |
| template <class FunctorTy> | |
| static void preProcess(handler &CGH, FunctorTy &Func) { | |
| preProcess(CGH, type_erased_cgfo_ty{Func}); | |
| } | |
| template <class FunctorTy> | |
| static void postProcess(handler &CGH, FunctorTy &Func) { | |
| postProcess(CGH, type_erased_cgfo_ty{Func}); | |
| } | |
| }; | |
| } // namespace detail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that has been added to access getQueue. Now that you've brought HandlerAccess to my attention, it seems like a better solution so I'll try to migrate it over there instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added the getQueue function to HandlerAccess that just dispatches to the getQueue function of the handler.
…llvm into enqueue_free_functions
|
@vinser52 ping for review. I went with an approach where I wrap the free function in a lambda in order to exploit the already existing infrastructure for direct submission of lambda kernels. The wrapper lambda itself is very lightweight. |
Implement the new enqueue functions for free function kernels that were added in #19995