🐛 Describe the bug
We use the kernel-specific maximum work-group size to avoid platform compatibility issues. The routine is:
```cpp
auto kid = ::sycl::get_kernel_id<KernelClass>();
auto kbundle = ::sycl::get_kernel_bundle<::sycl::bundle_state::executable>(
    ctx, {dev}, {kid});
sycl::kernel k = kbundle.get_kernel(kid);
int max_work_group_size =
    k.get_info<::sycl::info::kernel_device_specific::work_group_size>(dev);
```
`sycl::get_kernel_bundle` incurs severe host overhead. The measured data are below:

Impacts: all kernels in torch-xpu-ops launched with the kernel-specific max work-group size are affected.
- The ~40 us overhead is unacceptable for some single-batch inference cases, where kernel latency can be below 10 us.
- For comparison, the CUDA runtime typically spends ~6 us on a kernel launch.
intel/llvm#15824
Versions
- torch-xpu-ops: latest main
- Intel DPC++ compiler/rt: 2024.1.3 (2024.1.3.20240604)