🐛 Describe the bug
We use the kernel-specific maximum work-group size to avoid platform compatibility issues. The routine is:
```cpp
auto kid = ::sycl::get_kernel_id<KernelClass>();
auto kbundle = ::sycl::get_kernel_bundle<::sycl::bundle_state::executable>(
    ctx, {dev}, {kid});
sycl::kernel k = kbundle.get_kernel(kid);
int max_work_group_size =
    k.get_info<::sycl::info::kernel_device_specific::work_group_size>(dev);
```
`sycl::get_kernel_bundle` incurs severe host overhead. The measured data are below:

Impacts: all kernels in torch-xpu-ops launched with the kernel-specific max work-group size are affected.
- The ~40 us overhead is unacceptable for some single-batch inference cases, where kernel latency can be below 10 us.
- For comparison, the CUDA runtime typically spends ~6 us on a kernel launch.
intel/llvm#15824
Versions
- torch-xpu-ops: latest main
- Intel DPC++ compiler/rt: 2024.1.3 (2024.1.3.20240604)