-
Notifications
You must be signed in to change notification settings - Fork 17
[lc0][SYCL] Use extensions to submit native commands when available #98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Use the wrapper function from infrastructure/SYCL.h to call either host_task or native command submission extensions when available.
Greetings @rafbiels , thanks for your contributions. I will test this internally and get back to you soon. |
I pulled your changes into our CI/CD and there is a performance degradation for SYCL on NVIDIA. At this time, I do not have bandwidth to investigate this. Do you have any data that you can share before and after the changes on the H100 / A100 ? |
Hi @jgtong, $ icpx --version
Intel(R) oneAPI DPC++/C++ Compiler 2025.1.0 (2025.1.0.20250317)
$ sycl-ls
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Xeon(R) Gold 5418Y OpenCL 3.0 (Build 0) [2025.19.3.0.17_230222]
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA H100 PCIe 9.0 [CUDA 12.4]
$ git checkout origin/main
$ git merge myfork/lc0-segfault # see PR 97, otherwise cannot run
$ CXX=icpx CC=icx ./buildSycl.sh -DUSE_NVIDIA_BACKEND=true -DUSE_SM=90
$ cp 791556.pb.gz build/release/.
$ ./build/release/lc0_sycl benchmark -b sycl
[...]
===========================
File load time (ms) : 144
Total Wall time (ms): 962
Total time (ms) : 794
Nodes searched : 33092
Nodes/second : 41625
Nodes/second (Wall) : 40455
$ git merge myfork/lc0-nativecmd
$ CXX=icpx CC=icx ./buildSycl.sh -DUSE_NVIDIA_BACKEND=true -DUSE_SM=90
$ ./build/release/lc0_sycl benchmark -b sycl
[...]
===========================
File load time (ms) : 144
Total Wall time (ms): 887
Total time (ms) : 720
Nodes searched : 33133
Nodes/second : 45954
Nodes/second (Wall) : 44594 The new version (this PR) gives 10% better performance, i.e. processes 10% more Nodes per second. |
I ran your changes on the H100 and noticed that the workload is producing incorrect results:
The environment config is the following:
|
Hi @jgtong,
and I could not reproduce the issue on any of them. The benchmark is using an in-order queue, so I don't expect any synchronisation issues even if the code wasn't synchronising properly (but I believe it is correct). Some questions to help narrow this down:
|
Use the wrapper function from
infrastructure/SYCL.h
(introduced in #95) to call eitherhost_task
or native command submission extensions when available.All changes are exclusively in the code paths taken by CUDA and HIP backends, so other backends are not affected.
This improves the performance of the SYCL benchmark with CUDA and HIP backends.