
[lc0][SYCL] Use extensions to submit native commands when available #98


Open
wants to merge 1 commit into main

Conversation

Contributor

rafbiels commented May 6, 2025

Use the wrapper function from infrastructure/SYCL.h (introduced in #95) to call either host_task or the native command submission extension, depending on availability.

All changes are exclusively in the code paths taken by the CUDA and HIP backends, so other backends are not affected.

This improves the performance of the SYCL benchmark with the CUDA and HIP backends.
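
For illustration, a minimal sketch of the kind of dispatch such a wrapper can perform (hypothetical names and structure; the actual implementation is the wrapper in infrastructure/SYCL.h from #95). The extension path assumes DPC++'s ext_codeplay_enqueue_native_command and its feature-test macro; the fallback is plain host_task:

#include <sycl/sycl.hpp>

// Hypothetical wrapper; illustrative only.
template <typename NativeFunc>
sycl::event submitNativeCommand(sycl::queue &q, NativeFunc f) {
#if defined(SYCL_EXT_CODEPLAY_ENQUEUE_NATIVE_COMMAND)
  // Extension path: native work enqueued by the callable goes straight
  // into the backend stream; no host thread sits in the critical path.
  return q.submit([&](sycl::handler &cgh) {
    cgh.ext_codeplay_enqueue_native_command(
        [=](sycl::interop_handle ih) { f(ih); });
  });
#else
  // Fallback path: host_task runs the callable on a host thread, adding
  // submission latency on every call.
  return q.submit([&](sycl::handler &cgh) {
    cgh.host_task([=](sycl::interop_handle ih) { f(ih); });
  });
#endif
}

Inside the callable, the native stream can be obtained via ih.get_native_queue<sycl::backend::ext_oneapi_cuda>() (or the HIP equivalent) and used to launch native work such as cuBLAS/cuDNN calls directly.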

Contributor

jgtong commented May 6, 2025

Greetings @rafbiels, thanks for your contributions. I will test this internally and get back to you soon.

Contributor

jgtong commented May 7, 2025

@rafbiels

I pulled your changes into our CI/CD and saw a performance degradation for SYCL on NVIDIA. At this time, I do not have the bandwidth to investigate this.

Do you have any data that you can share from before and after the changes on the H100/A100?

Contributor Author

rafbiels commented May 8, 2025

Hi @jgtong,
here are my commands and outputs on H100:

$ icpx --version
Intel(R) oneAPI DPC++/C++ Compiler 2025.1.0 (2025.1.0.20250317)
$ sycl-ls
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Xeon(R) Gold 5418Y OpenCL 3.0 (Build 0) [2025.19.3.0.17_230222]
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA H100 PCIe 9.0 [CUDA 12.4]
$ git checkout origin/main
$ git merge myfork/lc0-segfault # see PR 97, otherwise cannot run
$ CXX=icpx CC=icx ./buildSycl.sh -DUSE_NVIDIA_BACKEND=true -DUSE_SM=90
$ cp 791556.pb.gz build/release/.
$ ./build/release/lc0_sycl benchmark -b sycl
[...]
===========================
File load time (ms) :    144
Total Wall time (ms):    962
Total time (ms)     :    794
Nodes searched      :    33092
Nodes/second        :    41625
Nodes/second (Wall) :    40455
$ git merge myfork/lc0-nativecmd 
$ CXX=icpx CC=icx ./buildSycl.sh -DUSE_NVIDIA_BACKEND=true -DUSE_SM=90
$ ./build/release/lc0_sycl benchmark -b sycl
[...]
===========================
File load time (ms) :    144
Total Wall time (ms):    887
Total time (ms)     :    720
Nodes searched      :    33133
Nodes/second        :    45954
Nodes/second (Wall) :    44594

The new version (this PR) gives ~10% better performance, i.e. it processes ~10% more nodes per second (45954 vs 41625).

Contributor

jgtong commented May 12, 2025

@rafbiels

I ran your changes on the H100 and noticed that the workload is producing incorrect results:

bestmove g1f3 << The correct result should be e2e4
Results are incorrect!

===========================
File load time (ms) :    127
Total Wall time (ms):    743
Total time (ms)     :    591
Nodes searched      :    33003
Nodes/second        :    55748
Nodes/second (Wall) :    53576

The environment config is the following:
CUDA: 12.6
clang version:

clang version 21.0.0git (https://github.com/intel/llvm 83b42bc4bbed3635b0d4274d58c2230bb82dbc87)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /localdisk/jaytong/sycl_workspace/llvm/build/bin
Build config: +assertions

Contributor Author

Hi @jgtong,
I re-tested this with the same compiler version as yours on the following systems:

  • NVIDIA H100 + Intel Xeon Gold 5418Y, CUDA Toolkit 12.8, cuDNN 9.7.0, CUDA driver 550 (12.4)
  • NVIDIA A100 + Intel Xeon Gold 6326, CUDA Toolkit 12.8, cuDNN 9.7.0, CUDA driver 555 (12.5)
  • NVIDIA GeForce RTX 3060 + Intel Core i9-12900K, CUDA Toolkit 12.8, cuDNN 9.4.0, CUDA Driver 570 (12.8)
  • NVIDIA GeForce RTX 4060 Ti + Intel Core i9-12900K, CUDA Toolkit 12.8, cuDNN 9.7.0, CUDA driver 570 (12.8)
  • NVIDIA TITAN RTX + Intel Xeon Platinum 8268, CUDA Toolkit 12.8, cuDNN 9.7.0, CUDA driver 550 (12.4)
  • NVIDIA L40 + Intel Xeon Gold 6448Y, CUDA Toolkit 12.2, cuDNN 9.0.0, CUDA driver 570 (12.8)
  • AMD MI210 + 2x AMD EPYC 7402, ROCm 6.3.3, amdgpu driver from Linux kernel 6.2.0
  • AMD Radeon PRO W6800 + Intel Core i9-12900K, ROCm 6.3.3, amdgpu driver from Linux kernel 6.5.0

and I could not reproduce the issue on any of them. The benchmark uses an in-order queue, so I wouldn't expect synchronisation issues even if the code weren't synchronising properly (though I believe it is correct).
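
As a minimal standalone illustration (not lc0 code) of that in-order property:

#include <sycl/sycl.hpp>

int main() {
  // An in-order queue serialises submissions: B starts only after A has
  // completed, whether A is a kernel, a host_task, or a native command.
  sycl::queue q{sycl::property::queue::in_order{}};
  q.single_task([] { /* command A */ });  // runs first
  q.single_task([] { /* command B */ });  // ordered after A
  q.wait();
}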

Some questions to help narrow this down:

  1. Are you testing the HEAD of the main branch with this PR merged locally?
  2. Is your compiler exactly the listed version (equivalent to tag nightly-2025-05-13) without any modifications?
  3. What CPU are you using?
  4. Can you think of any other differences between our setups?
