
[lc0][SYCL] Use extensions to submit native commands when available #98


Open
wants to merge 1 commit into main

Conversation

Contributor

rafbiels commented May 6, 2025

Use the wrapper function from infrastructure/SYCL.h (introduced in #95) to call either host_task or the native command submission extension, depending on availability.

All changes are exclusively in the code paths taken by the CUDA and HIP backends, so other backends are not affected.

This improves the performance of the SYCL benchmark with the CUDA and HIP backends.
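
For illustration, a minimal sketch of the kind of dispatch such a wrapper can perform (hypothetical names and structure; the actual implementation is the wrapper in infrastructure/SYCL.h from #95). The extension path assumes DPC++'s ext_codeplay_enqueue_native_command and its feature-test macro; the fallback is plain host_task:

#include <sycl/sycl.hpp>

// Hypothetical wrapper; illustrative only.
template <typename NativeFunc>
sycl::event submitNativeCommand(sycl::queue &q, NativeFunc f) {
#if defined(SYCL_EXT_CODEPLAY_ENQUEUE_NATIVE_COMMAND)
  // Extension path: native work enqueued by the callable goes straight
  // into the backend stream; no host thread sits in the critical path.
  return q.submit([&](sycl::handler &cgh) {
    cgh.ext_codeplay_enqueue_native_command(
        [=](sycl::interop_handle ih) { f(ih); });
  });
#else
  // Fallback path: host_task runs the callable on a host thread, adding
  // submission latency on every call.
  return q.submit([&](sycl::handler &cgh) {
    cgh.host_task([=](sycl::interop_handle ih) { f(ih); });
  });
#endif
}

Inside the callable, the native stream can be obtained via ih.get_native_queue<sycl::backend::ext_oneapi_cuda>() (or the HIP equivalent) and used to launch native work such as cuBLAS/cuDNN calls directly.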

Contributor

jgtong commented May 6, 2025

Greetings @rafbiels, thanks for your contributions. I will test this internally and get back to you soon.

Contributor

jgtong commented May 7, 2025

@rafbiels

I pulled your changes into our CI/CD and saw a performance degradation for SYCL on NVIDIA. At this time, I do not have the bandwidth to investigate this.

Do you have any data that you can share from before and after the changes on the H100/A100?

Contributor Author

rafbiels commented May 8, 2025

Hi @jgtong,
here are my commands and outputs on H100:

$ icpx --version
Intel(R) oneAPI DPC++/C++ Compiler 2025.1.0 (2025.1.0.20250317)
$ sycl-ls
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Xeon(R) Gold 5418Y OpenCL 3.0 (Build 0) [2025.19.3.0.17_230222]
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA H100 PCIe 9.0 [CUDA 12.4]
$ git checkout origin/main
$ git merge myfork/lc0-segfault # see PR 97, otherwise cannot run
$ CXX=icpx CC=icx ./buildSycl.sh -DUSE_NVIDIA_BACKEND=true -DUSE_SM=90
$ cp 791556.pb.gz build/release/.
$ ./build/release/lc0_sycl benchmark -b sycl
[...]
===========================
File load time (ms) :    144
Total Wall time (ms):    962
Total time (ms)     :    794
Nodes searched      :    33092
Nodes/second        :    41625
Nodes/second (Wall) :    40455
$ git merge myfork/lc0-nativecmd 
$ CXX=icpx CC=icx ./buildSycl.sh -DUSE_NVIDIA_BACKEND=true -DUSE_SM=90
$ ./build/release/lc0_sycl benchmark -b sycl
[...]
===========================
File load time (ms) :    144
Total Wall time (ms):    887
Total time (ms)     :    720
Nodes searched      :    33133
Nodes/second        :    45954
Nodes/second (Wall) :    44594

The new version (this PR) gives ~10% better performance, i.e. it processes ~10% more nodes per second (45954 vs 41625).

Contributor

jgtong commented May 12, 2025

@rafbiels

I ran your changes on the H100 and noticed that the workload is producing incorrect results:

bestmove g1f3 << The correct result should be e2e4
Results are incorrect!

===========================
File load time (ms) :    127
Total Wall time (ms):    743
Total time (ms)     :    591
Nodes searched      :    33003
Nodes/second        :    55748
Nodes/second (Wall) :    53576

The environment config is the following:
CUDA: 12.6
clang version:

clang version 21.0.0git (https://github.com/intel/llvm 83b42bc4bbed3635b0d4274d58c2230bb82dbc87)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /localdisk/jaytong/sycl_workspace/llvm/build/bin
Build config: +assertions

Contributor Author

Hi @jgtong,
I re-tested this with the same compiler version as yours on the following systems:

  • NVIDIA H100 + Intel Xeon Gold 5418Y, CUDA Toolkit 12.8, cuDNN 9.7.0, CUDA driver 550 (12.4)
  • NVIDIA A100 + Intel Xeon Gold 6326, CUDA Toolkit 12.8, cuDNN 9.7.0, CUDA driver 555 (12.5)
  • NVIDIA GeForce RTX 3060 + Intel Core i9-12900K, CUDA Toolkit 12.8, cuDNN 9.4.0, CUDA Driver 570 (12.8)
  • NVIDIA GeForce RTX 4060 Ti + Intel Core i9-12900K, CUDA Toolkit 12.8, cuDNN 9.7.0, CUDA driver 570 (12.8)
  • NVIDIA TITAN RTX + Intel Xeon Platinum 8268, CUDA Toolkit 12.8, cuDNN 9.7.0, CUDA driver 550 (12.4)
  • NVIDIA L40 + Intel Xeon Gold 6448Y, CUDA Toolkit 12.2, cuDNN 9.0.0, CUDA driver 570 (12.8)
  • AMD MI210 + 2x AMD EPYC 7402, ROCm 6.3.3, amdgpu driver from Linux kernel 6.2.0
  • AMD Radeon PRO W6800 + Intel Core i9-12900K, ROCm 6.3.3, amdgpu driver from Linux kernel 6.5.0

and I could not reproduce the issue on any of them. The benchmark uses an in-order queue, so I wouldn't expect synchronisation issues even if the code weren't synchronising properly (though I believe it is correct).
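
As a minimal standalone illustration (not lc0 code) of that in-order property:

#include <sycl/sycl.hpp>

int main() {
  // An in-order queue serialises submissions: B starts only after A has
  // completed, whether A is a kernel, a host_task, or a native command.
  sycl::queue q{sycl::property::queue::in_order{}};
  q.single_task([] { /* command A */ });  // runs first
  q.single_task([] { /* command B */ });  // ordered after A
  q.wait();
}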

Some questions to help narrow this down:

  1. Are you testing the HEAD of the main branch with this PR merged locally?
  2. Is your compiler exactly the listed version (equivalent to tag nightly-2025-05-13) without any modifications?
  3. What CPU are you using?
  4. Can you think of any other differences between our setups?
