
Update README #137

Open · wants to merge 13 commits into base: `sycl-develop`
68 changes: 0 additions & 68 deletions README-sycl.md

This file was deleted.

6 changes: 6 additions & 0 deletions README.md
@@ -41,6 +41,12 @@ and improves code composability and readability. More documentation specific to

In addition to GEMMs, CUTLASS implements high-performance convolution via the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution operation as a GEMM thereby taking advantage of CUTLASS's modular GEMM pipeline. This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components.

## CUTLASS with SYCL
The CUTLASS 3.0 API now also supports SYCL, allowing CUTLASS to run on Nvidia GPUs (up to the Ampere architecture) and Intel Xe core architecture GPUs via the SYCL backend, using the Intel open-source `DPC++` compiler.
Support is currently limited to GEMMs. See the [Quick Start Guide](./media/docs/build/building_with_sycl_support.md) for how to build and run
examples using the SYCL backend.

To learn more about SYCL and Intel's SYCL compilers, see [Intel oneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/data-parallel-c-plus-plus.html).

# What's New in CUTLASS 3.5

115 changes: 45 additions & 70 deletions media/docs/build/building_with_sycl_support.md
@@ -1,6 +1,3 @@

[README](../../../README.md#documentation) > **CUTLASS 3: Building with SYCL Support**

# Building with SYCL Support

Cutlass 3 can be built with SYCL using the DPC++ compiler, enabling Cutlass
@@ -12,93 +9,71 @@ code for heterogeneous and offload processors to be written with modern
ISO C++. It provides APIs and abstractions to find devices and manage
resources for GPUs.

## Installing the DPC++ Compiler
You can use a nightly build of the DPC++ compiler, with prebuilt packages available
[here](https://github.com/intel/llvm/releases), or build it from source as
[described here](https://github.com/intel/llvm/blob/sycl/sycl/doc/GetStartedGuide.md).

If building from source for the CUDA backend, a minimum CUDA toolkit version of 12.3 is recommended.
If using the pre-built nightlies, a nightly dated no older than 2024-08-02 is required.

## Building and Running on the SYCL backend
To build with the Intel open-source `DPC++` compiler when using the SYCL backend:
```bash
$ mkdir build && cd build

$ cmake -DCMAKE_CXX_COMPILER=/path/to/dpcpp/clang++ -DCMAKE_C_COMPILER=/path/to/dpcpp/clang -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=nvptx64-nvidia-cuda -DDPCPP_SYCL_ARCH=sm_80 .. # compiles for the NVIDIA Ampere GPU architecture

$ cmake -DCMAKE_CXX_COMPILER=/path/to/dpcpp/clang++ -DCMAKE_C_COMPILER=/path/to/dpcpp/clang -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=intel_gpu_pvc .. # compiles for the Intel Xe core architecture
```
A complete example (running on the Intel Data Center GPU Max 1100):

```bash
$ cmake -DCMAKE_CXX_COMPILER=/path/to/dpcpp/clang++ -DCMAKE_C_COMPILER=/path/to/dpcpp/clang -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=intel_gpu_pvc ..

$ make pvc_gemm

$ ./examples/sycl/pvc/pvc_gemm

Disposition: Passed
Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [225.773]TFlop/s (0.7609)ms
```
More examples for Intel GPUs can be found in the [SYCL examples folder](../../examples/sycl/pvc/).

A complete example when running on an A100, using the SYCL backend:

```bash
$ cmake -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=nvptx64-nvidia-cuda -DDPCPP_SYCL_ARCH=sm_80 ..

$ make 14_ampere_tf32_tensorop_gemm_cute

$ ./examples/14_ampere_tf32_tensorop_gemm/14_ampere_tf32_tensorop_gemm_cute

Disposition: Passed
Problem Size: 5120x4096x4096x1
Avg runtime: 1.5232 ms
GFLOPS: 112788
```

## Supported CUTLASS and CUTE Examples
Currently, not all CUTLASS and CUTE examples are supported with the SYCL backend.
As of now, the following are supported:

CUTE examples:
* All the CUTE tutorials except `wgmma_sm90` are supported, of which:
  * `sgemm_1`, `sgemm_2`, and `tiled_copy` can run on any SYCL device.
  * `sgemm_sm80` and `sgemm_sm70` are Nvidia Ampere- and Volta-specific examples, respectively.

CUTLASS examples:
* Example 14 (`14_ampere_tf32_tensorop_gemm_cute`).
* We also provide various SYCL examples for the Intel Data Center GPU Max range of GPUs.

## SYCL Supported Architectures and APIs
At the time of writing, the SYCL backend supports all Nvidia architectures up to Ampere, as well as
the Intel Data Center GPU Max series of GPUs.

We support the `CollectiveMMA` and collective builder APIs for these architectures.

# References

[1] https://www.khronos.org/sycl/
45 changes: 44 additions & 1 deletion media/docs/quickstart.md
Expand Up @@ -11,14 +11,16 @@ CUTLASS requires:
- CMake 3.18+
- host compiler supporting C++17 or greater (minimum g++ 7.5.0)
- Python 3.6+
- For the SYCL backend, an installation of the open source `DPC++` compiler, which
can be found [here](https://github.com/intel/llvm)

CUTLASS may be optionally compiled and linked with
- cuBLAS
- cuDNN v7.6 or later

## Initial build steps

Construct a build directory and run CMake if using the CUDA toolchain.
```bash
$ export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc

@@ -27,6 +29,47 @@ $ mkdir build && cd build
$ cmake .. -DCUTLASS_NVCC_ARCHS=90a # compiles for NVIDIA Hopper GPU architecture
```

## Building and Running on the SYCL backend
To build with the Intel open-source `DPC++` compiler when using the SYCL backend:
```bash
$ mkdir build && cd build

$ cmake -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=nvptx64-nvidia-cuda -DDPCPP_SYCL_ARCH=sm_80 .. # compiles for the NVIDIA Ampere GPU architecture

$ cmake -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=intel_gpu_pvc .. # compiles for the Intel Xe Core Architecture
```
A complete example (running on the Intel Data Center GPU Max 1100):

```bash
$ cmake -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=intel_gpu_pvc ..

$ make pvc_gemm

$ ./examples/sycl/pvc/pvc_gemm

Disposition: Passed
Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [225.773]TFlop/s (0.7609)ms
```
More examples for Intel GPUs can be found in the [SYCL examples folder](../../examples/sycl/pvc/).

A complete example when running on an A100, using the SYCL backend:

```bash
$ cmake -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=nvptx64-nvidia-cuda -DDPCPP_SYCL_ARCH=sm_80 ..

$ make 14_ampere_tf32_tensorop_gemm_cute

$ ./examples/14_ampere_tf32_tensorop_gemm/14_ampere_tf32_tensorop_gemm_cute

Disposition: Passed
Problem Size: 5120x4096x4096x1
Avg runtime: 1.5232 ms
GFLOPS: 112788
```

### CUTLASS quick building tips

If your goal is strictly to build only the CUTLASS Profiler and to minimize compilation time, we suggest
executing the following CMake command in an empty `build/` directory.
```bash
Expand Down