
Update README #137

Open
wants to merge 13 commits into base: sycl-develop
68 changes: 0 additions & 68 deletions README-sycl.md

This file was deleted.

6 changes: 6 additions & 0 deletions README.md
@@ -41,6 +41,12 @@ and improves code composability and readability. More documentation specific to

In addition to GEMMs, CUTLASS implements high-performance convolution via the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution operation as a GEMM thereby taking advantage of CUTLASS's modular GEMM pipeline. This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components.

## CUTLASS with SYCL
The CUTLASS 3.0 API now also supports SYCL and can run on Nvidia GPUs (up to the Ampere architecture) and Intel PVC GPUs through the SYCL backend, using the Intel open source `DPC++` compiler.
Support is currently limited to GEMMs. See the [Quick Start Guide](./media/docs/build/building_with_sycl_support.md) on how to build and run
examples using the SYCL backend.

To learn more about SYCL and the Intel DPC++ compiler, see [Intel oneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/data-parallel-c-plus-plus.html).
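A minimal sketch of a SYCL-enabled build (the compiler setup and CMake flags below mirror the build guide; paths and the choice of target are placeholders for your own environment):

```bash
# Use the DPC++ clang drivers (assumed to be on PATH after installing DPC++).
export CC=clang
export CXX=clang++

mkdir build && cd build
# Pick one target: Intel PVC ...
cmake -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=intel_gpu_pvc ..
# ... or Nvidia Ampere (SM80)
# cmake -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=nvptx64-nvidia-cuda -DDPCPP_SYCL_ARCH=sm_80 ..
```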

# What's New in CUTLASS 3.5

111 changes: 42 additions & 69 deletions media/docs/build/building_with_sycl_support.md
@@ -1,6 +1,3 @@

[README](../../../README.md#documentation) > **CUTLASS 3: Building with Clang as Host Compiler**

# Building with SYCL Support

Cutlass 3 can be built with SYCL using the DPC++ compiler, enabling Cutlass
@@ -12,93 +9,69 @@ code for heterogeneous and offload processors to be written with modern
ISO C++. It provides APIs and abstractions to find devices and manage
resources for GPUs.

## Limitations

Currently, it's only possible to build five examples in the Cute
tutorial and a reduced number of Cute tests.

### Cute Tutorial Examples Supported

* `sgemm_1`, `sgemm_2` and `tiled_copy`: Generic examples that should run on any
SYCL-enabled device. Tested on Nvidia SM80 devices and Intel PVC
and Arc devices.
* `sgemm_sm70`: Nvidia SM70 specific example.
* `sgemm_sm80`: Nvidia SM80 specific example.

## Installing the DPC++ Compiler

You can use a nightly build of the DPC++ compiler, with prebuilt packages available
[here](https://github.com/intel/llvm/releases), or build the compiler from source as
[described here](https://github.com/intel/llvm/blob/sycl/sycl/doc/GetStartedGuide.md).

If building from source for the CUDA backend, a minimum CUDA toolkit version of 12.3 is recommended.
If using the prebuilt nightlies, a nightly dated no older than 2024-08-02 is required.
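Before configuring a build, it is worth checking that the intended compiler is the one being picked up and that it can see your GPU. A small sanity check, assuming the DPC++ `clang++` and the `sycl-ls` utility from the DPC++ package are on your `PATH` (both are assumptions about your installation):

```bash
# Confirm which clang++ will be used and that it is a recent enough DPC++ build.
which clang++
clang++ --version

# List the SYCL devices visible to the runtime (e.g. an Intel PVC or an Nvidia GPU).
sycl-ls
```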
## Software Prerequisites

To build CUTLASS with SYCL support, you need the latest version of
the DPC++ compiler. You can either use a recent
[nightly build](https://github.com/intel/llvm/releases)
(see the [Setup DPC++ Nightly Build](#setup-dpc-nightly-build) section)
or build the compiler from source as described in the
[oneAPI DPC++ guideline](https://github.com/intel/llvm/blob/sycl/sycl/doc/GetStartedGuide.md#build-dpc-toolchain-with-support-for-nvidia-cuda).

If building for an Nvidia GPU, the CUDA Toolkit will be required
(tested with version 12.4).

CMake (at least version 3.18), Ninja, git, and Python
(at least version 3.6) are also required.

### Setup DPC++ Nightly Build

To install the nightly build, download DPC++ from the
[nightly build](https://github.com/intel/llvm/releases). The minimum version
required is nightly 2024-07-19.

```bash
$ export DPCPP_HOME=/opt/intel/dpcpp-nightly
$ mkdir -p $DPCPP_HOME
$ cd $DPCPP_HOME
$ wget https://github.com/intel/llvm/releases/download/nightly-2024-07-19/sycl_linux.tar.gz
$ tar -zxvf sycl_linux.tar.gz
$ export PATH=$DPCPP_HOME/bin:$PATH
$ export LD_LIBRARY_PATH=$DPCPP_HOME/lib:$LD_LIBRARY_PATH
$ export C_INCLUDE_PATH=$DPCPP_HOME/include:$C_INCLUDE_PATH
$ export CPLUS_INCLUDE_PATH=$DPCPP_HOME/include:$CPLUS_INCLUDE_PATH
$ export CC=clang
$ export CXX=clang++
```

## Building and Running on the SYCL backend
To build with the Intel open source `DPC++` compiler when using the SYCL backend:

```bash
$ mkdir build && cd build

# set the CC and CXX compilers
export CC=/path/to/dpcpp/clang
export CXX=/path/to/dpcpp/clang++

# compiles for the NVIDIA Ampere GPU architecture
$ cmake -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=nvptx64-nvidia-cuda -DDPCPP_SYCL_ARCH=sm_80 ..

# compiles for the Intel Xe Core architecture
$ cmake -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=intel_gpu_pvc ..
```

A complete example, running on the Intel Data Center Max 1100, is as follows:

```bash
$ cmake -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=intel_gpu_pvc ..

$ make pvc_gemm

$ ./examples/sycl/pvc/pvc_gemm
```

More examples for the Intel GPU can be found in the [sycl examples folder](../../examples/sycl/pvc/).
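On machines where several backends or devices are visible, it can help to pin a run to a specific device. A sketch of one way to do this with recent DPC++ runtimes, via the `ONEAPI_DEVICE_SELECTOR` environment variable (the exact selector strings depend on your DPC++ version and installed plugins, so treat these values as examples; the binary path is from the build above):

```bash
# Run the PVC GEMM example on the Level Zero (Intel GPU) backend only.
ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./examples/sycl/pvc/pvc_gemm

# On an Nvidia machine with the CUDA plugin installed, select the CUDA backend instead:
# ONEAPI_DEVICE_SELECTOR=cuda:gpu ./examples/14_ampere_tf32_tensorop_gemm/14_ampere_tf32_tensorop_gemm_cute
```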

# Running CMake

## Required CMake options

The SYCL build requires specifying the following CMake options and
environmental variables. Replace `<path-to-clang>` and `<path-to-clang++>`
with the path to your clang and clang++ executables. You may use `clang`
and `clang++` directly if they are in your `PATH`.
```bash
$ export CC=<path-to-clang>/clang
$ export CXX=<path-to-clang++>/clang++
```

### CMake options for Nvidia devices

* `DPCPP_SYCL_TARGET=nvptx64-nvidia-cuda` sets the triplet target for Nvidia GPUs.
* `DPCPP_SYCL_ARCH=sm_80` sets the device architecture to SM_80.

```bash
$ cmake -G Ninja .. \
  -DCUTLASS_ENABLE_SYCL=ON \
  -DDPCPP_SYCL_TARGET=nvptx64-nvidia-cuda \
  -DDPCPP_SYCL_ARCH=sm_80
```

Building Cutlass with SYCL support on Nvidia devices has been tested
on an A100 device with `DPCPP_SYCL_ARCH=sm_80`.

A complete example when running on an A100, using the SYCL backend:

```bash
$ cmake -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=nvptx64-nvidia-cuda -DDPCPP_SYCL_ARCH=sm_80 ..

$ make 14_ampere_tf32_tensorop_gemm_cute

$ ./examples/14_ampere_tf32_tensorop_gemm/14_ampere_tf32_tensorop_gemm_cute
```
### CMake options for Intel devices

* `DPCPP_SYCL_TARGET=intel_gpu_pvc` sets the triplet target for Intel PVC GPUs.

```bash
$ cmake -G Ninja .. \
  -DCUTLASS_ENABLE_SYCL=ON \
  -DDPCPP_SYCL_TARGET=intel_gpu_pvc
```

## Supported CUTLASS and CUTE Examples
Currently, not all CUTLASS and CUTE examples are supported with the SYCL backend.
As of now, the following are supported:

CUTE Examples <br>
* All the CUTE tutorials except `wgmma_sm90` are supported, of which:
  * `sgemm_1`, `sgemm_2`, and `tiled_copy` can run on any SYCL device
  * `sgemm_sm80` and `sgemm_sm70` are Nvidia Ampere and Volta specific examples, respectively.

CUTLASS Examples <br>
* Example 14 (the `14_ampere_tf32_tensorop_gemm_cute` target)
* We also provide various SYCL examples for the Intel Data Center Max range of GPUs.

## SYCL Supported Architectures
At the time of writing, the SYCL backend supports all Nvidia architectures up to Ampere,
as well as the Intel Data Center Max series of GPUs.

# References

[1] https://www.khronos.org/sycl/