
Update README #137

Open · wants to merge 13 commits into base: `sycl-develop`
68 changes: 0 additions & 68 deletions README-sycl.md

This file was deleted.

6 changes: 6 additions & 0 deletions README.md
@@ -41,6 +41,12 @@ and improves code composability and readability. More documentation specific to

In addition to GEMMs, CUTLASS implements high-performance convolution via the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution operation as a GEMM thereby taking advantage of CUTLASS's modular GEMM pipeline. This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components.

## CUTLASS with SYCL
The CUTLASS 3.0 API now also supports SYCL, allowing CUTLASS to run on Nvidia GPUs (up to the Ampere architecture) and Intel Xe core architecture GPUs via the SYCL backend, using the Intel open-source `DPC++` compiler.
Support is currently limited to GEMMs. See the [Quick Start Guide](./media/docs/build/building_with_sycl_support.md) for how to build and run
examples using the SYCL backend.

To learn more about SYCL and Intel's SYCL compilers, see [Intel oneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/data-parallel-c-plus-plus.html).

# What's New in CUTLASS 3.5

115 changes: 45 additions & 70 deletions media/docs/build/building_with_sycl_support.md
@@ -1,6 +1,3 @@

[README](../../../README.md#documentation) > **CUTLASS 3: Building with SYCL Support**

# Building with SYCL Support

Cutlass 3 can be built with SYCL using the DPC++ compiler, enabling Cutlass
@@ -12,93 +9,71 @@ code for heterogeneous and offload processors to be written with modern
ISO C++. It provides APIs and abstractions to find devices and manage
resources for GPUs.

## Installing the DPC++ Compiler
You can use a nightly build of the DPC++ compiler, with prebuilt packages available
[here](https://github.com/intel/llvm/releases), or build it from source as
[described here](https://github.com/intel/llvm/blob/sycl/sycl/doc/GetStartedGuide.md).

If building from source for the CUDA backend, a minimum CUDA toolkit version of 12.3 is recommended.
If using the pre-built nightlies, a nightly dated no older than 2024-08-02 is required.

## Building and Running on the SYCL backend
To build with the Intel open-source `DPC++` compiler when using the SYCL backend:
```bash
$ mkdir build && cd build

$ cmake -DCMAKE_CXX_COMPILER=/path/to/dpcpp/clang++ -DCMAKE_C_COMPILER=/path/to/dpcpp/clang -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=nvptx64-nvidia-cuda -DDPCPP_SYCL_ARCH=sm_80 .. # compiles for the NVIDIA Ampere GPU architecture

$ cmake -DCMAKE_CXX_COMPILER=/path/to/dpcpp/clang++ -DCMAKE_C_COMPILER=/path/to/dpcpp/clang -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=intel_gpu_pvc .. # compiles for the Intel Xe core architecture
```
A complete example (running on the Intel Data Center GPU Max 1100):

```bash
$ cmake -DCMAKE_CXX_COMPILER=/path/to/dpcpp/clang++ -DCMAKE_C_COMPILER=/path/to/dpcpp/clang -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=intel_gpu_pvc ..

$ make pvc_gemm

$ ./examples/sycl/pvc/pvc_gemm

Disposition: Passed
Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [225.773]TFlop/s (0.7609)ms
```
More examples for Intel GPUs can be found in the [SYCL examples folder](../../examples/sycl/pvc/).

A complete example when running on an A100, using the SYCL backend:

```bash
$ cmake -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=nvptx64-nvidia-cuda -DDPCPP_SYCL_ARCH=sm_80 ..

$ make 14_ampere_tf32_tensorop_gemm_cute

$ ./examples/14_ampere_tf32_tensorop_gemm/14_ampere_tf32_tensorop_gemm_cute

Disposition: Passed
Problem Size: 5120x4096x4096x1
Avg runtime: 1.5232 ms
GFLOPS: 112788
```

## Supported CUTLASS and CUTE Examples
Currently, not all CUTLASS and CUTE examples are supported with the SYCL backend.
As of now, the following are supported:

CUTE examples:
* All the CUTE tutorials except `wgmma_sm90` are supported, of which:
  * `sgemm_1`, `sgemm_2`, and `tiled_copy` can run on any SYCL device.
  * `sgemm_sm80` and `sgemm_sm70` are Nvidia Ampere- and Volta-specific examples, respectively.

CUTLASS examples:
* Example 14 (`14_ampere_tf32_tensorop_gemm_cute`).
* We also provide various SYCL examples for the Intel Data Center GPU Max range of GPUs.

## SYCL Supported Architectures and APIs
At the time of writing, the SYCL backend supports all Nvidia architectures up to Ampere, as well as
the Intel Data Center GPU Max series of GPUs.

We support the `CollectiveMMA` and collective builder APIs for these architectures.

# References

[1] https://www.khronos.org/sycl/
45 changes: 44 additions & 1 deletion media/docs/quickstart.md
Expand Up @@ -11,14 +11,16 @@ CUTLASS requires:
- CMake 3.18+
- host compiler supporting C++17 or greater (minimum g++ 7.5.0)
- Python 3.6+
- For the SYCL backend, an installation of the open source `DPC++` compiler, which
can be found [here](https://github.com/intel/llvm)

CUTLASS may be optionally compiled and linked with
- cuBLAS
- cuDNN v7.6 or later

## Initial build steps

Construct a build directory and run CMake if using the CUDA toolchain.
```bash
$ export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc

@@ -27,6 +29,47 @@ $ mkdir build && cd build
$ cmake .. -DCUTLASS_NVCC_ARCHS=90a # compiles for NVIDIA Hopper GPU architecture
```

## Building and Running on the SYCL backend
To build with the Intel open-source `DPC++` compiler when using the SYCL backend:
```bash
$ mkdir build && cd build

$ cmake -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=nvptx64-nvidia-cuda -DDPCPP_SYCL_ARCH=sm_80 .. # compiles for the NVIDIA Ampere GPU architecture

$ cmake -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=intel_gpu_pvc .. # compiles for the Intel Xe Core Architecture
```
A complete example (running on the Intel Data Center GPU Max 1100):

```bash
$ cmake -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=intel_gpu_pvc ..

$ make pvc_gemm

$ ./examples/sycl/pvc/pvc_gemm

Disposition: Passed
Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [225.773]TFlop/s (0.7609)ms
```
More examples for Intel GPUs can be found in the [SYCL examples folder](../../examples/sycl/pvc/).

A complete example when running on an A100, using the SYCL backend:

```bash
$ cmake -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET=nvptx64-nvidia-cuda -DDPCPP_SYCL_ARCH=sm_80 ..

$ make 14_ampere_tf32_tensorop_gemm_cute

$ ./examples/14_ampere_tf32_tensorop_gemm/14_ampere_tf32_tensorop_gemm_cute

Disposition: Passed
Problem Size: 5120x4096x4096x1
Avg runtime: 1.5232 ms
GFLOPS: 112788
```

### CUTLASS quick building tips

If your goal is strictly to build only the CUTLASS Profiler and to minimize compilation time, we suggest
executing the following CMake command in an empty `build/` directory.
```bash
Expand Down