
ccglib


The Complex Common GEMM Library (ccglib) provides a simple C++ interface for complex-valued matrix multiplication on GPU tensor and matrix cores, supporting both CUDA and HIP.

Requirements

  • NVIDIA: Any GPU with tensor cores and support for asynchronous memory copies, i.e. Ampere generation or newer.
  • AMD: Any GPU with matrix cores, i.e. CDNA1 or newer, RDNA3 or newer.
| Software | Minimum version |
| -------- | --------------- |
| CUDA     | 11.0            |
| ROCm     | 6.1             |
| CMake    | 3.20            |

Note: Certain input/output types are only supported by specific GPU architectures, see the table below for details.

Installation

CMake is used to build ccglib. It can either be built as a standalone library, or used in another project as an external dependency through CMake. To build ccglib locally, run:

git clone https://git.astron.nl/RD/recruit/ccglib
cd ccglib
cmake -S . -B build
make -C build

To use ccglib as an external dependency, add the following to the CMakeLists.txt file of your project:

include(FetchContent)

FetchContent_Declare(
  ccglib
  GIT_REPOSITORY https://git.astron.nl/RD/recruit/ccglib
  GIT_TAG main)
FetchContent_MakeAvailable(ccglib)

Then link ccglib into your executable or library with:

target_link_libraries(<your_target> ccglib)

The following build options are available:

| Option | Description | Default |
| ------ | ----------- | ------- |
| CCGLIB_BACKEND | GPU backend API to use, either CUDA or HIP | CUDA |
| CCGLIB_BUILD_TESTING | Build the test suite. In HIP mode, it may be required to use hipcc as the host compiler. | OFF |
| CCGLIB_BUILD_BENCHMARK | Build the benchmark suite | OFF |
| CCGLIB_BENCHMARK_WITH_PMT | Enable Power Measurement Toolkit support in the benchmark suite | OFF |
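
For example, to configure a HIP build with the test suite enabled (using hipcc as the host compiler, which may be required in HIP mode as noted above), the options are passed at configure time:

cmake -S . -B build -DCCGLIB_BACKEND=HIP -DCCGLIB_BUILD_TESTING=ON -DCMAKE_CXX_COMPILER=hipcc
make -C build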

Supported data types and matrix layouts

ccglib supports a range of input/output data types, depending on the available hardware:

| Input type | Output type | NVIDIA | AMD | Notes |
| ---------- | ----------- | ------ | --- | ----- |
| float8e4m3 | float32 | Ada or newer | CDNA3 and RDNA4 only | On AMD, only RDNA4 implements float8 in hardware |
| bfloat16 | bfloat16/float32 | - | float32 output only | - |
| float16 | float32/float16 | - | - | - |
| float32 | float32/bfloat16*/float16* | - | CDNA only | - |
| tensorfloat | float32/float16* | Ampere or newer | - | Input data must be in float32 format; conversion to tensorfloat is automatic |
| int1 | int32 | - | - | Input bits must be packed into int32 values; ccglib provides a tool to do this |

* bfloat16/float16 output is native float32 output downcast to bfloat16/float16.

With matrix-matrix multiplication defined as C = A x B, ccglib requires the A matrix to be in row-major format and the B matrix to be in column-major format. The C matrix can be either row-major or column-major.

The real and imaginary samples can be either interleaved (i.e. the complex axis is the fastest-changing axis) or planar (i.e. the complex axis is the slowest-changing axis of a single matrix).
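
As an illustration (plain C++ index arithmetic, not a ccglib API), a complex sample at row r, column c of a row-major M x K matrix lives at the following offsets in the two layouts:

#include <cstddef>

// Interleaved: complex axis is fastest changing, shape M x K x COMPLEX.
// Component 0 is the real part, component 1 the imaginary part.
size_t interleaved_index(size_t r, size_t c, size_t component, size_t K) {
  return (r * K + c) * 2 + component;
}

// Planar: complex axis is slowest changing, shape COMPLEX x M x K.
size_t planar_index(size_t r, size_t c, size_t component, size_t M, size_t K) {
  return component * M * K + r * K + c;
}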

Two variants of the GEMM are provided: a basic and an optimized version. The basic GEMM requires planar input and produces planar output. The optimized GEMM uses a more complicated input format; a transpose operation is provided to convert input matrices from either interleaved or planar format into the format the optimized GEMM requires. Its output can be either planar or interleaved, with planar providing the best performance.

ccglib supports running multiple GEMM operations at once through a batch size parameter. The input matrices must be stored contiguously in device memory, and the output is likewise a set of matrices contiguous in memory.

As an example, consider a row-major A matrix of M rows and K columns, a column-major B matrix of K rows and N columns, and a resulting row-major C matrix of M rows and N columns. With planar complex samples, the shapes of the matrices for a basic GEMM are as follows:

  • A: BATCH x COMPLEX x M x K
  • B: BATCH x COMPLEX x N x K
  • C: BATCH x COMPLEX x M x N
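
The flat offset of an element follows directly from these shapes, with the batch index as the slowest-changing axis and COMPLEX equal to 2. A minimal sketch in plain C++:

#include <cstddef>

// Offset of element (batch, component, r, c) in the BATCH x COMPLEX x M x K
// A matrix. B (BATCH x COMPLEX x N x K) and C (BATCH x COMPLEX x M x N)
// follow the same pattern with their own inner dimensions.
size_t a_index(size_t batch, size_t component, size_t r, size_t c,
               size_t M, size_t K) {
  return ((batch * 2 + component) * M + r) * K + c;
}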

Example usage

Refer to the examples folder for typical usage examples.

ccglib uses cudawrappers to provide a unified interface to CUDA and HIP. Refer to the cudawrappers documentation for more details.
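
To give a rough feel for what usage looks like, below is a sketch of a basic planar GEMM. The ccglib class name, constructor arguments, and enum values are assumptions made for illustration only, not a verified interface; the cudawrappers setup follows that library's documented style. The examples folder remains the authoritative reference.

#include <ccglib/ccglib.hpp>  // header path is an assumption
#include <cudawrappers/cu.hpp>

int main() {
  cu::init();  // initialize the driver API
  cu::Device device(0);
  cu::Context context(CU_CTX_SCHED_BLOCKING_SYNC, device);
  cu::Stream stream;

  const size_t batch = 1, M = 1024, N = 1024, K = 1024;

  // Planar complex buffers, BATCH x COMPLEX x rows x cols,
  // float16 input / float32 output.
  cu::DeviceMemory d_a(batch * 2 * M * K * 2);  // 2 bytes per float16 sample
  cu::DeviceMemory d_b(batch * 2 * N * K * 2);
  cu::DeviceMemory d_c(batch * 2 * M * N * 4);  // 4 bytes per float32 sample

  // Hypothetical construction of a basic GEMM; the real signature may differ.
  ccglib::mma::GEMM gemm(batch, M, N, K, device, stream,
                         ccglib::ValueType::float16,  // assumed precision tag
                         ccglib::mma::basic);         // assumed variant tag
  gemm.Run(d_a, d_b, d_c);                            // assumed launch call
  stream.synchronize();
}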
