These examples demonstrate the implementation of general matrix multiplication (GEMM) kernels using CuTe. They follow the cuBLAS interface design and support matrices that are either transposed or not transposed. The naive kernels perform boundary checks and can be used for matrices of any dimension sizes. The kernels that perform tiled copy assume that each matrix dimension size is a multiple of a certain size, usually 32, depending on the data type.
The performance was mainly optimized for the NT case, where the matrix A is an M x K column-major matrix, the matrix B is a K x N row-major matrix, and the matrix C is an M x N column-major matrix. I will try to come up with a general implementation that is optimized for all of the NN, NT, TN, and TT cases in the future. One idea is to use shared memory layouts that are compatible with the global memory layouts so that vectorized global memory access can always be enabled.
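To make the NT case concrete, the three operands can be described as CuTe tensors whose strides encode the column-major and row-major storage. The snippet below is a minimal sketch under these layout assumptions; the function and variable names are illustrative, not the repository's actual code.

```cpp
#include <cute/tensor.hpp>

// A minimal sketch (not the repository's actual code) of the NT operand
// layouts expressed as CuTe tensors over global memory.
template <class T>
void print_nt_global_layouts(T const* A, T const* B, T* C, int M, int N,
                             int K)
{
    using namespace cute;

    // A is M x K column-major: consecutive elements run along the M mode.
    auto tensor_A = make_tensor(make_gmem_ptr(A),
                                make_layout(make_shape(M, K),
                                            make_stride(1, M)));
    // B is K x N row-major: consecutive elements run along the N mode.
    auto tensor_B = make_tensor(make_gmem_ptr(B),
                                make_layout(make_shape(K, N),
                                            make_stride(N, 1)));
    // C is M x N column-major: consecutive elements run along the M mode.
    auto tensor_C = make_tensor(make_gmem_ptr(C),
                                make_layout(make_shape(M, N),
                                            make_stride(1, M)));
    // In the NT case, the contiguous mode of each operand lines up with the
    // mode the threads copy along, which is what makes vectorized global
    // memory access straightforward to enable.
    print(tensor_A.layout()); print("\n");
    print(tensor_B.layout()); print("\n");
    print(tensor_C.layout()); print("\n");
}
```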
High-performance CUDA kernels are usually implemented in a way that is specialized for certain problem sizes, and generic CUDA kernels might not be able to achieve the best performance. So instead of using one generic CUDA kernel to solve all the problems, accelerated computing libraries usually provide a set of specialized CUDA kernels, each of which is either optimized for a particular range of problem sizes or imposes strict assumptions and requirements on the problem sizes.
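For example, the boundary checks that make a kernel generic can be compiled out when the problem size is guaranteed to satisfy the kernel's divisibility requirements. The sketch below is a simplified illustration of this trade-off, not the repository's actual code; the template parameter mirrors the effect of the NO_BOUNDS_CHECK CMake option used below.

```cpp
// A simplified illustration: the same kernel compiled with and without
// boundary checks. When BOUNDS_CHECK is false, the predicate (and the
// registers and branches it costs) disappears at compile time, but the
// kernel then requires M and N to be multiples of the tile sizes.
template <int TILE_M, int TILE_N, bool BOUNDS_CHECK>
__global__ void scale_matrix(float* X, float alpha, int M, int N)
{
    int const m{static_cast<int>(blockIdx.y * TILE_M + threadIdx.y)};
    int const n{static_cast<int>(blockIdx.x * TILE_N + threadIdx.x)};
    if constexpr (BOUNDS_CHECK)
    {
        // Generic kernel: any M and N are valid.
        if (m >= M || n >= N)
        {
            return;
        }
    }
    // Specialized kernel: out-of-bounds threads never exist by assumption.
    X[static_cast<size_t>(n) * M + m] *= alpha;
}
```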
To build specialized CUDA kernels for performance measurements, please run the following commands.
$ export NUM_CMAKE_JOBS=4
$ cmake -B build -DNO_BOUNDS_CHECK=ON
$ cmake --build build --config Release --parallel ${NUM_CMAKE_JOBS}
To build generic CUDA kernels which are suitable for a wide range of problem sizes, please run the following commands.
$ export NUM_CMAKE_JOBS=4
$ cmake -B build -DNO_BOUNDS_CHECK=OFF
$ cmake --build build --config Release --parallel ${NUM_CMAKE_JOBS}
To run the unit tests, please run the following command.
$ ctest --test-dir build/ --tests-regex "TestAllGeneralMatrixMultiplication.*" --verbose
The following tests passed:
TestAllGeneralMatrixMultiplicationNaive
TestAllGeneralMatrixMultiplicationNaiveTiledCopyTiledMma
TestAllGeneralMatrixMultiplicationTensorCoreTiledCopyTiledMma
100% tests passed, 0 tests failed out of 3
The performance was measured using the specialized CUDA kernels.
$ ctest --test-dir build/ --tests-regex "ProfileAllGeneralMatrixMultiplication.*" --verbose
Only the general matrix multiplication kernels that consume half-typed data and compute in half precision are documented here, for a problem size of 4096 x 4096 x 4096 on an NVIDIA GeForce RTX 3090.
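The cuBLAS baseline in the table presumably boils down to a half-precision `cublasGemmEx` call along the following lines. This is a hedged sketch rather than the measurement harness's actual code; the algorithm selection and the alpha and beta values are assumptions.

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// A hedged sketch of a half-in, half-compute NT GEMM baseline via cuBLAS.
// Computes C = alpha * A * B^T + beta * C in column-major storage.
cublasStatus_t gemm_nt_fp16(cublasHandle_t handle, int m, int n, int k,
                            __half const* A, int lda, __half const* B,
                            int ldb, __half* C, int ldc)
{
    __half const alpha{__float2half(1.0f)};
    __half const beta{__float2half(0.0f)};
    // Trans A = N: A is an m x k column-major matrix.
    // Trans B = T: B is stored as an n x k column-major matrix, which is
    // the same memory as a k x n row-major matrix.
    return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k, &alpha,
                        A, CUDA_R_16F, lda, B, CUDA_R_16F, ldb, &beta, C,
                        CUDA_R_16F, ldc, CUBLAS_COMPUTE_16F,
                        CUBLAS_GEMM_DEFAULT);
}
```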
| Kernel Name | Trans A | Trans B | Latency (ms) | TOPs | Performance vs. cuBLAS (%) |
| --- | --- | --- | --- | --- | --- |
| cuBLAS | N | T | 1.12927 | 121.706 | - |
| Naive | N | T | 15.1479 | 9.07312 | 6.96212 |
| Gmem Tiled Copy + Tiled MMA | N | T | 8.02777 | 17.1204 | 14.067 |
| Gmem Tiled Copy + Tiled MMA + SM80 Tensor Core | N | T | 2.29437 | 59.9026 | 56.0556 |
| Gmem Tiled Copy + Tiled MMA + SM80 Tensor Core + SM80 Pipeline | N | T | 1.63768 | 83.9228 | 77.8466 |
| Gmem Tiled Copy + Smem Tiled Copy + Tiled MMA + SM80 Tensor Core | N | T | 1.93464 | 71.041 | 65.1556 |
| Gmem Tiled Copy + Smem Tiled Copy + Tiled MMA + SM80 Tensor Core + SM80 Pipeline | N | T | 1.16808 | 117.663 | 108.67 |
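The TOPs column follows directly from the standard GEMM operation count: a GEMM of size M x N x K performs 2 x M x N x K floating-point operations (one multiply and one add per inner-product term), so TOPs = 2 x M x N x K / latency / 10^12. A small host-side check, reproducing the cuBLAS row:

```cpp
#include <cstdio>

int main()
{
    double const M{4096.0}, N{4096.0}, K{4096.0};
    double const latency_ms{1.12927}; // cuBLAS NT latency from the table.
    // 2 * M * N * K floating-point operations per GEMM.
    double const tops{2.0 * M * N * K / (latency_ms * 1.0e-3) / 1.0e12};
    std::printf("%.3f TOPs\n", tops); // Prints 121.706 TOPs.
    return 0;
}
```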
To profile the kernels using NVIDIA Nsight Compute, please run the following commands.
$ for file in build/examples/cute_general_matrix_multiplication/tests/profile_*; do
    filename=$(basename -- "$file")
    ncu --set full -f -o ncu_reports/"$filename" "$file"
done
To check the kernels for memory errors and leaks using Compute Sanitizer, please run the following commands.
$ for file in build/examples/cute_general_matrix_multiplication/tests/test_*; do
    compute-sanitizer --leak-check full "$file"
done