
toastts/nv-kernel-benchmarks


CUTLASS vs cuBLAS kernel implementation benchmarks

setup on SOL supercomputer

  • reserve hardware: interactive -c 4 -G a100:1 -t 120 --mem=32G
  • load modules
    • ml cuda-12.6.1-gcc-12.1.0
    • ml mamba/latest
  • set the NVIDIA compiler env var: export CUDACXX=${CUDA_HOME}/bin/nvcc
  • run nvidia-smi to dump device info and make sure an A100 is listed

setup locally

  • install the cuda toolkit
  • figure out your $CUDAPATH (/opt/cuda on many systems)
  • check CMakeLists.txt for the line set(CMAKE_CUDA_ARCHITECTURES 89) and make sure the value matches your GPU's compute capability
    • 89 is compute capability 8.9 with the dot dropped, so 8.0 (e.g. an A100) would be 80
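
The formatting rule above (8.9 becomes 89, 8.0 becomes 80) can be sketched as a small C++ helper. This is only an illustration: `cuda_arch_number` is a hypothetical name and is not part of this repository or of CMake.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not in this repo): turns a compute capability such as
// 8.9 into the integer form used by CMAKE_CUDA_ARCHITECTURES (89).
int cuda_arch_number(int major, int minor) {
    // The minor version is a single digit, so major * 10 + minor drops the dot.
    assert(major >= 1 && minor >= 0 && minor <= 9);
    return major * 10 + minor;
}

// Same conversion starting from the dotted string form, e.g. "8.0".
int cuda_arch_number(const std::string& capability) {
    const auto dot = capability.find('.');
    assert(dot != std::string::npos);  // expects "major.minor"
    const int major = std::stoi(capability.substr(0, dot));
    const int minor = std::stoi(capability.substr(dot + 1));
    return cuda_arch_number(major, minor);
}
```

For example, `cuda_arch_number(8, 0)` gives the value to put in CMakeLists.txt for a GPU with compute capability 8.0.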

building

  • mkdir build
  • configure with cmake -B build/ -S .
  • compile with cmake --build build/
  • now run ./build/gemm [0-5]
    • 0 is cuBLAS
    • 1 is simple gemm
    • 2 is global mem coalescing
    • 3 is 1D blocktiling
    • 4 is 2D blocktiling
    • 5 is vectorized mem accesses
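
As a rough sketch of how the [0-5] argument maps to the kernels listed above, the following C++17 snippet validates the argument and looks up a label. It is an assumption about how such a selector could be written, not the repository's actual code; `parse_kernel_id`, `kernel_name`, and `kKernelNames` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Labels copied from the list above; the array index is the CLI argument.
constexpr std::array<std::string_view, 6> kKernelNames = {
    "cuBLAS", "simple gemm", "global mem coalescing",
    "1D blocktiling", "2D blocktiling", "vectorized mem accesses"};

// Hypothetical parser: returns the kernel id if `arg` is a single digit 0-5,
// and std::nullopt for anything else.
std::optional<int> parse_kernel_id(std::string_view arg) {
    if (arg.size() != 1 || arg[0] < '0' || arg[0] > '5') return std::nullopt;
    return arg[0] - '0';
}

// Looks up the label for an already validated kernel id.
std::string_view kernel_name(int id) {
    assert(id >= 0 && id < static_cast<int>(kKernelNames.size()));
    return kKernelNames[static_cast<std::size_t>(id)];
}
```

A `main` could call `parse_kernel_id(argv[1])` and print a usage message when it returns `std::nullopt`, so that an out-of-range argument such as 6 is rejected before any kernel is launched.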


About

cuda kernel benchmarks, comparing CUTLASS kernels with custom optimizations to cuBLAS fp32
