Introduction

This is my final project for the GPU course MICS600J. Its main content is my attempt to implement the attention mechanism efficiently on the GPU.

Parameters

To simplify the implementation, I replaced the original matrix dimensions [batch_size, nheads, seq_len, headdim] with [seq_len, headdim].

N: seq_len, d: headdim
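
For concreteness, here is a minimal host-side sketch of the simplified shapes, assuming fp16 storage; the variable names and sizes are illustrative assumptions, not the repository's actual code.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int N = 1024;                     // seq_len
    const int d = 64;                       // headdim
    half *In, *WQ, *WK, *WV, *Q, *K, *V;
    cudaMalloc(&In, N * d * sizeof(half));  // input:       [N, d]
    cudaMalloc(&WQ, d * d * sizeof(half));  // projections: [d, d]
    cudaMalloc(&WK, d * d * sizeof(half));
    cudaMalloc(&WV, d * d * sizeof(half));
    cudaMalloc(&Q,  N * d * sizeof(half));  // activations: [N, d]
    cudaMalloc(&K,  N * d * sizeof(half));
    cudaMalloc(&V,  N * d * sizeof(half));
    // ... run the attention kernels, then free the buffers ...
    return 0;
}
```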

Algorithm process / matrix dimensions

The attention mechanism itself is well known, so I won't go into details; a naive reference sketch follows the list below.

  1. In = N * d
  2. WQ, WK, WV = d * d
  3. Q, K, V = In * WQ / WK / WV = N * d
  4. P = Q * K^T = N * N
  5. S = SoftMax(P) = N * N
  6. O = S * V = N * d
  7. Out = O * WO = N * d (WO = d * d)
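
As a point of reference for these steps (and not the optimized kernel in this repository), below is a minimal naive CUDA sketch of steps 4–6 for a single [N, d] head, in fp32 for clarity, with one thread per query row. The kernel name and the external N x N scratch buffer S are illustrative assumptions; it also applies the usual 1/sqrt(d) scaling, which the list above leaves implicit.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Naive reference: one thread per query row computes
// P = Q*K^T, S = softmax(P), O = S*V.
// S is an N x N scratch buffer in global memory (illustrative layout).
__global__ void naive_attention(const float* Q, const float* K, const float* V,
                                float* S, float* O, int N, int d) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N) return;

    // Step 4: P[row, j] = Q[row, :] . K[j, :], scaled by 1/sqrt(d).
    float row_max = -INFINITY;
    for (int j = 0; j < N; ++j) {
        float acc = 0.0f;
        for (int k = 0; k < d; ++k)
            acc += Q[row * d + k] * K[j * d + k];
        S[row * N + j] = acc / sqrtf((float)d);
        row_max = fmaxf(row_max, S[row * N + j]);
    }

    // Step 5: S[row, :] = softmax(P[row, :]), max-subtracted for stability.
    float row_sum = 0.0f;
    for (int j = 0; j < N; ++j) {
        S[row * N + j] = expf(S[row * N + j] - row_max);
        row_sum += S[row * N + j];
    }

    // Step 6: O[row, :] = S[row, :] * V, normalized by the softmax sum.
    for (int k = 0; k < d; ++k) {
        float acc = 0.0f;
        for (int j = 0; j < N; ++j)
            acc += S[row * N + j] * V[j * d + k];
        O[row * d + k] = acc / row_sum;
    }
}

// Example launch: naive_attention<<<(N + 127) / 128, 128>>>(Q, K, V, S, O, N, d);
```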

Technical Highlights

  1. Use Tensor Cores to compute the GEMMs (see the sketch after this list).
  2. Use asynchronous transfers to overlap computation with data movement from global memory to shared memory.
  3. Bank-conflict-free shared-memory layout.
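
To illustrate how these three points fit together, here is a minimal warp-level sketch, not the repository's actual kernel: one warp stages 16x16 fp16 tiles into padded shared memory with asynchronous copies (`__pipeline_memcpy_async`, i.e. cp.async) and multiplies them on the Tensor Cores through the WMMA API. The kernel name, tile sizes, and launch configuration are illustrative assumptions; it assumes row-major fp16 inputs with M, N, K multiples of 16.

```cuda
#include <mma.h>
#include <cuda_pipeline.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// WMMA tile shape for fp16 inputs with fp32 accumulation.
constexpr int TM = 16, TN = 16, TK = 16;
// Pad the leading dimension by 8 halves (16 bytes) so consecutive rows
// do not map to the same shared-memory banks.
constexpr int PAD = 8;

// One warp (32 threads) per block computes one 16x16 tile of C = A * B.
// A: M x K, B: K x N, both row-major half; C: M x N float.
__global__ void wmma_tile_gemm(const half* A, const half* B, float* C,
                               int M, int N, int K) {
    __shared__ __align__(16) half a_tile[TM][TK + PAD];
    __shared__ __align__(16) half b_tile[TK][TN + PAD];

    const int tile_row = blockIdx.y * TM;   // top-left corner of this C tile
    const int tile_col = blockIdx.x * TN;
    const int lane = threadIdx.x;           // 0..31
    const int r = lane >> 1;                // tile row handled by this lane
    const int c8 = (lane & 1) * 8;          // 8-half (16-byte) chunk offset

    wmma::fragment<wmma::matrix_a, TM, TN, TK, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, TM, TN, TK, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, TM, TN, TK, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k0 = 0; k0 < K; k0 += TK) {
        // Highlight 2: asynchronous global->shared copies (cp.async).
        // Each lane issues one 16-byte copy per tile: 32 lanes cover
        // 16 rows x 2 chunks.
        __pipeline_memcpy_async(&a_tile[r][c8],
                                &A[(tile_row + r) * K + k0 + c8], 16);
        __pipeline_memcpy_async(&b_tile[r][c8],
                                &B[(k0 + r) * N + tile_col + c8], 16);
        __pipeline_commit();
        __pipeline_wait_prior(0);   // overlap requires deeper multi-stage buffering
        __syncwarp();

        // Highlight 1: feed the staged (padded) tiles to the Tensor Cores.
        wmma::load_matrix_sync(a_frag, &a_tile[0][0], TK + PAD);
        wmma::load_matrix_sync(b_frag, &b_tile[0][0], TN + PAD);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        __syncwarp();
    }

    wmma::store_matrix_sync(&C[tile_row * N + tile_col], c_frag, N,
                            wmma::mem_row_major);
}

// Example launch: dim3 grid(N / TN, M / TM); wmma_tile_gemm<<<grid, 32>>>(A, B, C, M, N, K);
```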

Build

Run `make` to build the program.

Experiment

  1. Range for N, d: N in (32, 1024), d in (32, 2048). See the test script for details.
  2. Tested on an NVIDIA A100 on the HKUST(GZ) HPC server.

Done

When fine-tuning Llama-2-7B with the sparse attention mechanism, we found that accuracy can be restored with little overhead.

Doing...

  1. Kernel fusion, as in FlashAttention.
  2. A sparse attention mechanism, as in DFSS, to make full use of the Sparse Tensor Cores.