Initialize the document for shared memory layout.
lcy-seso committed Jan 27, 2025
1 parent 25b3ce8 commit 04bd051
Showing 3 changed files with 20 additions and 2 deletions.
4 changes: 2 additions & 2 deletions benchmarks/cpp/g2s_copy/README.md
@@ -7,15 +7,15 @@ This preliminary test evaluates the performance of transferring a row-major data
Performance is assessed based on the total time required to complete the 100 data tile transfers.

### Implementations
The test includes implementations using TileFusion and CUTLASS, with no bank conflicts observed in the NVIDIA Compute Utility.
The test includes implementations using TileFusion and cutlass, with no bank conflicts observed in the NVIDIA Compute Utility. The cutlass implementation uses a copy plan that maximizes global memory coalescing to make the best use of global memory bandwidth.
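
For readers unfamiliar with what such a copy plan looks like, here is a minimal, hypothetical sketch in cutlass's CuTe dialect. It is not the benchmark's actual code; the thread and value layouts are assumptions chosen so that threads with consecutive IDs read consecutive 128-bit chunks of a row-major tile, which is what yields fully coalesced global loads.

```cpp
// nvcc -I<cutlass>/include -std=c++17 -c g2s_copy_plan_sketch.cu
#include <cute/tensor.hpp>

using namespace cute;

// Each thread copies one 128-bit vector of half-precision elements per instruction.
using CopyAtom = Copy_Atom<UniversalCopy<uint128_t>, half_t>;

// 32 threads arranged 4 x 8 in row-major order, each owning a 1 x 8 block of
// half elements (8 x 16 bits = 128 bits), so consecutive threads read
// consecutive 128-bit chunks of a row-major tile, i.e., fully coalesced loads.
inline auto make_g2s_copy_plan() {
    return make_tiled_copy(CopyAtom{},
                           Layout<Shape<_4, _8>, Stride<_8, _1>>{},  // thread layout
                           Layout<Shape<_1, _8>>{});                 // value layout
}
```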

### Test Environment
- **GPU**: NVIDIA Tesla A100
- **CUDA Version**: 12.6

### Results

| Shape | Warp Layout | TileFusion (ms) | CUTLASS (ms) | Ratio |
| Shape | Warp Layout | TileFusion (ms) | cutlass (ms) | Ratio (TileFusion/cutlass) |
|:--------------------|:-----------:|:---------------:|:------------:|:------:|
| RowMajor (64, 64) | (1, 1) | 0.05044 | 0.05058 | 0.9974 |
| RowMajor (64, 64) | (2, 2) | 0.05309 | 0.05085 | 1.044 |
1 change: 1 addition & 0 deletions docs/_static/README.md
@@ -0,0 +1 @@
[TBD]
17 changes: 17 additions & 0 deletions docs/tiles_in_shared_memory.md
@@ -0,0 +1,17 @@
## Data Layout for Efficient Shared Memory Access

### A Base Tile

First, we define a `BaseTile`, which is a two-dimensional collection of data that threads within a single warp access cooperatively, with each thread issuing a single data access instruction.

Let's look at a few examples. Suppose each thread accesses 128 bits of data in a single access and the data is in half-precision floating-point format, so each thread touches $128 / 16 = 8$ elements per access (a short sketch after the list below computes the resulting shapes).

- If the threads in a warp are arranged in a $4 \times 8$ fashion, then the `BaseTile` would have a shape of $4 \times 64$.
- If the threads in a warp are arranged in an $8 \times 4$ fashion, then the `BaseTile` would have a shape of $8 \times 32$.
- If the threads in a warp are arranged in a $16 \times 2$ fashion, then the `BaseTile` would have a shape of $16 \times 16$.
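
The mapping from warp arrangement to `BaseTile` shape is simple arithmetic. Below is a minimal, self-contained C++ sketch (not TileFusion's actual API; the constant names are made up for illustration) that reproduces the three shapes above, assuming 128-bit accesses of half-precision elements.

```cpp
#include <cstdio>

// Minimal sketch, not TileFusion's API: derive a BaseTile's shape from the
// warp's thread arrangement, assuming each thread issues one 128-bit
// (16-byte) vectorized access of half-precision (2-byte) elements.
constexpr int kAccessBytes = 16;                               // 128-bit access per thread
constexpr int kElementBytes = 2;                               // half precision
constexpr int kElemsPerAccess = kAccessBytes / kElementBytes;  // 8 elements per thread

int main() {
    // The three warp arrangements (rows x cols of threads) from the list above.
    const int arrangements[3][2] = {{4, 8}, {8, 4}, {16, 2}};
    for (const auto& a : arrangements) {
        int tile_rows = a[0];                    // one row of data per row of threads
        int tile_cols = a[1] * kElemsPerAccess;  // each thread covers 8 contiguous half elements
        std::printf("warp %2d x %d -> BaseTile %2d x %2d\n", a[0], a[1], tile_rows, tile_cols);
    }
    return 0;
}
```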

### Storing Tiles in Shared Memory

To ensure an efficient access pattern, we impose the constraint that each thread accesses 128 bits of data, which is the maximum width of a vectorized access instruction. Consequently, the entire warp accesses $32 \times 16 = 512$ bytes, that is, $4 \times 128$ bytes of data. Since 128 bytes is the largest transaction size, the GPU does not serve the warp's access with a single transaction but splits it into four transactions, and bank conflicts occur per transaction. Our objective is to avoid bank conflicts when loading data tiles from, or storing data tiles to, shared memory.
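
The numbers above can be made concrete with a small back-of-the-envelope sketch. The constants are illustrative, not a library API; the bank count and bank width are the usual values on current NVIDIA GPUs and are stated here as assumptions.

```cpp
#include <cstdio>

// Back-of-the-envelope sketch with illustrative constants (not a library API).
// The bank count and bank width are assumed to be the common values
// (32 banks, 4 bytes each) on current NVIDIA GPUs.
constexpr int kThreadsPerWarp   = 32;
constexpr int kAccessBytes      = 16;   // 128-bit vectorized access per thread
constexpr int kTransactionBytes = 128;  // largest transaction size
constexpr int kNumBanks         = 32;
constexpr int kBankWidthBytes   = 4;

int main() {
    int bytes_per_warp = kThreadsPerWarp * kAccessBytes;             // 32 * 16 = 512 bytes
    int transactions = bytes_per_warp / kTransactionBytes;           // 512 / 128 = 4 transactions
    int threads_per_transaction = kTransactionBytes / kAccessBytes;  // 8 threads per transaction

    std::printf("bytes accessed per warp : %d\n", bytes_per_warp);
    std::printf("transactions per warp   : %d\n", transactions);
    std::printf("threads per transaction : %d\n", threads_per_transaction);
    // One 128-byte transaction spans all 32 banks exactly once (32 * 4 bytes),
    // so bank conflicts are reasoned about within each transaction.
    std::printf("bytes covering all banks: %d\n", kNumBanks * kBankWidthBytes);
    return 0;
}
```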

### The Swizzle Function and Swizzled Layout
