Commit

Initialize the document for shared memory layout.
lcy-seso committed Jan 27, 2025
1 parent 25b3ce8 commit ac15cc1
Showing 3 changed files with 28 additions and 2 deletions.
4 changes: 2 additions & 2 deletions benchmarks/cpp/g2s_copy/README.md
@@ -7,15 +7,15 @@ This preliminary test evaluates the performance of transferring a row-major data
Performance is assessed based on the total time required to complete the 100 data tile transfers.

### Implementations
The test includes implementations using TileFusion and CUTLASS, with no bank conflicts observed in the NVIDIA Compute Utility.
The test includes implementations using TileFusion and cutlass, with no bank conflicts observed in the NVIDIA Compute Utility. The cutlass implementation uses a copy plan that maximizes global memory coalescing.

### Test Environment
- **GPU**: NVIDIA Tesla A100
- **CUDA Version**: 12.6

### Results

| Shape | Warp Layout | TileFusion (ms) | CUTLASS (ms) | Ratio |
| Shape | Warp Layout | TileFusion (ms) | cutlass (ms) | Ratio |
|:--------------------|:-----------:|:---------------:|:------------:|:------:|
| RowMajor (64, 64) | (1, 1) | 0.05044 | 0.05058 | 0.9974 |
| RowMajor (64, 64) | (2, 2) | 0.05309 | 0.05085 | 1.044 |
1 change: 1 addition & 0 deletions docs/_static/README.md
@@ -0,0 +1 @@
[TBD]
25 changes: 25 additions & 0 deletions docs/tiles_in_shared_memory.md
@@ -0,0 +1,25 @@
## Data Layout for Efficient Shared Memory Access

### A Base Tile

A `BaseTile` is a two-dimensional collection of data accessed cooperatively by threads within a single warp, with each thread issuing a single data access instruction.

Let’s consider some specific examples. Suppose each thread accesses 128 bits of data in a single access, and the threads within the warp are arranged in row-major order, so that threads along a row have consecutive thread indices. The two lists below give the resulting `BaseTile` dimensions, and the sketch after them shows how the numbers are derived.

If the data is in ***half-precision*** floating-point format:

- When the threads in a warp are arranged in a $4 \times 8$ configuration, the `BaseTile` has dimensions of $4 \times 64$.
- When the threads in a warp are arranged in an $8 \times 4$ configuration, the `BaseTile` has dimensions of $8 \times 32$.
- When the threads in a warp are arranged in a $16 \times 2$ configuration, the `BaseTile` has dimensions of $16 \times 16$.

Now, suppose the data is in ***single-precision*** floating-point format:

- When the threads in a warp are arranged in a $4 \times 8$ configuration, the `BaseTile` has dimensions of $4 \times 32$.
- When the threads in a warp are arranged in an $8 \times 4$ configuration, the `BaseTile` has dimensions of $8 \times 16$.
- When the threads in a warp are arranged in a $16 \times 2$ configuration, the `BaseTile` has dimensions of $16 \times 8$.
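
In each case the `BaseTile` shape follows directly from the access width: a 128-bit access covers 16 bytes, which is 8 half-precision or 4 single-precision elements, so a warp arranged as $R \times C$ threads covers a `BaseTile` with $R$ rows and $C$ times that many columns. The compile-time sketch below makes this explicit; the struct and its member names are illustrative, not TileFusion's actual API.

```cpp
#include <cuda_fp16.h>  // for __half

#include <cstdio>

// Illustrative only: the BaseTile shape covered by one warp, given the warp's
// thread arrangement and the element type. Each thread issues a single
// 128-bit (16-byte) vectorized access.
template <typename Element, int kThreadRows, int kThreadCols>
struct BaseTileShape {
    static_assert(kThreadRows * kThreadCols == 32, "a warp has 32 threads");
    static constexpr int kAccessBytes = 16;  // one 128-bit access per thread
    static constexpr int kElemsPerAccess = kAccessBytes / sizeof(Element);
    static constexpr int kRows = kThreadRows;
    static constexpr int kCols = kThreadCols * kElemsPerAccess;
};

int main() {
    printf("half,  4 x 8 warp: %d x %d\n", BaseTileShape<__half, 4, 8>::kRows,
           BaseTileShape<__half, 4, 8>::kCols);  // 4 x 64
    printf("float, 8 x 4 warp: %d x %d\n", BaseTileShape<float, 8, 4>::kRows,
           BaseTileShape<float, 8, 4>::kCols);  // 8 x 16
    return 0;
}
```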

### Storing Tiles in Shared Memory

To ensure an efficient access pattern, we constrain each thread to access 128 bits (16 bytes) of data, the maximum width of a vectorized access instruction. The entire warp therefore accesses $32 \times 16 = 512$ bytes, that is, $4 \times 128$ bytes. Since 128 bytes is the largest transaction size, a warp loading or storing more than 128 bytes is not served by a single transaction: a 512-byte warp access is divided into four transactions, and bank conflicts occur per transaction. Our objective is to avoid bank conflicts when loading data tiles from, or storing data tiles to, shared memory.
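
To make this concrete, the host-side sketch below spells out the numbers; the constants it assumes (32 threads per warp and 32 shared memory banks of 4 bytes each) are the usual values for recent NVIDIA GPUs and are not taken from TileFusion.

```cpp
#include <cstdio>

// Assumed constants: 32 threads per warp and 32 shared memory banks of
// 4 bytes each, so one transaction covers 32 * 4 = 128 bytes.
constexpr int kThreadsPerWarp = 32;
constexpr int kBankBytes = 4;
constexpr int kNumBanks = 32;
constexpr int kTransactionBytes = kNumBanks * kBankBytes;  // 128

// Bank touched by a given byte offset into shared memory.
constexpr int bank_of(int byte_offset) {
    return (byte_offset / kBankBytes) % kNumBanks;
}

int main() {
    constexpr int kAccessBytes = 16;  // one 128-bit access per thread
    constexpr int kWarpBytes = kThreadsPerWarp * kAccessBytes;     // 512
    constexpr int kTransactions = kWarpBytes / kTransactionBytes;  // 4
    printf("warp access: %d bytes -> %d transactions\n", kWarpBytes,
           kTransactions);

    // With a contiguous row-major layout, thread t reads bytes
    // [16 * t, 16 * t + 16); threads 0..7 fill the first 128-byte transaction
    // and together touch all 32 banks exactly once, so this pattern is
    // conflict free.
    for (int t = 0; t < 8; ++t)
        printf("thread %d -> banks %d..%d\n", t, bank_of(16 * t),
               bank_of(16 * t + 12));
    return 0;
}
```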

### The Swizzle Function and Swizzled Layout
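
A widely used form of swizzle function, of which CUTLASS's `cute::Swizzle` is a well-known instance, XORs selected bits of an element's row index into its column index, so that rows that would otherwise start in the same bank are shifted to different banks. The sketch below illustrates the idea; the function name and its parameter are hypothetical, not TileFusion's definition.

```cpp
#include <cstdio>

// Illustrative swizzle: permute the column index of a 16-byte cell by XOR-ing
// in the low bits of its row index. kBits column bits take part in the
// permutation; the name and parameter are hypothetical.
template <int kBits>
constexpr int swizzle_col(int row, int col) {
    return col ^ (row & ((1 << kBits) - 1));
}

int main() {
    // For an 8 x 8 grid of 16-byte cells, print the swizzled column of each
    // cell. XOR with a row-dependent constant is a bijection on the columns
    // of each row, so every row still occupies distinct columns after
    // swizzling.
    for (int r = 0; r < 8; ++r) {
        for (int c = 0; c < 8; ++c) printf("%d ", swizzle_col<3>(r, c));
        printf("\n");
    }
    return 0;
}
```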
