Commit

Initialize the document for shared memory layout.
lcy-seso committed Jan 27, 2025
1 parent 25b3ce8 commit ac15cc1
Showing 3 changed files with 28 additions and 2 deletions.
4 changes: 2 additions & 2 deletions benchmarks/cpp/g2s_copy/README.md
@@ -7,15 +7,15 @@ This preliminary test evaluates the performance of transferring a row-major data
Performance is assessed based on the total time required to complete the 100 data tile transfers.

### Implementations
The test includes implementations using TileFusion and CUTLASS, with no bank conflicts observed in the NVIDIA Compute Utility.
The test includes implementations using TileFusion and cutlass, with no bank conflicts observed in the NVIDIA Compute Utility. The cutlass implementation uses a copy plan that maximizes global memory coalescing.

### Test Environment
- **GPU**: NVIDIA Tesla A100
- **CUDA Version**: 12.6

### Results

| Shape | Warp Layout | TileFusion (ms) | CUTLASS (ms) | Ratio |
| Shape | Warp Layout | TileFusion (ms) | cutlass (ms) | Ratio |
|:--------------------|:-----------:|:---------------:|:------------:|:------:|
| RowMajor (64, 64) | (1, 1) | 0.05044 | 0.05058 | 0.9974 |
| RowMajor (64, 64) | (2, 2) | 0.05309 | 0.05085 | 1.044 |
1 change: 1 addition & 0 deletions docs/_static/README.md
@@ -0,0 +1 @@
[TBD]
25 changes: 25 additions & 0 deletions docs/tiles_in_shared_memory.md
@@ -0,0 +1,25 @@
## Data Layout for Efficient Shared Memory Access

### A Base Tile

A `BaseTile` is a two-dimensional collection of data accessed cooperatively by threads within a single warp, with each thread issuing a single data access instruction.

Let’s consider some specific examples. Suppose each thread accesses 128 bits of data in a single access, and the threads within the warp are arranged in row-major order, so that threads along a row have consecutive thread indices. The two lists below give the resulting `BaseTile` dimensions, and the sketch after them shows how the numbers are derived.

If the data is in ***half-precision*** floating-point format:

- When the threads in a warp are arranged in a $4 \times 8$ configuration, the `BaseTile` has dimensions of $4 \times 64$.
- When the threads in a warp are arranged in an $8 \times 4$ configuration, the `BaseTile` has dimensions of $8 \times 32$.
- When the threads in a warp are arranged in a $16 \times 2$ configuration, the `BaseTile` has dimensions of $16 \times 16$.

Now, suppose the data is in ***single-precision*** floating-point format:

- When the threads in a warp are arranged in a $4 \times 8$ configuration, the `BaseTile` has dimensions of $4 \times 32$.
- When the threads in a warp are arranged in an $8 \times 4$ configuration, the `BaseTile` has dimensions of $8 \times 16$.
- When the threads in a warp are arranged in a $16 \times 2$ configuration, the `BaseTile` has dimensions of $16 \times 8$.
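
In each case the `BaseTile` shape follows directly from the access width: a 128-bit access covers 16 bytes, which is 8 half-precision or 4 single-precision elements, so a warp arranged as $R \times C$ threads covers a `BaseTile` with $R$ rows and $C$ times that many columns. The compile-time sketch below makes this explicit; the struct and its member names are illustrative, not TileFusion's actual API.

```cpp
#include <cuda_fp16.h>  // for __half

#include <cstdio>

// Illustrative only: the BaseTile shape covered by one warp, given the warp's
// thread arrangement and the element type. Each thread issues a single
// 128-bit (16-byte) vectorized access.
template <typename Element, int kThreadRows, int kThreadCols>
struct BaseTileShape {
    static_assert(kThreadRows * kThreadCols == 32, "a warp has 32 threads");
    static constexpr int kAccessBytes = 16;  // one 128-bit access per thread
    static constexpr int kElemsPerAccess = kAccessBytes / sizeof(Element);
    static constexpr int kRows = kThreadRows;
    static constexpr int kCols = kThreadCols * kElemsPerAccess;
};

int main() {
    printf("half,  4 x 8 warp: %d x %d\n", BaseTileShape<__half, 4, 8>::kRows,
           BaseTileShape<__half, 4, 8>::kCols);  // 4 x 64
    printf("float, 8 x 4 warp: %d x %d\n", BaseTileShape<float, 8, 4>::kRows,
           BaseTileShape<float, 8, 4>::kCols);  // 8 x 16
    return 0;
}
```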

### Storing Tiles in Shared Memory

To ensure an efficient access pattern, we constrain each thread to access 128 bits (16 bytes) of data, the maximum width of a vectorized access instruction. The entire warp therefore accesses $32 \times 16 = 512$ bytes, that is, $4 \times 128$ bytes. Since 128 bytes is the largest transaction size, a warp loading or storing more than 128 bytes is not served by a single transaction: a 512-byte warp access is divided into four transactions, and bank conflicts occur per transaction. Our objective is to avoid bank conflicts when loading data tiles from, or storing data tiles to, shared memory.
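
To make this concrete, the host-side sketch below spells out the numbers; the constants it assumes (32 threads per warp and 32 shared memory banks of 4 bytes each) are the usual values for recent NVIDIA GPUs and are not taken from TileFusion.

```cpp
#include <cstdio>

// Assumed constants: 32 threads per warp and 32 shared memory banks of
// 4 bytes each, so one transaction covers 32 * 4 = 128 bytes.
constexpr int kThreadsPerWarp = 32;
constexpr int kBankBytes = 4;
constexpr int kNumBanks = 32;
constexpr int kTransactionBytes = kNumBanks * kBankBytes;  // 128

// Bank touched by a given byte offset into shared memory.
constexpr int bank_of(int byte_offset) {
    return (byte_offset / kBankBytes) % kNumBanks;
}

int main() {
    constexpr int kAccessBytes = 16;  // one 128-bit access per thread
    constexpr int kWarpBytes = kThreadsPerWarp * kAccessBytes;     // 512
    constexpr int kTransactions = kWarpBytes / kTransactionBytes;  // 4
    printf("warp access: %d bytes -> %d transactions\n", kWarpBytes,
           kTransactions);

    // With a contiguous row-major layout, thread t reads bytes
    // [16 * t, 16 * t + 16); threads 0..7 fill the first 128-byte transaction
    // and together touch all 32 banks exactly once, so this pattern is
    // conflict free.
    for (int t = 0; t < 8; ++t)
        printf("thread %d -> banks %d..%d\n", t, bank_of(16 * t),
               bank_of(16 * t + 12));
    return 0;
}
```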

### The Swizzle Function and Swizzled Layout
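
A widely used form of swizzle function, of which CUTLASS's `cute::Swizzle` is a well-known instance, XORs selected bits of an element's row index into its column index, so that rows that would otherwise start in the same bank are shifted to different banks. The sketch below illustrates the idea; the function name and its parameter are hypothetical, not TileFusion's definition.

```cpp
#include <cstdio>

// Illustrative swizzle: permute the column index of a 16-byte cell by XOR-ing
// in the low bits of its row index. kBits column bits take part in the
// permutation; the name and parameter are hypothetical.
template <int kBits>
constexpr int swizzle_col(int row, int col) {
    return col ^ (row & ((1 << kBits) - 1));
}

int main() {
    // For an 8 x 8 grid of 16-byte cells, print the swizzled column of each
    // cell. XOR with a row-dependent constant is a bijection on the columns
    // of each row, so every row still occupies distinct columns after
    // swizzling.
    for (int r = 0; r < 8; ++r) {
        for (int c = 0; c < 8; ++c) printf("%d ", swizzle_col<3>(r, c));
        printf("\n");
    }
    return 0;
}
```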
