Initialize the document for shared memory layout.
lcy-seso committed Jan 27, 2025
1 parent 25b3ce8 commit 04bd051
Showing 3 changed files with 20 additions and 2 deletions.
4 changes: 2 additions & 2 deletions benchmarks/cpp/g2s_copy/README.md
@@ -7,15 +7,15 @@ This preliminary test evaluates the performance of transferring a row-major data
Performance is assessed based on the total time required to complete the 100 data tile transfers.

### Implementations
The test includes implementations using TileFusion and CUTLASS, with no bank conflicts observed in the NVIDIA Compute Utility.
The test includes implementations using TileFusion and cutlass, with no bank conflicts observed in the NVIDIA Compute Utility. The cutlass implementation uses a copy plan that maximizes global memory coalescing to make the best use of global memory bandwidth.
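
For readers unfamiliar with what such a copy plan looks like, here is a minimal, hypothetical sketch in cutlass's CuTe dialect. It is not the benchmark's actual code; the thread and value layouts are assumptions chosen so that threads with consecutive IDs read consecutive 128-bit chunks of a row-major tile, which is what yields fully coalesced global loads.

```cpp
// nvcc -I<cutlass>/include -std=c++17 -c g2s_copy_plan_sketch.cu
#include <cute/tensor.hpp>

using namespace cute;

// Each thread copies one 128-bit vector of half-precision elements per instruction.
using CopyAtom = Copy_Atom<UniversalCopy<uint128_t>, half_t>;

// 32 threads arranged 4 x 8 in row-major order, each owning a 1 x 8 block of
// half elements (8 x 16 bits = 128 bits), so consecutive threads read
// consecutive 128-bit chunks of a row-major tile, i.e., fully coalesced loads.
inline auto make_g2s_copy_plan() {
    return make_tiled_copy(CopyAtom{},
                           Layout<Shape<_4, _8>, Stride<_8, _1>>{},  // thread layout
                           Layout<Shape<_1, _8>>{});                 // value layout
}
```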

### Test Environment
- **GPU**: NVIDIA Tesla A100
- **CUDA Version**: 12.6

### Results

| Shape | Warp Layout | TileFusion (ms) | CUTLASS (ms) | Ratio |
| Shape | Warp Layout | TileFusion (ms) | cutlass (ms) | Ratio (TileFusion/cutlass) |
|:--------------------|:-----------:|:---------------:|:------------:|:------:|
| RowMajor (64, 64) | (1, 1) | 0.05044 | 0.05058 | 0.9974 |
| RowMajor (64, 64) | (2, 2) | 0.05309 | 0.05085 | 1.044 |
1 change: 1 addition & 0 deletions docs/_static/README.md
@@ -0,0 +1 @@
[TBD]
17 changes: 17 additions & 0 deletions docs/tiles_in_shared_memory.md
@@ -0,0 +1,17 @@
## Data Layout for Efficient Shared Memory Access

### A Base Tile

First, we define a `BaseTile`, which is a two-dimensional collection of data that threads within a single warp access cooperatively, with each thread issuing a single data access instruction.

Let's look at a few examples. Suppose each thread accesses 128 bits of data in a single access and the data is in half-precision floating-point format, so each thread touches $128 / 16 = 8$ elements per access (a short sketch after the list below computes the resulting shapes).

- If the threads in a warp are arranged in a $4 \times 8$ fashion, then the `BaseTile` would have a shape of $4 \times 64$.
- If the threads in a warp are arranged in an $8 \times 4$ fashion, then the `BaseTile` would have a shape of $8 \times 32$.
- If the threads in a warp are arranged in a $16 \times 2$ fashion, then the `BaseTile` would have a shape of $16 \times 16$.
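
The mapping from warp arrangement to `BaseTile` shape is simple arithmetic. Below is a minimal, self-contained C++ sketch (not TileFusion's actual API; the constant names are made up for illustration) that reproduces the three shapes above, assuming 128-bit accesses of half-precision elements.

```cpp
#include <cstdio>

// Minimal sketch, not TileFusion's API: derive a BaseTile's shape from the
// warp's thread arrangement, assuming each thread issues one 128-bit
// (16-byte) vectorized access of half-precision (2-byte) elements.
constexpr int kAccessBytes = 16;                               // 128-bit access per thread
constexpr int kElementBytes = 2;                               // half precision
constexpr int kElemsPerAccess = kAccessBytes / kElementBytes;  // 8 elements per thread

int main() {
    // The three warp arrangements (rows x cols of threads) from the list above.
    const int arrangements[3][2] = {{4, 8}, {8, 4}, {16, 2}};
    for (const auto& a : arrangements) {
        int tile_rows = a[0];                    // one row of data per row of threads
        int tile_cols = a[1] * kElemsPerAccess;  // each thread covers 8 contiguous half elements
        std::printf("warp %2d x %d -> BaseTile %2d x %2d\n", a[0], a[1], tile_rows, tile_cols);
    }
    return 0;
}
```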

### Storing Tiles in Shared Memory

To ensure an efficient access pattern, we impose the constraint that each thread accesses 128 bits of data, which is the maximum width of a vectorized access instruction. Consequently, the entire warp accesses $32 \times 16 = 512$ bytes, that is, $4 \times 128$ bytes of data. Since 128 bytes is the largest transaction size, the GPU does not serve the warp's access with a single transaction but splits it into four transactions, and bank conflicts occur per transaction. Our objective is to avoid bank conflicts when loading data tiles from, or storing data tiles to, shared memory.
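
The numbers above can be made concrete with a small back-of-the-envelope sketch. The constants are illustrative, not a library API; the bank count and bank width are the usual values on current NVIDIA GPUs and are stated here as assumptions.

```cpp
#include <cstdio>

// Back-of-the-envelope sketch with illustrative constants (not a library API).
// The bank count and bank width are assumed to be the common values
// (32 banks, 4 bytes each) on current NVIDIA GPUs.
constexpr int kThreadsPerWarp   = 32;
constexpr int kAccessBytes      = 16;   // 128-bit vectorized access per thread
constexpr int kTransactionBytes = 128;  // largest transaction size
constexpr int kNumBanks         = 32;
constexpr int kBankWidthBytes   = 4;

int main() {
    int bytes_per_warp = kThreadsPerWarp * kAccessBytes;             // 32 * 16 = 512 bytes
    int transactions = bytes_per_warp / kTransactionBytes;           // 512 / 128 = 4 transactions
    int threads_per_transaction = kTransactionBytes / kAccessBytes;  // 8 threads per transaction

    std::printf("bytes accessed per warp : %d\n", bytes_per_warp);
    std::printf("transactions per warp   : %d\n", transactions);
    std::printf("threads per transaction : %d\n", threads_per_transaction);
    // One 128-byte transaction spans all 32 banks exactly once (32 * 4 bytes),
    // so bank conflicts are reasoned about within each transaction.
    std::printf("bytes covering all banks: %d\n", kNumBanks * kBankWidthBytes);
    return 0;
}
```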

### The Swizzle Function and Swizzled Layout
