Initialize the document for shared memory layout.
Showing 3 changed files with 20 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
[TBD] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
## Data Layout for Efficient Shared Memory Access | ||
|
||
### A Base Tile | ||
|
||
First, we define a `BaseTile`, which is a two-dimensional collection of data that threads within a single warp access cooperatively, with each thread issuing a single data access instruction. | ||
|
||
Let's take some examples. Suppose each thread accesses 128 bits of data in a single access, and the data is in half-precision (16-bit) floating-point format, so each thread covers 8 contiguous elements. | ||
|
||
- If the threads in a warp are arranged in a $4 \times 8$ fashion, then the `BaseTile` would have a shape of $4 \times 64$. | ||
- If the threads in a warp are arranged in a $8 \times 4$ fashion, then the `BaseTile` would have a shape of $8 \times 32$. | ||
- If the threads in a warp are arranged in a $16 \times 2$ fashion, then the `BaseTile` would have a shape of $16 \times 16$. | ||
|
||
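The shapes in the list above follow from simple arithmetic, which the sketch below spells out. It is only an illustration: `TileShape` and `base_tile_shape` are hypothetical names not taken from the original document, and the code assumes that the elements accessed by one thread are contiguous along a row.

```cpp
#include <cassert>

// Hypothetical helper type for illustration; not part of the original document.
struct TileShape {
    int rows;
    int cols;
};

// Shape of the tile covered when every thread of a warp arranged as
// thread_rows x thread_cols issues one access of access_bits bits
// on elements that are element_bits wide.
// Assumption: each thread's elements are contiguous along a row.
constexpr TileShape base_tile_shape(int thread_rows, int thread_cols,
                                    int access_bits = 128,
                                    int element_bits = 16) {
    // One thread covers access_bits / element_bits elements (8 for 128-bit / half).
    return TileShape{thread_rows, thread_cols * (access_bits / element_bits)};
}

// The three arrangements listed above, checked at compile time.
static_assert(base_tile_shape(4, 8).rows == 4 && base_tile_shape(4, 8).cols == 64,
              "4 x 8 threads should give a 4 x 64 tile");
static_assert(base_tile_shape(8, 4).rows == 8 && base_tile_shape(8, 4).cols == 32,
              "8 x 4 threads should give an 8 x 32 tile");
static_assert(base_tile_shape(16, 2).rows == 16 && base_tile_shape(16, 2).cols == 16,
              "16 x 2 threads should give a 16 x 16 tile");
```

In every case the tile holds 32 threads times 8 elements, that is, 256 half-precision values; only the aspect ratio changes with the thread arrangement.
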
### Storing Tiles in Shared Memory | ||
|
||
To ensure an efficient access pattern, we assume that each thread accesses 128 bits (16 bytes) of data, which is the maximum width of a vectorized access instruction. Consequently, a warp of 32 threads accesses $32 \times 16 = 4 \times 128$ bytes of data. The largest transaction size is 128 bytes. When a warp loads or stores more than 128 bytes, the GPU cannot issue a single transaction; in this case, it divides the data access into four transactions. Bank conflicts are determined within each transaction, not across the whole warp. Our objective is to avoid bank conflicts when loading data tiles from or storing data tiles to shared memory. | ||
|
||
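The transaction count in the paragraph above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the document's implementation: all names are hypothetical, and the last helper additionally assumes that the transactions split the warp evenly into groups of consecutive threads, which the text above does not state.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;              // threads per warp
constexpr int kMaxTransactionBytes = 128;  // largest transaction size, as stated above

// Total bytes touched by one warp when every thread accesses access_bits bits.
constexpr int bytes_per_warp(int access_bits) {
    return kWarpSize * access_bits / 8;
}

// Number of transactions needed to cover the warp's access (rounded up).
constexpr int num_transactions(int access_bits) {
    return (bytes_per_warp(access_bits) + kMaxTransactionBytes - 1) /
           kMaxTransactionBytes;
}

// Assumption: transactions split the warp evenly into groups of threads.
constexpr int threads_per_transaction(int access_bits) {
    return kWarpSize / num_transactions(access_bits);
}

static_assert(bytes_per_warp(128) == 4 * 128, "32 threads x 16 bytes = 512 bytes");
static_assert(num_transactions(128) == 4, "512 bytes need four 128-byte transactions");
```

Under the even-split assumption, `threads_per_transaction(128)` evaluates to 8, so a conflict-free layout only has to keep the 8 threads that share a transaction on distinct banks.
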
### The Swizzle Function and Swizzled Layout |