In the original implementation, a $16 \times 16$ tile was used as the basic unit for swizzling and stored as a nested layout in shared memory, with each swizzle block stored contiguously. This allows `Swizzle<2, 3, 3>` to be used for address remapping to avoid bank conflicts. The original layout is shown in the following diagram:
However, for the copy from GMEM to SMEM, we need to copy 128 bytes along the column dimension to achieve maximum memory access coalescing. In this case, the $16 \times 16$ matrix can no longer be tiled in contiguous memory.
In the new design, we plan to use `Swizzle<3, 3, 3>` as the basic swizzle block, which spreads the threads of a single memory transaction across different banks.
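As a rough illustration of this remapping, the sketch below reimplements the XOR pattern in plain C++, assuming the CuTe-style convention for `Swizzle<B, M, S>` with a positive shift (the `B` bits starting at bit `M + S` are XORed into the `B` bits starting at bit `M`). It is not the library code: `swizzle` and `column_is_conflict_free` are hypothetical names, and the check assumes 2-byte elements, so that 8 elements form a 16-byte chunk and a 64-element row spans 128 bytes.

```cpp
#include <cassert>
#include <cstdint>

// XOR swizzle modelled on the Swizzle<B, M, S> convention (positive shift only):
// the B bits starting at bit M + S are XORed into the B bits starting at bit M.
// Offsets are in elements, not bytes.
template <int B, int M, int S>
constexpr std::uint32_t swizzle(std::uint32_t offset) {
    return offset ^ ((offset & (((1u << B) - 1u) << (M + S))) >> S);
}

// Assumption: 2-byte elements, rows of 64 elements (128 bytes), 8 rows per
// swizzle block. Returns true if the 8 rows of one logical 16-byte column
// chunk land in 8 distinct physical chunks after Swizzle<3, 3, 3>.
inline bool column_is_conflict_free(std::uint32_t chunk) {
    assert(chunk < 8);
    std::uint32_t seen = 0;
    for (std::uint32_t row = 0; row < 8; ++row) {
        const std::uint32_t phys = swizzle<3, 3, 3>(row * 64 + chunk * 8);
        seen |= 1u << ((phys >> 3) & 7u);  // physical chunk index within the row
    }
    return seen == 0xFFu;  // all 8 chunks were hit exactly once
}
```

Without the swizzle, the 8 rows of a given column chunk are 128 bytes apart and would map to the same banks; with it, row `r` of chunk `c` moves to chunk `c ^ r`, so the accesses are spread across the full 128-byte bank range.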
In this scenario, a $16 \times 16$ matrix of data will be distributed across two `Swizzle<3, 3, 3>` blocks. Therefore, the shared memory tile needs to keep track of how many swizzle blocks it contains (using `[kTM, kTN]` as the shape of the shared memory tile) and map each access to the correct swizzle block when indexing with `(i, j)`.
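One possible shape for this bookkeeping is sketched below. It is only an illustration under stated assumptions, not the planned implementation: `SwizzledTile`, its members, the row-major ordering of swizzle blocks, and the $8 \times 64$ element block shape (2-byte elements, 128 bytes per row) are all hypothetical choices made for the example.

```cpp
#include <cassert>
#include <cstdint>

// XOR swizzle following the Swizzle<B, M, S> convention (positive shift only).
template <int B, int M, int S>
constexpr std::uint32_t swizzle(std::uint32_t offset) {
    return offset ^ ((offset & (((1u << B) - 1u) << (M + S))) >> S);
}

// Hypothetical shared-memory tile of shape [kTM, kTN], built from
// Swizzle<3, 3, 3> blocks of 8 x 64 elements. Each swizzle block is stored
// contiguously, and blocks are laid out in row-major order (an assumption).
template <int kTM, int kTN>
struct SwizzledTile {
    static constexpr int kBlockRows = 8;
    static constexpr int kBlockCols = 64;
    static_assert(kTM % kBlockRows == 0 && kTN % kBlockCols == 0,
                  "tile shape must be a multiple of the swizzle block shape");

    static constexpr int kBlocksPerRow = kTN / kBlockCols;
    static constexpr int kNumBlocks = (kTM / kBlockRows) * kBlocksPerRow;
    static constexpr int kBlockNumel = kBlockRows * kBlockCols;

    // Maps logical coordinates (i, j) to a physical element offset.
    static std::uint32_t offset(int i, int j) {
        assert(i >= 0 && i < kTM && j >= 0 && j < kTN);
        // Which swizzle block the element falls into.
        const int block_id = (i / kBlockRows) * kBlocksPerRow + (j / kBlockCols);
        // Row-major offset inside that block, before swizzling.
        const std::uint32_t in_block =
            static_cast<std::uint32_t>((i % kBlockRows) * kBlockCols + (j % kBlockCols));
        return static_cast<std::uint32_t>(block_id * kBlockNumel) +
               swizzle<3, 3, 3>(in_block);
    }
};
```

Under these assumptions, `SwizzledTile<16, 64>` holds two swizzle blocks, and a $16 \times 16$ sub-matrix starting at `(0, 0)` touches both of them: rows 0 to 7 fall in the first block and rows 8 to 15 in the second.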